Contributions: Cinthia Tanaka, Edesio Alcobaça, Felipe Almeida, Kevin Rossell

One thing is certain: generative AI and Large Language Models (LLMs) have changed the business world forever. Customer service is, without a doubt, one of the most affected areas, and it still has a lot of room for development.

Purple MinDS is a new blog series by Building Nubank, in partnership with Nu’s Data Scientists and Machine Learning Engineers. In the first edition, you can check out our conversation with Kevin Rossell, Nubank’s Lead Data Scientist who’s been deep in the trenches of customer support for over two years. 

We sat down with Kevin to talk about the “Agent Copilot” project — a set of tools designed to supercharge Customer Experience (CX) agents with the power of LLMs. This project isn’t just about tech: it’s about rethinking how we empower agents to do their best work.

In this conversation, Kevin takes us through the project’s inception, the challenges the team faced, and how generative AI stacks up against traditional machine learning methods. If you’re curious about what it takes to bring generative AI into the real world of customer support, you’re in the right place. Keep reading!

What is the “Agent Copilot” project?

The Agent Copilot is a set of LLM-powered tools meant to help CX agents answer customers’ questions better.

It does this by suggesting answers and actions, and by giving agents a summary of the interaction history while they are talking with customers.

How did the project start and how did you prove its value?

This project was born from a brainstorming session at the end of 2022. At first, I wrote a Request for Comments (RFC) describing the idea, exploring some possibilities and making the challenges visible. We then started generating a few samples and evaluating whether the suggestions made any sense.

After this initial validation, we investigated how to integrate it into the back office tool used by the customer support agents. We chose to start with the most capable LLM because we wanted to remove any technology-related doubts, such as: “Is it not working because the model isn’t good enough, or because we didn’t test with the best available model?” This decision led to cost-related discussions during our first large-scale experiment, but it wasn’t a huge problem, as LLM provider costs have been decreasing every couple of months.

What is the difference between this “Agent Copilot” solution and other solutions built to directly answer the customers?

I would say both solutions share a common ground, but the copilot tool allows for “riskier” suggestions because we have the human-in-the-loop component: agents can decide whether or not to accept the suggestion, and they can also edit the suggestion. 

There is also a difference in the content each solution can access. In the Copilot, we can use all the internal information, while a customer-facing solution would likely be restricted to content that is already open to all customers.

How are you “personalizing” off-the-shelf models with Nubank-specific information?

Retrieval-Augmented Generation (RAG), which basically means ingesting your knowledge and inserting it into the prompt, is the industry standard. We currently use a mixed approach that combines RAG and in-context learning for answer suggestions: we show examples of what is expected inside the prompt, and these examples change dynamically thanks to a retrieval step that runs before we query the LLM. We ran some tests that included metadata alongside the conversation in the retrieval step, but we didn’t see a big improvement, so we are keeping things simpler for now.
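As an illustration only, and not a description of Nubank’s actual implementation, the idea of dynamically retrieved in-context examples can be sketched in Python. Everything here is an assumption made for the sketch: the `EXAMPLES` store, the function names, the prompt wording, and the use of simple bag-of-words cosine similarity in place of whatever retrieval method a production system would use (typically embeddings).

```python
import math
import re
from collections import Counter

# Hypothetical store of past customer messages paired with good agent replies.
EXAMPLES = [
    {"customer": "My card was declined at the store",
     "agent": "I'm sorry about that. Let's check your card status together."},
    {"customer": "How do I increase my credit limit?",
     "agent": "You can request a limit review in the app."},
    {"customer": "I forgot my app password",
     "agent": "No problem, you can reset it from the login screen."},
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query, examples, k=2):
    """Return the k stored examples most similar to the customer message."""
    q = Counter(tokenize(query))
    ranked = sorted(
        examples,
        key=lambda ex: cosine(q, Counter(tokenize(ex["customer"]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, examples, k=2):
    """Assemble a few-shot prompt whose examples depend on the query."""
    lines = ["Suggest a reply for the support agent.", ""]
    for ex in retrieve_examples(query, examples, k):
        lines += [f"Customer: {ex['customer']}", f"Agent: {ex['agent']}", ""]
    lines += [f"Customer: {query}", "Agent:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("my card got declined", EXAMPLES))
```

The resulting string would then be sent to whichever LLM is in use. Because the examples are chosen per query, a question about a declined card is shown alongside past card-related conversations rather than a fixed set of examples.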

Fine-tuning is also a possibility, but it brings other challenges: we would need input/output data examples and, in the end, it is yet another model to maintain. There are other state-of-the-art techniques for injecting knowledge into your system, such as Lamini. But I wouldn’t start there, because it is a closed-source third-party product and requires fine-tuning from the very beginning.

What are the main differences when working with generative AI vs working with “traditional” ML (tree-based classification, etc)?

The evaluation process is very different for generative AI applications. The first challenge is building the evaluation dataset. The second is deciding how to evaluate, that is, which metrics to use.

In the first proof of concept, we used human labelers, who were asked whether they liked the suggestion or not. Nowadays, we rely on traditional metrics for NLP/NLU tasks like translation and summarization, such as ROUGE and Jaccard similarity, as well as Ragas, which is LLM-based. But we know these metrics also have limitations.
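To make the overlap-based metrics mentioned above more concrete, here is a minimal Python sketch of Jaccard similarity and a unigram ROUGE-1 F1 score. It is a simplified illustration, assuming whitespace tokenization and lowercasing; real evaluations normally use an established library, and the function names here are hypothetical.

```python
from collections import Counter


def tokens(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def jaccard(a, b):
    """Size of the shared vocabulary divided by the size of the combined one."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def rouge1_f1(candidate, reference):
    """F1 over unigram overlap, with counts clipped to the reference."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    suggestion = "you can reset your password in the app"
    reference = "you can reset the password in the app"
    print(jaccard(suggestion, reference), rouge1_f1(suggestion, reference))
```

Both scores only count shared words, so a suggestion that paraphrases the reference with different vocabulary scores low even if it is correct. That is one of the limitations of this family of metrics.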

How is quality measured in the Agent Copilot?

We collect agents’ feedback about the tool in different ways depending on the task. For summary generation, we collect feedback from agents directly in the interface, where they can vote on whether the summary was useful or not. Although this may seem like the best way to measure quality, asking for feedback can overload agents and hurt their productivity.

To avoid disrupting the agents’ work, we also use implicit feedback to understand how useful our applications are. For answer suggestions, for example, we measure how often our suggestions are used and how much they are edited.
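The two implicit signals described above, how often suggestions are used and how much they are edited, could be computed along these lines. This is a sketch under stated assumptions: the event format (a `suggested` text and the `sent` text, with `None` when the agent discarded the suggestion) and the function names are hypothetical, and character-level similarity from Python’s standard `difflib` is just one possible way to quantify edits.

```python
from difflib import SequenceMatcher


def edit_ratio(suggested, sent):
    """0.0 if the suggestion was sent unchanged, approaching 1.0 if rewritten."""
    return 1.0 - SequenceMatcher(None, suggested, sent).ratio()


def summarize_feedback(events):
    """Aggregate usage rate and mean edit ratio over logged suggestions.

    Each event is a dict with the suggested text and the text the agent
    actually sent; "sent" is None when the suggestion was not used.
    """
    used = [e for e in events if e["sent"] is not None]
    usage_rate = len(used) / len(events) if events else 0.0
    mean_edit = (
        sum(edit_ratio(e["suggested"], e["sent"]) for e in used) / len(used)
        if used
        else 0.0
    )
    return {"usage_rate": usage_rate, "mean_edit_ratio": mean_edit}


if __name__ == "__main__":
    log = [
        {"suggested": "You can reset it in the app.",
         "sent": "You can reset it in the app."},
        {"suggested": "Please wait a moment.",
         "sent": "Please wait a moment while I check."},
        {"suggested": "Your limit was updated.", "sent": None},
    ]
    print(summarize_feedback(log))
```

A high usage rate with a low edit ratio suggests the suggestions are being sent nearly as written, while a high edit ratio suggests agents treat them as rough drafts.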

Any tips on how to apply generative AI to a new use case?

I’ve been working full time on this project, together with a group of Engineers and Business Analysts. Things like this can be very time-consuming, so my suggestion is to start small: first, evaluate the use case to see how ambiguous and how complex the task is.

Understanding the use case in detail also helps you choose which model to start with. Every LLM provider (OpenAI, Anthropic, Mistral or others) groups its models into categories (medium-tier, lower-tier…) according to their capabilities. For example, for classification you could get good performance from a lower-tier model, but for math problems you may need the most capable one.

A glimpse into the future

The Agent Copilot project is a glimpse into the future of customer support, where generative AI isn’t just a buzzword but a practical tool that enhances the way we work. Kevin’s journey with this project shows us that implementing such technology isn’t just about having the best model — it’s about thoughtful execution, ongoing learning, and a clear focus on the end-user.

For those looking to bring generative AI into their back office tool for customer support, Kevin’s advice is simple: start with a clear use case, keep it focused, and be prepared to iterate. The tech will keep evolving, but the principles of building something that genuinely helps people remain the same. As long as the development of a product or feature is about making a difference in the day-to-day lives of customer support agents and the customers they serve, companies will be on the right track.
