most read
Software Engineering
Why We Killed Our End-to-End Test Suite Sep 24
Culture & Values
The Spark Of Our Foundation: a letter from our founders Dec 9
Software Engineering
The value of canonicity Oct 30
Careers
We bring together great minds from diverse backgrounds who enable discussion and debate and enhance problem-solving.
Learn more about our careers



Written by: : Aman Gupta, Daniel Braithwaite
In the high-stakes arena of global finance, the margin for error is zero. At Nu, we serve 127 million customers across Brazil, Mexico, and Colombia. For a neo-bank of this scale, the “chatbot” era is over. Our mission is to move beyond simple FAQ retrieval and toward autonomous systems capable of handling complex debt renegotiation, card logistics, and fraud prevention—all while maintaining the absolute trust of a customer base that expects instant, empathetic, and factually perfect support.
My perspective is shaped by over a decade at Apple, Amazon, and LinkedIn, along with a deep commitment to the research community, having published at NeurIPS, ICML, and ICLR. As the creator of Quant Ease—the first quantization algorithm at scale utilizing coordinate descent—and a researcher in LLM preference alignment, I’ve learned that the “magic” of AI agents is actually an infrastructure problem. To build agents that work at a hundred-million-user scale, you must abandon prompt “wizardry” in favor of architectural rigor.
Here are the five hardest lessons we’ve learned building production-grade AI agents.
1. The “Evals-First” Approach: From NPS to TNPS
Traditional metrics are a death trap for AI development. Most organizations rely on the Net Promoter Score (NPS) to judge success, but for high-volume support, NPS is a nightmare. Response rates are notoriously low, feedback is delayed by days, and the sample size lacks the statistical power required for rapid A/B testing.
To scale, you must pivot to TNPS (Transactional Net Promoter Score)—measuring the specific AI interaction immediately—and supplement it with an “LLM-as-a-judge” framework. Instead of waiting for a survey that may never come, we use a secondary, auditor LLM to evaluate every single conversation against binary criteria:
By moving from a tiny sample of surveys to a 100% audit of conversations, we can iterate on variants daily. As the saying goes: “If you cannot measure it, you cannot improve it. So focus on measurement—be fanatical about it.”
Check our job opportunities
2. Redefining the Agent: The React Paradigm
The term “agent” is currently drowning in industry fluff. In a production environment, an agent is not a text-generator; it is an autonomous system following the React (Reasoning and Acting) paradigm. It is a brain with a toolbelt that solves tasks end-to-end.
Architecturally, we define the agent through three distinct layers:
If your system cannot trigger a real-world action—like re-ordering a lost card or updating a credit limit—it is a search interface, not an agent.
3. Stop Writing Prompts: The Era of Prompt Optimization
One of our most counter-intuitive findings is that humans are fundamentally bad at writing prompts for production. Manual prompting is “brittle” and model-dependent. We’ve observed a phenomenon where transitioning to next-generation frontier models (like moving from GPT-4 to newer, more capable reasoning models) can actually break existing agents. This is because smarter models “assume a lot less”—they lose the implicit helpfulness (and the implicit assumptions) your prompt was relying on.
To solve this, we utilize Prompt Optimization and Semantic Versioning. We treat prompts like code, segmenting them into modules (e.g., Tone, Tooling, Safety) that evolve independently. We leverage optimization frameworks like DSPy and Berkeley’s Japa optimizer. The workflow shifts the human’s role:
If you are writing five-page prompts by hand, you are building a system that will break the moment the model provider updates their weights.
4. Don’t Fine-Tune Unless You’re Chasing the Last 5%
There is a common urge to rush into Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). While my research focuses on preference alignment, my strategic advice is the opposite: Do not fine-tune until you have exhausted the capabilities of Frontier Models.
Your real competitive moat is not in the model weights; it is in your Data and Action layers. Fine-tuning introduces significant architectural risks:
Unless you are optimizing for the final 5% of accuracy or specific latency requirements on high-volume, narrow tasks, your engineering hours are better spent building robust tools.
5. Move Logic to Tools: LLM Reasoning is Not Business Logic
Hallucinations are often a symptom of asking the LLM to do deterministic work. If a process has a fixed sequence, do not ask the LLM to “reason” through it.
Consider our Card Delivery case study. Initially, we tried to let the LLM chain three tools: check status \rightarrow verify address \rightarrow trigger re-order. The LLM would frequently hallucinate the sequence or fail mid-chain. We realized we were being “idiots” by making the LLM do the work of a script.
The solution was to build a Composite Tool: a single, deterministic API that handles the entire three-step logic internally. The LLM only needs to decide to “Fix Card Delivery.”
“If it’s deterministic, don’t make the LLM do the work.”
By offloading business logic into the tools, you reduce the reasoning load on the LLM, slash hallucination rates, and ensure the agent remains a reliable interface for your backend services.
The Blueprint: Automated Simulation and Bug-Bashing
Before an agent variant reaches a single customer, it enters a “Bug-Bashing” environment. We have moved away from manual QA toward Automated Simulation and Red-Teaming. Using our internal dashboard and trace system, we simulate thousands of conversations to “attack” the agent.
We monitor the “traces”—the full interaction chain of LLM messages, customer inputs, and tool calls. We use simulation partners to stress-test the agent in ways a human tester might not anticipate, ensuring the agent won’t leak PII or deviate from its financial mandate before it goes live.
Conclusion: The Future of Autonomous Finance
AI agents are not here to replace our human “X-Peers”; they are here to elevate them. By resolving high-volume, predictable cases like FAQ retrieval and card tracking, agents empower our experts to dedicate their expertise to the complex challenges that require deep financial empathy.
This transition, however, demands more than just better models—it requires a fundamental shift in mindset. In banking, where trust takes years to build but seconds to lose, the “magic” of autonomy must be anchored by architectural rigor. Success in this new era belongs to those who treat customer trust as their most valuable asset, ensuring that every automated interaction is governed by a fanatical, systematic measurement of excellence.
Check our job opportunities