At Nubank, innovation in machine learning (ML) and data science (DS) drives our mission to build the Purple Future. Recently, we hosted the 92nd edition of our DS & ML Meetup, themed “Practices to Scale Machine Learning Operations.” This event featured a deep dive into the technical challenges and solutions of building real-time ML systems, with a focus on fraud detection—a domain where speed, accuracy, and scalability are critical.

Led by Otávio Vasques, Lead Machine Learning Engineer, the session explored the architecture, optimization strategies, and deployment practices behind Nubank’s real-time ML models. Key topics included the differences between batch and real-time models, the role of the Model Server, techniques to reduce latency and infrastructure costs, and best practices for testing and deploying models using shadow mode.

In this article, we’ll unpack these insights, offering a behind-the-scenes look at how Nubank scales machine learning operations to protect millions of customers. Whether you’re a data scientist, ML engineer, or simply curious about real-time ML, this post provides actionable lessons for building robust, scalable systems. Let’s dive in!

What are real-time models?

Real-time models differ from traditional batch models in one crucial way: they operate within the infrastructure of services, not just data pipelines. While batch models process large datasets overnight and generate predictions for the next day, real-time models respond to events as they happen. 

This is essential for use cases like fraud detection, where delaying a decision by even a few seconds could mean the difference between stopping a fraudulent transaction and letting it go through.

For example, if someone steals your credit card and tries to make a purchase, a real-time model can flag the transaction immediately, whereas a batch model would only catch it the next day—long after the damage is done.


The architecture of real-time models at Nubank

At Nubank, our real-time models are built on a robust architecture that ensures low latency and high reliability. Here’s how it works:

  1. Model Server: This is the backbone of our real-time ML infrastructure. It’s a combination of Clojure (for interfacing with Nubank’s internal systems) and Python (for running the ML models). The Clojure layer handles authentication, security, and logging, while the Python layer focuses on inference.
  2. Data Flow: In a batch setup, data flows through Directed Acyclic Graphs (DAGs), where tables are processed sequentially. In real-time, we replace these tables with services. Instead of interrupting the flow to process data, we send events to the model server, which returns predictions in milliseconds.
  3. Feature Engineering: Real-time models require features to be computed on the fly. This often involves querying multiple services to gather the necessary data. For example, to detect fraud, we might need to pull customer information, transaction history, and account details—all in real time.
  4. Synchronous vs. Asynchronous Processing: Depending on the use case, real-time models can operate synchronously (responding immediately) or asynchronously (processing events with a slight delay). Fraud detection, for instance, requires synchronous processing to block suspicious transactions instantly.
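To make the synchronous path concrete, here is a minimal sketch of what the Python inference layer of a model server might look like. All names here (`score_event`, `Prediction`, the toy scoring heuristic) are illustrative assumptions, not Nubank's actual code; in the real system a Clojure layer handles authentication and logging around a trained model, not a heuristic.

```python
# Minimal sketch of a synchronous inference handler (hypothetical names).
# In production, `score` would come from a trained model, not a heuristic.
from dataclasses import dataclass

@dataclass
class Prediction:
    score: float   # model output, e.g. estimated fraud probability
    blocked: bool  # synchronous decision derived from the score

def score_event(event: dict, threshold: float = 0.9) -> Prediction:
    """Synchronously score a transaction event and decide whether to block it."""
    amount = event.get("amount", 0.0)
    # Toy stand-in for feature computation + model inference.
    score = min(amount / 10_000.0, 1.0)
    return Prediction(score=score, blocked=score >= threshold)
```

Because the caller waits on this response, everything inside the handler (feature retrieval included) sits on the latency budget of the transaction itself, which motivates the optimizations below.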

Optimizing for scale and speed

Real-time models are resource-intensive, especially when operating at Nubank’s scale. Here are some of the techniques we use to optimize performance:

1. Fragmented vs. Global Deployment

  • Fragmented Deployment: Traditionally, we deployed models in a fragmented manner, with separate instances for different customer segments (or “shards”). While this ensures high availability, it can lead to resource inefficiencies, especially for models that only serve a small subset of customers.
  • Global Deployment: By deploying models globally (i.e., serving all shards from a single set of pods), we’ve reduced infrastructure costs by up to 30% while improving stability. This approach also simplifies scaling, as we no longer need to maintain redundant pods for each shard.

2. Pre-Policy Filtering

  • Not every event requires a model’s attention. By implementing pre-policy rules, we can filter out low-risk events before they even reach the model. For example, in fraud detection, we’ve reduced the number of transactions processed by a model from 2,800 per second to just 20 per second—saving significant computational resources.
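A pre-policy filter can be as simple as a few cheap rules evaluated before inference. The sketch below is illustrative (the rules and thresholds are assumptions, not Nubank's actual policy), but it shows the shape of the idea: most traffic is dismissed without ever touching the model.

```python
# Sketch of a pre-policy filter: cheap rules discard low-risk events
# before they reach the (expensive) model. Rules are illustrative only.
def needs_model(event: dict) -> bool:
    """Return True only for events risky enough to warrant model inference."""
    if event.get("amount", 0.0) < 10.0:       # tiny transactions: auto-approve
        return False
    if event.get("merchant_trusted", False):  # allow-listed merchants
        return False
    return True

events = [
    {"amount": 5.0},
    {"amount": 250.0, "merchant_trusted": True},
    {"amount": 900.0},
]
to_score = [e for e in events if needs_model(e)]  # only 1 of 3 reaches the model
```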

3. Parallelizing Feature Retrieval

  • Real-time models often depend on multiple services to compute features. To minimize latency, we parallelize these requests using tools like Clojure’s future and delay constructs. This ensures that feature retrieval happens concurrently, rather than sequentially.
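The talk describes doing this with Clojure's `future` and `delay`; the same pattern can be sketched in Python with `concurrent.futures`, which plays an analogous role. The three service calls below are stand-ins, but the point carries over: with N independent dependencies, concurrent retrieval costs roughly the latency of the slowest call instead of the sum of all of them.

```python
# Python analogue of parallel feature retrieval (the post describes the
# Clojure `future`/`delay` version; ThreadPoolExecutor fills the same role).
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_customer_info(cid):   # stand-ins for real service calls,
    time.sleep(0.05)            # each simulating ~50ms of network latency
    return {"age_days": 730}

def fetch_txn_history(cid):
    time.sleep(0.05)
    return {"txns_24h": 3}

def fetch_account_details(cid):
    time.sleep(0.05)
    return {"balance": 1200.0}

def gather_features(cid):
    """Fire all service calls concurrently instead of sequentially."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, cid) for f in
                   (fetch_customer_info, fetch_txn_history, fetch_account_details)]
        features = {}
        for fut in futures:
            features.update(fut.result())  # blocks, but calls ran in parallel
    return features
```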

4. Monitoring and Timeouts

  • We closely monitor the response times of all dependencies. If a service starts to degrade, we set timeouts to prevent it from dragging down the entire model. This ensures that we can still make decisions—even if some features are missing.
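One way to sketch this degrade-gracefully behavior: wrap each dependency call in a timeout and fall back to a default (i.e., a missing feature) when the service is slow. The helper and defaults below are hypothetical, not Nubank's implementation.

```python
# Sketch of a per-dependency timeout: a degraded service returns a default
# (missing feature) instead of stalling the whole prediction. Names are
# illustrative, not Nubank's actual infrastructure.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutTimeout

def slow_service():
    time.sleep(1.0)  # simulates a degraded dependency
    return {"risk_signal": 0.8}

def fetch_with_timeout(fn, timeout_s, default):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(fn)
        try:
            return fut.result(timeout=timeout_s)
        except FutTimeout:
            return default  # proceed with the feature marked as missing

features = fetch_with_timeout(slow_service, timeout_s=0.1,
                              default={"risk_signal": None})
```

The model then needs to be trained (or guarded by policy) to tolerate a `None` feature, which is the price of staying available when a dependency degrades.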

Building reliable feature pipelines

Feature engineering is a critical part of any ML model, but it’s especially challenging in real-time systems. Here’s how we ensure consistency and reliability:

  1. Separating short-term and long-term features:
  • Short-term features (e.g., transactions from the last 24 hours) are computed in real time and stored in a “hot” database.
  • Long-term features (e.g., transactions from the last 90 days) are precomputed in batch and served via a feature store.
  2. Feature stores:
  • We use tools like Conrado (a serving layer for batch features) and Feature Store (a Spark Streaming application) to ensure consistency between batch and real-time pipelines. These tools also simplify feature retrieval and reduce the risk of errors.
  3. Custom infrastructure:
  • When existing tools aren’t sufficient, we build our own infrastructure. For example, we’ve created dedicated services to store risk events (e.g., failed login attempts) and transaction histories, enabling us to compute complex features in real time.
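The split between hot and batch features comes together at inference time, when both are joined into a single feature vector. The sketch below uses plain dictionaries as stand-ins for the "hot" database and the batch feature store; the data layout and field names are assumptions for illustration.

```python
# Sketch of merging short-term ("hot") and long-term (batch) features at
# inference time. `hot_db` and `feature_store` are hypothetical stand-ins
# for the real-time database and batch feature store described above.
hot_db = {"cust-1": {"txns_24h": 4, "amount_24h": 310.0}}              # streamed
feature_store = {"cust-1": {"txns_90d": 120, "avg_amount_90d": 87.5}}  # precomputed

def feature_vector(customer_id: str) -> dict:
    """Join real-time and batch features for a single prediction."""
    features = dict(feature_store.get(customer_id, {}))  # long-term first
    features.update(hot_db.get(customer_id, {}))         # overlay fresh signals
    return features
```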

Testing and shadow mode

Deploying real-time models requires rigorous testing to ensure they perform as expected. Here’s our approach:

  1. Integration testing: We run extensive integration tests to cover edge cases, such as empty inputs, incorrect data types, and unexpected errors.
  2. Shadow mode: Before fully deploying a model, we run it in “shadow mode,” where it processes real data but doesn’t affect customer-facing decisions. This allows us to validate the model’s performance and ensure it meets our SLAs (e.g., 700ms for PIX transactions).
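The essence of shadow mode can be sketched in a few lines: the candidate model scores real traffic and its outputs and latency are logged, but only the current model's decision reaches the customer. All names below are illustrative, and the 700ms budget mirrors the PIX SLA mentioned above.

```python
# Sketch of shadow mode: the candidate model is evaluated on live events
# but never acted on; only the current model's decision is returned.
import time

shadow_log = []

def decide(event, current_model, shadow_model, sla_ms=700):
    decision = current_model(event)     # this is what the customer sees
    start = time.perf_counter()
    shadow_score = shadow_model(event)  # evaluated, logged, never acted on
    latency_ms = (time.perf_counter() - start) * 1000
    shadow_log.append({"score": shadow_score,
                       "latency_ms": latency_ms,
                       "within_sla": latency_ms <= sla_ms})
    return decision

outcome = decide({"amount": 40.0},
                 current_model=lambda e: "approve",
                 shadow_model=lambda e: 0.12)
```

Comparing the shadow log against realized outcomes is what lets the team validate both predictive quality and latency SLAs before the new model takes over decisions.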

Final thoughts

Building real-time ML models is a complex but rewarding challenge. At Nubank, we’ve learned that success depends on a combination of robust architecture, careful optimization, and collaboration across teams. While the techniques we’ve developed are tailored to fraud detection, many of the principles—like parallelizing feature retrieval, monitoring dependencies, and testing rigorously—can be applied to other real-time use cases.

As was emphasized during the meetup, not every model needs every optimization. The key is to be critical about your use case, understand your constraints, and focus on the improvements that will deliver the most value. And remember: real-time ML is a team effort. It takes data scientists, engineers, and analysts working together to build systems that are both fast and reliable.

Check our job opportunities