Written by Felipe Almeida and reviewed by Rubens Bolgheroni
Machine learning (ML) is powerful—but expensive and risky. It’s powerful because it’s often the only scalable solution to complex problems. It’s expensive because the resources (specialized professionals and computing power) are expensive. And it’s risky because models break in weird and sometimes silent ways: data drift or broken features will not trigger exceptions, even though model predictions will most likely be garbage.
We wrote about the uses of real-time ML in the past. But before you can think about using ML to help your organization, you need a project to take it from an idea to a functioning piece of software delivering results. And you do not want the project to fail, since that wastes money and time that could have been spent elsewhere in the business. This, of course, assumes you need an ML model in the first place.
Now, even if an ML project does fail, it’s better if it fails quickly—before you have invested too much time and money into it. Figure 1 below shows the different outcomes for a project, depending on whether it delivered value to the business and the time it took to complete.
So why do ML projects fail?
There are many reasons: data problems and misaligned expectations between clients and the modeling team, to name a few.
Real-time ML projects, however, raise the stakes: they have even more failure modes than regular (non-real-time) projects. In addition to all the risks inherent to ML, real-time models are software systems that need to integrate with other real-time systems via APIs, are often subject to strict response-time requirements, and need to keep parity with the training environment.
In this post we will show what a typical real-time ML project looks like and then go over the most common ways these projects can fail. Finally, we will provide practical instructions on how to address, or de-risk, each of those points.
Typical project timeline
As mentioned previously, a project is the process to take an idea and make it reality. In the case of real-time ML, this includes choosing a business problem, understanding if and how it can be solved with ML and, finally, creating a model and integrating it with the underlying business IT infrastructure.
The stages of a typical real-time ML project can be summarized as follows:
Bear in mind that the steps above are not necessarily linear. That is, they need not happen one after the other.
For example, one may need to go back to the ideation stage during the project, to revisit assumptions that turned out to be false. Also, data scientists may need to retrain the model if there’s a problem with a feature which now needs to be dropped. Figure 2 below presents a visual summary:
There are two types of real-time ML projects: introducing a new model and updating an existing model.
This distinction is useful because updating an existing model is less risky than introducing a new model into a business flow. If the way the model is used is the same, you can skip the use-case validation stage—which is where much of the risk lies.
Regardless of whether we are introducing a new model or updating an existing one, all stages of the project pose risks. Let’s now see some of the ways in which real-time ML projects fail.
Real-time ML Failure modes
As mentioned in the introduction, there are levels to failure in ML projects. Having a project fail because the model performance is lacking is bad. But having a project fail after investing significant time into it is much, much worse.
Each stage in a ML project has its unique vulnerabilities. This is especially true of real-time projects, as model deployment and service integration present an additional layer of complexity. Successful real-time ML projects acknowledge the risks—and mitigate them when necessary.
Table 1 lists some of the failure modes for real-time ML projects. Some of them reflect miscommunication and business alignment issues; some reflect modeling problems and others relate to engineering and implementation mishaps. Many of these are also relevant for batch (i.e. non-real-time) ML projects.
Table 1: Non-exhaustive list of failure modes for Real-time ML projects
Now let us go through some practical steps you, as an ML practitioner or project manager, can take to address these. They are not listed in any particular order, and all of them have proven useful for us at Nubank.
De-risk: Educate clients where possible
Risks Addressed: All of them
Many people still see ML as "witchcraft" and hold unrealistic beliefs about what it can do. You can, and should, help clients understand, at least at a high level, how ML works, so they can calibrate their expectations to more realistic levels and help you do your job.
People from sectors such as banking are more used to working with models, but even there it is not wise to assume they understand, even at a high level, what ML is.
Here are three key points all non-technical users should understand about ML:
One good way to help end-users build intuition about ML is to show feature importance plots, such as the beeswarm plot (seen below in Figure 3). This drives interesting discussions with non-technical users and helps them understand how a model score is calculated from its features.
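In practice, a beeswarm plot like the one in Figure 3 usually comes from the shap library. As a lighter, self-contained sketch of the same idea, per-feature importances from a tree model give the ranking you would walk a client through (the feature names and synthetic data below are made up):

```python
# Sketch: ranking features by importance, the data behind plots like Figure 3.
# Feature names are hypothetical; for the per-prediction view shown to clients
# you would typically use shap.plots.beeswarm on your trained model instead.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))  # columns: income, tenure, noise
# income drives the label strongly, tenure weakly, noise not at all
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
for name, imp in zip(["income", "tenure", "noise"], model.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

Walking through output like this ("income dominates, noise contributes nothing") is often enough to demystify how a score is assembled from features.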
De-risk: Make sure the data is good
Risks Addressed: Model performance isn’t good enough | Model performance at inference time is different from training time.
The saying, “garbage in, garbage out”, encapsulates the essence of ML. It’s your responsibility to make sure that the data is good enough for modeling purposes.
As always, real-time ML adds an extra dimension to this problem, so not only do you need to worry about the training data—whether there is enough of it, whether it’s good quality—but also about how to retrieve data at inference time.
We suggest a three-pronged de-risking approach here: (a) asking the correct questions about the data, (b) doing extensive data analysis and (c) establishing a relationship with the real-time data team:
Ask the correct questions about the data
At the start of the project, you want to ask many questions about what data is available and what it looks like. Here are some suggestions:
Do extensive data analysis
When you have access to the data, you need to run the usual EDA routines to check for data quality, and you need to double-check every piece of information you were told about the data, to make sure you aren't misled into making wrong decisions.
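A few such sanity checks can be sketched with pandas on a toy frame (the column names and thresholds here are hypothetical examples, not a fixed checklist):

```python
# Sketch: basic data-quality checks to run before trusting what you were told.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "income": [3500.0, None, 1200.0, -50.0],
})

report = {
    "null_rate": df["income"].isna().mean(),           # were you told "no missing values"?
    "negative_income": int((df["income"] < 0).sum()),  # impossible values slip in silently
    "duplicate_ids": int(df["customer_id"].duplicated().sum()),
}
print(report)
```

Each check targets a claim you may have been given ("every customer has an income on file") that, if false, would silently degrade the model.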
Establish a relationship with the real-time data team as early as possible
The teams responsible for making data available in real time are often not the same as those responsible for data "at rest".
Find out who these people are and keep in touch with them, so you can make sure the information you train the model on will also be available when you need to make real-time predictions. Also make sure you understand how fast or slow such data retrieval is.
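A quick latency probe can answer that last question early. In the sketch below, fetch_feature is a hypothetical stand-in for the real call to the data team's API:

```python
# Sketch: measure retrieval latency of a real-time feature before committing to it.
import time
import statistics

def fetch_feature(customer_id: int) -> float:
    # Stand-in for a network call to the real-time data service.
    time.sleep(0.005)
    return 0.0

latencies = []
for _ in range(20):
    start = time.perf_counter()
    fetch_feature(customer_id=42)
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

# 95th-percentile latency: tail behavior matters more than the average
p95 = statistics.quantiles(latencies, n=20)[18]
print(f"p95 latency: {p95:.1f} ms")
```

If the p95 already blows your response-time budget at this scale, better to find out now than after the model is trained.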
De-risk: Understand how the model predictions will be used
Risks Addressed: Model performance isn’t good enough | Model response time is too high | Use-case doesn’t support probabilistic decision-making.
Imagine crafting a perfectly good model, with great performance, only to see it gather dust. The worst thing that can happen to an ML project is to produce a model that is never actually used to make business decisions.
One of the reasons this happens is that the modeling team didn’t take the time to clearly understand how the model is to be used, and by whom.
A key part of understanding the model use is to discuss the decision layer—the code or business process that will take in the model predictions and decide what action to take based on the predictions. Talking about the decision layer forces the modeling and client teams to discuss how exactly the model predictions will be used—and sort out any misunderstandings in the process.
Figure 4 shows an example of what a very simple decision layer could look like for a simple credit-risk model.
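Such a decision layer can be only a few lines of code. The thresholds and actions below are hypothetical, in the spirit of Figure 4; the point is that this logic lives outside the model and must be agreed on with the client team:

```python
# Sketch of a minimal decision layer for a credit-risk model.
def decision_layer(default_probability: float) -> str:
    """Map a model score to a business action (hypothetical thresholds)."""
    if default_probability < 0.05:
        return "approve"
    if default_probability < 0.20:
        return "manual_review"
    return "deny"

print(decision_layer(0.02))  # low risk
print(decision_layer(0.35))  # high risk
```

Writing this down forces concrete questions: who owns the thresholds, how often are they revisited, and what happens in the middle band?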
In addition to discussing the decision layer, the more questions you ask about the model use-case the higher the chance that you will uncover implicit assumptions that can cause problems later on.
Some of the questions you should ask yourself:
De-risk: Conduct a pre-mortem
Risks Addressed: All of them
A post-mortem meeting takes place after a project has failed. Its objective is to understand the causes of failure so that they can be avoided in the future.
A pre-mortem is similar, but it takes place at the start of a project. It’s a brainstorm-like meeting and its objective is not to understand why a project failed—but to prevent it from failing. It works by having people pretend the project failed and ask them: “why did it fail?”.
There aren’t many rules as to how the meeting should be conducted, as long as it does take place.
At the end of a good pre-mortem session, you should have not only a much better grasp of the risks you were already aware of but also knowledge of hitherto unknown risks you can now assess.
This is why it’s important for the meeting to include people from different backgrounds, such as business experts, other ML practitioners and—most importantly for real-time ML projects—software engineers, to signal potential integration and data problems.
De-risk: Calculate project valuation
Risks Addressed: Model is not economically viable
Every applied ML project should have a tangible impact on the organization. Such impact is often measured in monetary units (USD or equivalent) or other business metrics, such as conversion or engagement, among others.
Calculating a project's valuation means understanding the project's impact in terms of these metrics. It helps de-risk the project because you can find out as early as possible whether the project is economically viable, and adjust course if it isn't.
Wondering when to calculate valuation? We suggest doing it in two cycles:
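An early-cycle valuation can be as simple as a few multiplications. All numbers below are hypothetical placeholders for figures you would get from the business team:

```python
# Sketch: back-of-the-envelope valuation of a real-time ML project.
monthly_decisions = 50_000   # how often the model would be called
value_per_correct = 12.0     # USD gained per improved decision vs. baseline
expected_uplift = 0.03       # fraction of decisions the model improves
monthly_cost = 8_000.0       # infra + on-call + retraining, USD

monthly_value = monthly_decisions * expected_uplift * value_per_correct
print(f"monthly net value: ${monthly_value - monthly_cost:,.0f}")
```

Even a rough estimate like this exposes deal-breakers: if the net value is near zero under optimistic assumptions, the project is not economically viable.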
De-risk: Select features based on importance and implementation effort
Risks Addressed: Feature creep | Chosen features aren’t available at inference time
In batch (i.e. non-real-time) ML models, features are roughly equal in terms of the effort needed to implement them. Sure, some may require a slightly more involved SQL query, some extra joins, but rarely more than that.
Real-time features, however, vary wildly in terms of how much engineering effort they take. Some features may be fetched with a simple HTTP call at inference time. Others may require that you build a service and an endpoint so that the model can use it at inference time.
During feature selection, you must also take into account the work needed to implement a given feature—assuming it is available in the real-time infrastructure in the first place! In Figure 5 below we see a 4-way classification of features depending on the two relevant dimensions for selecting features: predictive power and implementation effort.
As for how to rank features in terms of implementation effort, we suggest starting with the questions in Table 2. As usual, the easiest features are those available from a feature store (with the added benefit that you don't need to worry about train-serve skew).
Table 2: Ease of implementation for real-time features, assuming a synchronous microservice architecture
Once a reasonable set of features is decided upon, the project lead should declare a feature freeze, such that only the selected ones will be included in the project. This decreases the risk of feature creep—a situation where new features keep being added to the model and the project never ends.
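The two-dimensional triage from Figure 5 can be sketched as a simple bucketing rule. The feature names, scores, and cutoffs below are hypothetical:

```python
# Sketch: triage candidate features by predictive power vs. implementation effort.
features = [
    # (name, predictive_power 0-1, implementation_effort 0-1)
    ("credit_score",      0.9, 0.1),  # in the feature store: easy win
    ("last_login_device", 0.2, 0.1),
    ("graph_centrality",  0.8, 0.9),  # needs a new service: valuable but risky
    ("favorite_color",    0.1, 0.9),
]

def bucket(power: float, effort: float) -> str:
    if power >= 0.5:
        return "do_first" if effort < 0.5 else "plan_carefully"
    return "nice_to_have" if effort < 0.5 else "drop"

for name, power, effort in features:
    print(f"{name}: {bucket(power, effort)}")
```

The output of such a triage is a natural input to the feature-freeze discussion: everything in the "drop" bucket is out, and everything in "plan_carefully" gets an explicit engineering estimate.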
De-risk: Address highest-risk features first
Risks Addressed: Model performance at inference time is different from training time | Chosen features aren’t available at inference time | Model performance isn’t good enough
If you follow the maxim “select features based on effort”, you’ll have a rough estimate of the engineering effort needed to implement each feature. Engineering effort estimates, however, are known to be imprecise and hard to get right—somewhat akin to witchcraft.
Usually (but not always) features that take the most effort will also be the riskiest ones to implement. By risky we mean that they can jeopardize the project’s success.
Start implementation with these to address that risk early on. In other words, shift implementation of risky features left in the project timeline.
Why? Shifting these risky tasks left allows you to discover potential showstoppers early on: external endpoints that are too slow for your needs, services that cannot handle the scale of requests being made, features that turn out to be flat-out unavailable.
The earlier the modeling team is aware of feature problems, the earlier they can adapt—using proxies instead or even dropping them altogether.
As soon as features have been selected, you can start the implementation discovery work (brainstorming, designing implementation strategies, discussing with other engineers, etc.). The earlier, the better.
Shifting risk left is good advice in all parts of any project, but especially during feature implementation in real-time ML, because of cascading changes that will trigger new modeling rounds and decision layer refitting (see Figure 2).
De-risk: Deploy the model in shadow mode as early as possible
Risks Addressed: Model response time is too high | Model performance at inference time is different from training time | Chosen features aren’t available at inference time
The final leg of the implementation—where you connect the caller service to the real-time model—is the riskiest part of the project from an engineering perspective. Many unforeseen issues appear when the time comes to connect regular services to real-time ML models.
But it doesn't need to be that way. By adopting a shadow-mode deployment from the outset, you can simulate an end-to-end flow without waiting for the project to finish.
Shadow-mode deployment is an applied ML pattern whereby you deploy and call a real-time ML model on production traffic but discard its response at the very last moment, for example via a feature flag. Shadow-mode deployments are great for de-risking projects because you can observe how the model behaves in a "real-life" scenario without exposing yourself to any business risk. Figure 6 provides a visual representation.
Deploying a model in shadow mode is useful in and of itself. But doing it at the start of the project—even before all the features have been implemented—is even better:
Figure 7 below shows this: enabling shadow-mode at the start de-risks the project and makes it more efficient.
Sure, shadow mode may demand upfront setup work, but it pays off, making real-time model integration smoother and safer.
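A minimal sketch of the pattern, in the spirit of Figure 6. The flag name, the rule-based fallback, and the model stub are all hypothetical stand-ins:

```python
# Sketch: shadow-mode deployment behind a feature flag. The model is called on
# real traffic and its output is logged, but the legacy decision is what ships.
import logging

logging.basicConfig(level=logging.INFO)
shadow_mode_enabled = True  # flip to False only when you trust the model

def legacy_decision(request: dict) -> str:
    # Existing business rule the model is meant to replace.
    return "approve" if request["income"] > 2000 else "deny"

def model_predict(request: dict) -> float:
    # Stand-in for the real-time model call (features fetched, score returned).
    return 0.12

def handle_request(request: dict) -> str:
    score = model_predict(request)
    logging.info("shadow score=%s", score)  # observe, compare, alert
    if shadow_mode_enabled:
        return legacy_decision(request)     # model output is discarded
    return "approve" if score < 0.2 else "deny"

print(handle_request({"income": 3000}))
```

Because the logged shadow scores can be compared against what actually happened, you get real-world performance and latency numbers long before the model makes a single business decision.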
Conclusion
The truth is that real-time ML is hard because it involves both ML models and real-time services, each of which is a complex system with many parts that must work together flawlessly.
We listed many of the failure modes of real-time ML, but note that many of those are related to ML in general, with some particularities for the real-time setting.
We also suggested several practices to help you avoid those failure modes. Even if you follow all of them, the project may still fail (life happens), but your chances of success will be higher.
And remember: everything in this post is about getting the model out the door in the first place. After that, it's still a tough road to keep it operating. Once deployed, you'll need careful monitoring (especially for train-serve skew) and alerting to make sure everything keeps working well.