Not so long ago, machine learning applications started blossoming across industries. Companies swiftly adapted their existing infrastructure to ship machine learning models that generate predictions in batches, and that works well enough. Now everyone seems to be talking about how real-time machine learning is the future. But is it for real? Is it worth the extra effort required to put real-time models in production, and to maintain them there?
If you want a mini roadmap to help you reason about when and how to build real-time machine learning models, keep reading.
What are real-time models?
There isn’t a single definition of what a real-time model is.
The first thought that likely comes to mind is real-time learning: a model that continually receives new training data and updates its parameters. This is certainly exciting, but still rare in real life. Real-time can also mean real-time inference: a trained model that can receive requests at any time and return predictions synchronously. This is far more common in practice.
In this article, we present strategies for building models that make predictions in real time.
Why use real-time models?
Sometimes you have to use a real-time model simply because the problem you are tackling requires instant decision making.
Suppose you were asked to come up with an ML-based solution to help scale and improve customer service. Every time a user opens the chat in our app and writes a message, we should automatically identify what they are talking about and act accordingly (e.g. redirect them to a chat with a human specialist).
We might frame this problem as a multi-class classification problem, in which classes could be different products (such as credit card, savings account, investments, and so on), and build a classifier that receives the text written by the user and returns the product they are most likely to be talking about.
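As a rough illustration, a minimal version of such a classifier could be a bag-of-words pipeline in scikit-learn. This is just a sketch: the example texts and labels below are made up for illustration, not Nubank's actual data or model:

```python
# A minimal sketch of the product classifier (illustrative data and labels)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["my credit card was declined", "how do I open a savings account"]
labels = ["credit_card", "savings_account"]

# Vectorize the text and fit a linear classifier in a single pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["card not working"]))  # likely ['credit_card']
```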
This use case requires a real-time model because the model feeds off freshly generated data, and also because the user expects a quick answer. From that specific use case, we can come up with some rules of thumb to help us decide when a real-time model is ideal (or required):

- the model's inputs only become available at request time (such as the message the user has just typed), so predictions cannot be precomputed in batch;
- the consumer of the prediction (a user or another system) expects a response within seconds or less.
Alright. There are tons of good reasons for building real-time models. Now it’s just a matter of deploying them. Should be quick and easy, right? We could do that in less than 10 lines of code:
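Something like the sketch below, assuming a pickled scikit-learn style model served with Flask. The file name and endpoint here are illustrative, not an actual service:

```python
# naive_service.py — a deliberately naive real-time model service
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))  # hypothetical model artifact

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    return jsonify({"product": str(model.predict([text])[0])})
```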
Not so fast.
How to build real-time models?
Since infrastructure constraints are likely to impact modeling decisions, building a real-time model requires very close collaboration between the data scientist and the machine learning engineer.
We will talk about two requirements to keep in mind from the very start of development: a real-time pipeline and fast inference.
Real-time Pipeline
A real-time pipeline should gather and prepare all the inputs required by the model. Data may be collected from different sources:

- the request payload itself (e.g. the message the user just wrote);
- streaming events generated as the user interacts with the app;
- a feature store, which serves historical features precomputed in batch.
After gathering the data, we still need to preprocess it. Historical data coming from the feature store is already preprocessed, whereas fresh data coming from the request payload or from streaming events arrives in its rawest form. We can now clearly see that we have two separate pipelines: a batch pipeline and a real-time pipeline.
We want to make sure that the preprocessing function applied to data in the batch pipeline during training is exactly the same function applied to data in the real-time pipeline during inference. A mismatch between the two is known as train-serve skew.
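One common way to guard against this skew is to define the preprocessing in a single module and import it from both pipelines. A minimal sketch, with illustrative module and function names:

```python
# preprocessing.py — single source of truth for text preprocessing,
# imported by both the batch (training) and the real-time (serving) pipelines
import re

def preprocess(text: str) -> str:
    """Normalize raw chat text identically in training and serving."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace
    return text

# Batch pipeline:     features = [preprocess(t) for t in historical_texts]
# Real-time pipeline: features = [preprocess(incoming_message)]
```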
Fast Inference
Recall that super cutting-edge neural network you built that achieved something like 99% accuracy across all classes? If you measure its prediction time, you might be surprised to find that it takes several seconds. That might sound fast for a big neural network, but for serving predictions in real time it isn't.
A response that is considered fast usually takes milliseconds. Think about how long the user would be willing to wait before they retry an action or simply leave the app.
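A quick way to check where your model stands is to time it directly. Here is a minimal sketch using only the standard library, where model stands for whatever estimator you are serving:

```python
import time

def measure_latency_ms(model, payload, n_runs: int = 100) -> float:
    """Average wall-clock prediction time in milliseconds."""
    start = time.perf_counter()
    for _ in range(n_runs):
        model.predict(payload)
    return (time.perf_counter() - start) / n_runs * 1000.0
```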
Using more powerful hardware (such as GPUs) seems like a reasonable quick fix, but it might be harder to maintain in the long term, since it would likely be a non-standardized solution and require closer monitoring. Moreover, the overall response time might still not be fast enough: a heavy model running inference on a GPU pays considerable communication overhead moving data between CPU and GPU.
On the other hand, building lighter models is more cost-efficient and easier to maintain. If we were using light models, we would be able to scale machine learning services horizontally just like regular microservices, possibly using already existing in-company tools.
Heavy models can be compressed using various techniques, such as:

- pruning, which removes weights (or entire neurons) that contribute little to the output;
- quantization, which stores weights and/or activations in lower-precision types such as int8;
- knowledge distillation, which trains a smaller "student" model to mimic the large one.
It is worth noting that pruning and quantization are available both in TensorFlow and in PyTorch, so it should be quick and easy to run experiments combining different techniques.
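As an illustration, here is a minimal sketch of post-training dynamic quantization in PyTorch. The model below is a stand-in, and actual speedups depend on the architecture and hardware:

```python
import torch

# A stand-in for a heavier trained model
model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```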
Besides compressing models, we may also evaluate caching some predictions. In our use case, after preprocessing the text input we might end up with inputs that repeat frequently. In that case, we would call the model only the first time a given input is seen; subsequent calls would fetch the prediction directly from the cache.
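A minimal sketch of that idea, keyed on the preprocessed text. In production you would more likely use a shared cache such as Redis instead of in-process memory:

```python
from functools import lru_cache

def make_cached_predict(model, maxsize: int = 10_000):
    """Wrap a trained model so repeated inputs skip inference."""
    @lru_cache(maxsize=maxsize)
    def cached_predict(preprocessed_text: str) -> str:
        # The model runs only on a cache miss; repeated inputs are
        # answered straight from memory. Inputs must be hashable.
        return str(model.predict([preprocessed_text])[0])
    return cached_predict
```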
But… is this real life?
It sure is! Most companies begin their machine learning journeys by experimenting with batch models, since they are perceived as an easier and safer approach. However, as machine learning experts and business stakeholders work together to discover new areas where machine learning could be applied to maximize value, problems that require real-time models (such as the chat model we’ve talked about) inevitably arise.
Tons of companies are already shipping real-time machine learning models in a safe and scalable way, including Nubank. If you're curious about what's possible with real-time machine learning systems, come join us.
To learn more about real-time models, check out the recording of Ana’s talk at the Building Nu Meetup:
Written by Ana Martinazzo
Reviewed by Felipe Almeida