Written by Felipe Almeida, with contributions from Marcelo Koga, Raphael Soares, Gabriel Bakiewicz and George Salvino
Machine learning models start decaying the moment they are put to use. In fact, they must lose performance over time, because they are trained on a static picture of the world (a training set).
The world, however, keeps changing, making that snapshot obsolete as time goes by. These changes in the world (as represented by the features used) are usually referred to as “concept drift” or “data drift”.
The speed at which a model decays depends on several factors, for example:
There are risks and many tradeoffs involved in automatic retraining, and it’s probably only suitable when you already have a solid, well-developed Machine Learning Operations (MLOps) flow to start with, but there are clear benefits too.
In the next sections we’ll go over some of the lessons we learned over the past few years retraining models automatically at Nubank.
Prerequisites: CI/CD, monitoring and a solid data pipeline
If you can’t confidently deploy a new version of a model with a single click, it will not be possible to have automatic retraining.
At the very least you need these three components in your MLOps architecture:
“Automatic” usually doesn’t mean fully automatic
Fully automatically retraining and deploying a new model is probably not worth the risk. Deploying a faulty model and having it make faulty decisions in production is one of the worst things that can happen in a data-driven organization.
One alternative is to have the last step of the pipeline simply open a pull request (PR) on GitHub, with a comparison of the performance of the baseline and challenger models. The PR still needs to be manually approved and merged before triggering an actual deployment that replaces the current model in production.
In our experience, this is a good cost-benefit tradeoff between having no automation at all on the one hand and having the process fully automated (no manual intervention at all) on the other.
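The last pipeline step described above might be sketched roughly as follows. This is a hypothetical illustration, not our actual pipeline code: the metric values, branch name, and improvement margin are made up, and the PR is opened by shelling out to the real GitHub CLI (`gh pr create`), guarded behind a `dry_run` flag so a human still reviews and merges.

```python
import subprocess

def should_open_pr(baseline_auc: float, challenger_auc: float,
                   min_gain: float = 0.002) -> bool:
    """Only propose a deployment if the challenger beats the baseline by a margin."""
    return challenger_auc - baseline_auc >= min_gain

def open_retrain_pr(baseline_auc: float, challenger_auc: float,
                    branch: str, dry_run: bool = True) -> str:
    """Open a PR summarizing the comparison; merging it (a manual step)
    is what triggers the actual deployment."""
    body = (
        "Automated retrain.\n"
        f"Baseline AUC: {baseline_auc:.4f}\n"
        f"Challenger AUC: {challenger_auc:.4f}\n"
    )
    cmd = ["gh", "pr", "create", "--head", branch,
           "--title", "Automated model retrain", "--body", body]
    if dry_run:
        return " ".join(cmd)  # show what would run instead of running it
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

The point is that the automation stops at the PR: the merge button is the manual gate between the retraining pipeline and production.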
Don’t forget the policy layer
Let’s assume the predictions from your model are actually being used by some downstream team/system (if a model isn’t being used, it’s a liability, not an asset — and you should probably retire it).
The way other teams use the predictions output by the model is called the policy. The simplest way to derive a policy from a model score is by using a simple if-statement whereby some action (e.g. give the loan, purchase an asset) is to be taken if the model score is above some given threshold.
Whenever you retrain a model, you may need to adjust the policy too. This is crucial: if you forget to adjust it, all downstream uses of the model may become inaccurate.
Depending on the complexity of the policy, you will find that it’s way harder to retrain or refit the policy than it is to retrain a model. That may become a bottleneck in your pipeline if you are not careful.
In our experience, one way to make policy adjustments easier is to make your models use calibrated probability predictions.
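To see why calibration helps, here is a minimal sketch (with made-up costs) of a threshold policy over a calibrated probability of default. When the score is a calibrated probability, the threshold can be derived directly from business costs, and it stays valid across retrains as long as each new model is also calibrated:

```python
def approval_threshold(loss_if_default: float, profit_if_repaid: float) -> float:
    """For a calibrated P(default), approving has positive expected value while
    (1 - p) * profit_if_repaid > p * loss_if_default,
    which rearranges to p < profit / (profit + loss)."""
    return profit_if_repaid / (profit_if_repaid + loss_if_default)

def policy(p_default: float, threshold: float) -> str:
    """The simplest possible policy: a single if-statement on the score."""
    return "approve" if p_default < threshold else "decline"

# hypothetical numbers: losing R$1000 on default, earning R$100 on repayment
threshold = approval_threshold(loss_if_default=1000, profit_if_repaid=100)
```

With uncalibrated scores, by contrast, the threshold is an arbitrary cut point that has to be refit every time the score distribution shifts.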
Use fair comparison benchmarks
After successfully retraining a model on newer data, you probably want to compare its performance against the current model in production.
There are many ways to compare models:
This means that the benchmark dataset must be out-of-sample from the point of view of both models, otherwise you risk data leakage from either training set and your results will not be trustworthy.
P.S.: If, for any reason, you don’t have access to an unbiased dataset to test both models against, you may want to deploy both versions (baseline and challenger) together as part of an A/B test (and monitor them closely!).
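As a concrete sketch of a fair comparison, the snippet below scores both models on a single holdout set that is out-of-sample for both, using a hand-rolled AUC (the rank-based definition) so the example stays dependency-free. The labels and scores are invented for illustration:

```python
def auc(labels, scores):
    """Probability that a random positive example is ranked above a
    random negative one (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# one benchmark set, unseen by BOTH training runs, scored by both models
holdout_labels = [1, 1, 0, 0]
baseline_scores = [0.9, 0.4, 0.5, 0.1]
challenger_scores = [0.8, 0.6, 0.3, 0.2]

baseline_auc = auc(holdout_labels, baseline_scores)      # 0.75
challenger_auc = auc(holdout_labels, challenger_scores)  # 1.0
```

Scoring each model on its own validation split instead would leak each training set's quirks into its own metric, which is exactly the trap the section warns about.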
Be careful with feedback loops
When talking about Machine Learning models, the term Feedback Loop refers to cases where the use of the model affects the future training set it will be retrained on.
There are several examples of feedback loops in Machine Learning:
Why are feedback loops a problem for automatic retraining? They aren’t necessarily always a problem, but they are definitely something you need to be aware of, and maybe mitigate in some way, depending on your business needs.
It may become a problem if the bias introduced by the model changes your training data in a way that hinders model training. This is not a simple problem to solve, but there are some ways to go about it, for example:
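One common mitigation (sketched below with hypothetical names and rates) is to route a small random slice of traffic around the model's policy, so that unbiased outcomes keep flowing into future training sets regardless of what the model would have decided:

```python
import random

def decide(customer_id: str, p_default: float, threshold: float,
           exploration_rate: float = 0.01, seed: int = 0) -> str:
    """A small random fraction of traffic bypasses the model policy, so the
    true outcome is observed for those examples no matter the score. This
    keeps a sliver of the future training data free of the model's bias."""
    rng = random.Random(f"{seed}:{customer_id}")  # deterministic per customer
    if rng.random() < exploration_rate:
        return "approve"  # exploration: observe the outcome unconditionally
    return "approve" if p_default < threshold else "decline"
```

The exploration rate is a business decision: a larger slice gives cleaner training data at the cost of worse short-term decisions on that slice.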
Unit-testing (sort of) for ML models
Ideally we would like to be able to guarantee that a newly retrained model will be better than the current version for every single example scored.
It’s usually impossible to have that sort of guarantee because model performance is statistical in nature: saying that one model performs better than another really only means that it’s better on average.
One way to get some peace of mind is to keep a handful of synthetic examples that any reasonable model should classify correctly, score these “dummy” examples with the retrained model, and verify that it produces the expected outcome.
The expected outcome may be the predicted class in case of classification or a score within some range if you have a regression model.
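A minimal sketch of such a check, with an invented toy model and made-up dummy cases, could look like this:

```python
def sanity_check(model_predict, dummy_cases):
    """Score obvious synthetic examples and collect any that come back wrong.
    An empty result means the retrained model passes; anything else should
    block the deployment, like a failing unit test."""
    failures = []
    for features, expected in dummy_cases:
        got = model_predict(features)
        if got != expected:
            failures.append((features, expected, got))
    return failures

# hypothetical stand-in model: flags transactions above a limit as fraud (1)
def toy_model(features):
    return int(features["amount"] > 10_000)

dummy_cases = [
    ({"amount": 50}, 0),          # clearly legitimate
    ({"amount": 1_000_000}, 1),   # clearly fraudulent
]
```

For a regression model, the same idea applies with a range check (e.g. the predicted value must fall inside an expected interval) instead of an exact class match.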
Evaluating performances across subpopulations/strata also helps, as you can see below.
Evaluate performance across subpopulations/strata
It’s very likely that the examples scored by your model can be split into separate subpopulations or loose groups. For example:
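A per-segment evaluation can be sketched in a few lines. The segment names below are hypothetical; the point is that an overall metric can look flat while one subpopulation quietly regresses:

```python
from collections import defaultdict

def accuracy_by_group(rows):
    """rows: iterable of (group, label, prediction) triples.
    Returns the accuracy within each group, so a regression in one
    subpopulation isn't masked by the overall average."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, label, prediction in rows:
        totals[group] += 1
        hits[group] += int(label == prediction)
    return {group: hits[group] / totals[group] for group in totals}

rows = [
    ("new_customers", 1, 1), ("new_customers", 0, 1),
    ("tenured_customers", 1, 1), ("tenured_customers", 0, 0),
]
```

The same grouping logic works for any metric: swap the accuracy tally for AUC, calibration error, or whatever your benchmark uses, computed per group.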
Other tips
Don’t couple the retraining pipeline with a specific model
When you write a pipeline (e.g. in Airflow, Kubeflow, or standalone scripts) to retrain the model on newer data and run all analyses, make sure it’s written in a way that isn’t coupled to a single model or use case.
Be sure to follow standard software engineering practices: pass variables as arguments so that the pipeline can be used for other models, for arbitrary date periods, using custom comparison or evaluation strategies, etc.
Take label censoring into account
Labels are often obtained from different data sources than the features. Moreover, labels usually lag the scoring time.
This means that the ground truth label for the example scored today might only be available after some time (e.g. when the loan was repaid/defaulted on).
Don’t forget to take this into account when deciding the date periods on which to train/validate retrained models (i.e. you should probably ignore (or censor) data for which targets aren’t available yet).
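Censoring can be as simple as filtering by a maturity window before building the training set. The 90-day window below is a made-up example; in practice it is however long your label takes to materialize:

```python
from datetime import date, timedelta

def mature_examples(scored, maturity_days, today):
    """Keep only examples old enough for their label to be observable.
    scored: iterable of (score_date, example) pairs. Examples scored more
    recently than the maturity window are censored (dropped), since their
    ground truth isn't known yet."""
    cutoff = today - timedelta(days=maturity_days)
    return [item for item in scored if item[0] <= cutoff]

scored = [
    (date(2024, 1, 1), "loan_a"),  # old enough: repayment outcome is known
    (date(2024, 6, 1), "loan_b"),  # too recent: label not yet observable
]
training_rows = mature_examples(scored, maturity_days=90, today=date(2024, 6, 30))
```

Forgetting this filter silently trains on examples whose labels default to "no event yet", which biases the model toward the negative class.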
Take costs into account when deciding how often to retrain
There are usually costs associated with retraining models, some of which are:
These costs must be weighed against the expected benefits of retraining a model (usually improved performance relative to the baseline); otherwise it doesn’t make economic sense to do it.
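As a back-of-the-envelope sketch of that tradeoff (all figures hypothetical and in the same currency unit), the decision reduces to comparing cumulative expected gain against the one-off retraining cost:

```python
def retrain_worth_it(expected_gain_per_month: float,
                     months_until_next_retrain: int,
                     retrain_cost: float) -> bool:
    """Rough economic check: retrain only if the benefit accumulated until
    the next scheduled retrain outweighs the compute + review cost of
    retraining now."""
    total_gain = expected_gain_per_month * months_until_next_retrain
    return total_gain > retrain_cost
```

In practice the "expected gain" term is the hard part to estimate; backtesting how much past retrains improved the business metric is one way to get a defensible number.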
Tool suggestions
Kubeflow Pipelines (kfp)
Kubeflow has a component called Kubeflow Pipelines (kfp for short) that can be used to build pipelines and workflows, including retraining workflows: building training datasets, training classifiers, and running evaluation routines, for example.
We use this at Nubank and it has served us well so far.
Papermill
Papermill is a tool used to make Jupyter notebooks executable and parameterizable (i.e. you can execute notebooks as if they were scripts and pass parameters to them).
It’s a useful tool to bridge the gap between development-time code (usually created by Data Scientists) and production-time code.