Written by Felipe Almeida, with contributions from Marcelo Koga, Raphael Soares, Gabriel Bakiewicz and George Salvino
Machine learning models start decaying the moment they are put to use. In fact, they must lose performance over time, because they are trained on a static picture of the world (a training set).
The world, however, keeps changing, making that snapshot obsolete as time goes by. These changes in the world (as represented by the features used) are usually referred to as “concept drift” or “data drift”.
The speed at which a model decays depends on several factors, for example:
There are risks and many tradeoffs involved in automatic retraining, and it’s probably only suitable when you already have a solid, well-developed Machine Learning Operations (MLOps) flow to start with, but there are clear benefits too.
In the next sections we’ll go over some of the lessons we learned over the past few years retraining models automatically at Nubank.
Prerequisites: CI/CD, monitoring and a solid data pipeline
If you can’t confidently deploy a new version of a model with a single click, it will not be possible to have automatic retraining.
At the very least you need these three components in your MLOps architecture:
“Automatic” usually doesn’t mean fully automatic
Fully automatically retraining and deploying a new model is probably not worth the risk. Deploying a faulty model and having it make faulty decisions in production is one of the worst things that can happen in a data-driven organization.
One alternative is to have the last step of the pipeline simply open a pull request (PR) on GitHub, with a comparison of the performance of the baseline and challenger models. The PR still needs to be manually approved and merged before triggering an actual deployment that replaces the current model in production.
In our experience, this is a good cost-benefit tradeoff between having no automation at all on the one hand and having the process fully automated (no manual intervention at all) on the other.
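The last pipeline step described above might be sketched roughly as follows. This is a hypothetical illustration, not our actual pipeline code: the metric values, branch name, and improvement margin are made up, and the PR is opened by shelling out to the real GitHub CLI (`gh pr create`), guarded behind a `dry_run` flag so a human still reviews and merges.

```python
import subprocess

def should_open_pr(baseline_auc: float, challenger_auc: float,
                   min_gain: float = 0.002) -> bool:
    """Only propose a deployment if the challenger beats the baseline by a margin."""
    return challenger_auc - baseline_auc >= min_gain

def open_retrain_pr(baseline_auc: float, challenger_auc: float,
                    branch: str, dry_run: bool = True) -> str:
    """Open a PR summarizing the comparison; merging it (a manual step)
    is what triggers the actual deployment."""
    body = (
        "Automated retrain.\n"
        f"Baseline AUC: {baseline_auc:.4f}\n"
        f"Challenger AUC: {challenger_auc:.4f}\n"
    )
    cmd = ["gh", "pr", "create", "--head", branch,
           "--title", "Automated model retrain", "--body", body]
    if dry_run:
        return " ".join(cmd)  # show what would run instead of running it
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

The point is that the automation stops at the PR: the merge button is the manual gate between the retraining pipeline and production.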
Don’t forget the policy layer
Let’s assume the predictions from your model are actually being used by some downstream team/system (if a model isn’t being used, it’s a liability, not an asset — and you should probably retire it).
The way other teams use the predictions output by the model is called the policy. The simplest way to derive a policy from a model score is by using a simple if-statement whereby some action (e.g. give the loan, purchase an asset) is to be taken if the model score is above some given threshold.
Whenever you retrain a model, you may need to adjust the policy too. This is crucial: if you forget to adjust it, all downstream uses of the model may become inaccurate.
Depending on the complexity of the policy, you will find that it’s way harder to retrain or refit the policy than it is to retrain a model. That may become a bottleneck in your pipeline if you are not careful.
In our experience, one way to make policy adjustments easier is to make your models use calibrated probability predictions.
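To see why calibration helps, here is a minimal sketch (with made-up costs) of a threshold policy over a calibrated probability of default. When the score is a calibrated probability, the threshold can be derived directly from business costs, and it stays valid across retrains as long as each new model is also calibrated:

```python
def approval_threshold(loss_if_default: float, profit_if_repaid: float) -> float:
    """For a calibrated P(default), approving has positive expected value while
    (1 - p) * profit_if_repaid > p * loss_if_default,
    which rearranges to p < profit / (profit + loss)."""
    return profit_if_repaid / (profit_if_repaid + loss_if_default)

def policy(p_default: float, threshold: float) -> str:
    """The simplest possible policy: a single if-statement on the score."""
    return "approve" if p_default < threshold else "decline"

# hypothetical numbers: losing R$1000 on default, earning R$100 on repayment
threshold = approval_threshold(loss_if_default=1000, profit_if_repaid=100)
```

With uncalibrated scores, by contrast, the threshold is an arbitrary cut point that has to be refit every time the score distribution shifts.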
Use fair comparison benchmarks
After successfully retraining a model on newer data, you probably want to compare its performance against the current model in production.
There are many ways to compare models:
This means that the benchmark dataset must be out-of-sample from the point of view of both models, otherwise you risk data leakage from either training set and your results will not be trustworthy.
P.S.: If, for any reason, you don’t have access to an unbiased dataset to test both models against, you may want to deploy both versions (baseline and challenger) together as part of an A/B test (and monitor them closely!).
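As a concrete sketch of a fair comparison, the snippet below scores both models on a single holdout set that is out-of-sample for both, using a hand-rolled AUC (the rank-based definition) so the example stays dependency-free. The labels and scores are invented for illustration:

```python
def auc(labels, scores):
    """Probability that a random positive example is ranked above a
    random negative one (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# one benchmark set, unseen by BOTH training runs, scored by both models
holdout_labels = [1, 1, 0, 0]
baseline_scores = [0.9, 0.4, 0.5, 0.1]
challenger_scores = [0.8, 0.6, 0.3, 0.2]

baseline_auc = auc(holdout_labels, baseline_scores)      # 0.75
challenger_auc = auc(holdout_labels, challenger_scores)  # 1.0
```

Scoring each model on its own validation split instead would leak each training set's quirks into its own metric, which is exactly the trap the section warns about.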
Be careful with feedback loops
When talking about Machine Learning models, the term Feedback Loop refers to cases where the use of the model affects the future training set it will be retrained on.
There are several examples of feedback loops in Machine Learning:
Why are feedback loops a problem for automatic retraining? They aren’t necessarily always a problem, but they are definitely something you need to be aware of, and maybe mitigate in some way, depending on your business needs.
It may become a problem if the bias introduced by the model changes your training data in a way that hinders model training. This is not a simple problem to solve, but there are some ways to go about it, for example:
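One common mitigation (sketched below with hypothetical names and rates) is to route a small random slice of traffic around the model's policy, so that unbiased outcomes keep flowing into future training sets regardless of what the model would have decided:

```python
import random

def decide(customer_id: str, p_default: float, threshold: float,
           exploration_rate: float = 0.01, seed: int = 0) -> str:
    """A small random fraction of traffic bypasses the model policy, so the
    true outcome is observed for those examples no matter the score. This
    keeps a sliver of the future training data free of the model's bias."""
    rng = random.Random(f"{seed}:{customer_id}")  # deterministic per customer
    if rng.random() < exploration_rate:
        return "approve"  # exploration: observe the outcome unconditionally
    return "approve" if p_default < threshold else "decline"
```

The exploration rate is a business decision: a larger slice gives cleaner training data at the cost of worse short-term decisions on that slice.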
Unit-testing (sort of) for ML models
Ideally we would like to be able to guarantee that a newly retrained model will be better than the current version for every single example scored.
It’s usually impossible to have that sort of guarantee because model performance is statistical in nature: saying that one model performs better than another really only means that it’s better on average.
One way to get some peace of mind is to keep a handful of synthetic examples that any reasonable model should classify correctly, score these “dummy” examples with the retrained model, and verify that it produces the expected outcome.
The expected outcome may be the predicted class in case of classification or a score within some range if you have a regression model.
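A minimal sketch of such a check, with an invented toy model and made-up dummy cases, could look like this:

```python
def sanity_check(model_predict, dummy_cases):
    """Score obvious synthetic examples and collect any that come back wrong.
    An empty result means the retrained model passes; anything else should
    block the deployment, like a failing unit test."""
    failures = []
    for features, expected in dummy_cases:
        got = model_predict(features)
        if got != expected:
            failures.append((features, expected, got))
    return failures

# hypothetical stand-in model: flags transactions above a limit as fraud (1)
def toy_model(features):
    return int(features["amount"] > 10_000)

dummy_cases = [
    ({"amount": 50}, 0),          # clearly legitimate
    ({"amount": 1_000_000}, 1),   # clearly fraudulent
]
```

For a regression model, the same idea applies with a range check (e.g. the predicted value must fall inside an expected interval) instead of an exact class match.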
Evaluating performances across subpopulations/strata also helps, as you can see below.
Evaluate performance across subpopulations/strata
It’s very likely that the examples scored by your model can be split into separate subpopulations or loose groups. For example:
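A per-segment evaluation can be sketched in a few lines. The segment names below are hypothetical; the point is that an overall metric can look flat while one subpopulation quietly regresses:

```python
from collections import defaultdict

def accuracy_by_group(rows):
    """rows: iterable of (group, label, prediction) triples.
    Returns the accuracy within each group, so a regression in one
    subpopulation isn't masked by the overall average."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, label, prediction in rows:
        totals[group] += 1
        hits[group] += int(label == prediction)
    return {group: hits[group] / totals[group] for group in totals}

rows = [
    ("new_customers", 1, 1), ("new_customers", 0, 1),
    ("tenured_customers", 1, 1), ("tenured_customers", 0, 0),
]
```

The same grouping logic works for any metric: swap the accuracy tally for AUC, calibration error, or whatever your benchmark uses, computed per group.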
Other tips
Don’t couple the retraining pipeline with a specific model
When you write a pipeline (e.g. in Airflow, Kubeflow, or standalone scripts) to retrain the model on newer data and run all analyses, make sure it’s written in a way that isn’t coupled to a single model or use case.
Be sure to follow standard software engineering practices: pass variables as arguments so that the pipeline can be used for other models, for arbitrary date periods, using custom comparison or evaluation strategies, etc.
Take label censoring into account
Labels are often obtained from different data sources than the features. Moreover, labels usually lag the scoring time.
This means that the ground truth label for the example scored today might only be available after some time (e.g. when the loan was repaid/defaulted on).
Don’t forget to take this into account when deciding the date periods on which to train/validate retrained models (i.e. you should probably ignore (or censor) data for which targets aren’t available yet).
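Censoring can be as simple as filtering by a maturity window before building the training set. The 90-day window below is a made-up example; in practice it is however long your label takes to materialize:

```python
from datetime import date, timedelta

def mature_examples(scored, maturity_days, today):
    """Keep only examples old enough for their label to be observable.
    scored: iterable of (score_date, example) pairs. Examples scored more
    recently than the maturity window are censored (dropped), since their
    ground truth isn't known yet."""
    cutoff = today - timedelta(days=maturity_days)
    return [item for item in scored if item[0] <= cutoff]

scored = [
    (date(2024, 1, 1), "loan_a"),  # old enough: repayment outcome is known
    (date(2024, 6, 1), "loan_b"),  # too recent: label not yet observable
]
training_rows = mature_examples(scored, maturity_days=90, today=date(2024, 6, 30))
```

Forgetting this filter silently trains on examples whose labels default to "no event yet", which biases the model toward the negative class.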
Take costs into account when deciding how often to retrain
There are usually costs associated with retraining models, some of which are:
These costs must be weighed against the expected benefits of retraining a model (usually improved performance relative to the baseline); otherwise it doesn’t make economic sense to do it.
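As a back-of-the-envelope sketch of that tradeoff (all figures hypothetical and in the same currency unit), the decision reduces to comparing cumulative expected gain against the one-off retraining cost:

```python
def retrain_worth_it(expected_gain_per_month: float,
                     months_until_next_retrain: int,
                     retrain_cost: float) -> bool:
    """Rough economic check: retrain only if the benefit accumulated until
    the next scheduled retrain outweighs the compute + review cost of
    retraining now."""
    total_gain = expected_gain_per_month * months_until_next_retrain
    return total_gain > retrain_cost
```

In practice the "expected gain" term is the hard part to estimate; backtesting how much past retrains improved the business metric is one way to get a defensible number.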
Tool suggestions
Kubeflow Pipelines (kfp)
Kubeflow has a component called Kubeflow Pipelines (kfp for short) that can be used to build pipelines and workflows, including retraining workflows: building training datasets, training classifiers, and running evaluation routines, for example.
We use this at Nubank and it has served us well so far.
Papermill
Papermill is a tool used to make Jupyter notebooks executable and parameterizable (i.e. you can execute notebooks as if they were scripts and pass parameters to them).
It’s a useful tool to bridge the gap between development-time code (usually created by Data Scientists) and production-time code.