Written by: Felipe Almeida

Reviewed by: Gustavo Durães, Juliano Garcia, Renata Andrade, Moisés Rojo, Mateus Magalhães

Introduction

Real-time machine learning (ML) models are frequently used to drive decisions in business processes. While they are more complex to implement and operate compared to batch models, they are often more effective because they leverage real-time data as features and they unlock use cases where immediate user feedback is essential. 

Heads-up: Although much of the content in this article also applies to GenAI and Computer Vision, the focus is on traditional, tabular ML!

Traditional (i.e. non-GenAI) ML-enabled real-time systems usually work as follows: an external action (e.g. a user interacting with a digital bank mobile app) sets off a chain of events that causes services to send data to a real-time model, which then outputs a prediction based on that data. That prediction is passed on to another service (sometimes called the decision layer) to determine the appropriate business decision (e.g. issue or reject a loan request, accept or deny an insurance policy request). A simple example can be seen in Figure 1:

Figure 1: Traditional ML-enabled real-time system. After users interact with a system (1), real-time data is turned into features (2) which are fed to a real-time model. The model generates predictions (3) which are passed on to a decision layer, and then decisions are passed back (4) to the caller service. Finally, a side-effect or response is passed back to the user (5).


Features used in real-time ML models often come from user behavior patterns and signals, such as geolocation data, social media engagement, or past user actions. This means that some features may directly be under the user’s control.

Most people are aware that ML models are used in digital systems today (e.g., estimating risk, recommending content, ranking items), and some may attempt to manipulate the feature values those models consume in order to influence the system's behavior. Such attempts, usually by ill-intentioned actors, are called adversarial attacks.

Malicious attempts to manipulate the values of features to modify model outputs are called adversarial attacks.

Models may be prone to adversarial attacks if two conditions are met: (a) features can be controlled in some way by users and (b) there exists an incentive for people to manipulate the model. Real-time ML models are even more vulnerable than batch ones; the learning cycles are much shorter and decisions based on the models' output may happen immediately, with no human intervention.

Real-time ML models are doubly vulnerable to adversarial attacks; business decisions are often made automatically from model outputs – and it may take a while for attacks to be detected

Let’s explore some examples where real-time ML models may be vulnerable to adversarial attacks:

Example 1: Real-time credit underwriting

The decision of whether or not a retail bank should give out a loan to a customer is often made by ML models. Sometimes, by real-time models. The reason to prefer real-time over batch models is that quick decisions provide better UX and increase revenue, all else being equal. 

This presents a clear conflict; banks want to lend money to people who will repay it, but some customers may want to take out loans even if they are unsure they can repay them. Therefore, there is a financial incentive for some people to try to increase their chances of being approved for a loan by manipulating the ML model.

Figure 2: Simplified real-time credit underwriting system using a real-time ML model to predict the default risk.

Example 2: Real-time fraud detection in credit card (CC) purchases

Many organizations use ML to detect fraud. As in the previous example, this creates an incentive for bad actors to try and fool models, as criminals want to carry out their fraudulent operations without getting caught.

One particular example is when a bad actor uses leaked information to try and make purchases using somebody else’s credit card. To protect customers, many payment processors employ real-time ML to detect and prevent such fraudulent transactions. 

Figure 3: Many payment processors use systems such as the above to detect and prevent fraudulent CC purchases from taking place.

Example 3: Search Engine Ranking

Search engines use information from web pages to gauge their quality and determine how well they should rank in search results. This is usually done using ML/NLP ranking models.

The proper way to improve a webpage’s ranking is (as one would expect) to produce quality content that users want to click on and consume. The field of Search Engine Optimization (SEO) is built around this.

However, since ad revenue is a function of the number of views a website receives, there is a clear financial incentive for website managers to adversarially attack search engines to inflate their own rankings. This is called Spamdexing and includes many tactics such as using invisible text, link farms, and more.

Figure 4: Simplified flow for a hypothetical search engine that uses Web Page information to rank pages in order of relevance. Note that in this simple case there is no decision layer; many ranking algorithms output a ranked list directly.

As an ML practitioner using real-time ML in real-world scenarios, you want to make models less vulnerable to adversarial attacks and manipulation in general.

There are many ways to make them more robust (as we’ll see below), but the only sustainable long-term solution is to accept that user-facing models will eventually be gamed and to adopt a frequent retraining schedule — so blind spots are quickly incorporated into the next retrain cycle.

Before we start with the tips, let’s address two basic prerequisites: 1) understanding feature importance and 2) having some kind of monitoring setup in place.


Prerequisite 1: Understand feature importance and directionality

You need to know which features are the most important and which of those are vulnerable to user attacks and manipulation. This will help you identify where the majority of the risk lies and which features you need to focus on; it will also enable you to estimate the “blast radius” if a feature is exploited.

You need to know how the most important features impact the prediction and which of those can be exploited

At a minimum, you need to know:

  1. What features are the most important in terms of absolute predictive power;
  2. What features can be easily exploited by the user;
  3. How these features affect the output (how specific values change the output and in what direction, on average, they affect the model predictions).

Knowing the absolute importance of features in a model (1) is usually easy; most implementations provide some way of identifying which features carry the most signal. There are also “model-agnostic” strategies such as SHAP. Counterfactual explanations may also be useful, especially to probe situations where interactions (rather than individual features) may be exploitable.
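As a minimal sketch of the model-agnostic route, here is how one could estimate importances with scikit-learn’s permutation importance (a synthetic dataset stands in for your own data):

```python
# Model-agnostic feature importance via permutation importance: how much the
# validation score drops when each feature is shuffled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f}")
```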

Knowing which features are vulnerable to attacks (2) is 100% dependent on the model implementation. You may need to collaborate with different team members (software engineers, data engineers, etc.) to understand exactly what information sources are used to build the model features at inference time – and whether end-users have control over those.

The two subsections below cover item (3): how the values of a specific feature affect the overall model output.

Understand directional feature importance

Feature importance for a trained model is usually presented as a list of features ranked by how much each one contributed to the output, typically averaged over the training set.

But to protect models against abuse, it’s not enough to know that a feature is important; you also need to understand in what direction that feature impacts the output score. Knowing this is crucial because it will tell you where the incentive lies: attackers will naturally want to manipulate features in the direction that benefits them – to the detriment of the organization.

Directional feature importance: knowing not only how important a feature is, but whether it drives output scores up or down 

One of the best ways to see this effect is via the SHAP beeswarm summary plot, shown in Figure 5 below. In this example, the model output is the likelihood that a person made over US$50k in the 1990s. The Y-axis indicates the “absolute” feature importance, while the X-axis shows the direction of the impact on the final model score.

Figure 5: The SHAP beeswarm summary plot is great for analyzing the directional importance of features in a fitted model. In this example, large values of the “Age” feature increase the estimated income. On the other hand, large values of the “Marital Status” feature tend to decrease the estimated income. Source: SHAP documentation

The beeswarm summary plot is also useful for debugging features during model development: you can check whether the chart matches your intuition of how each feature should impact the model score, and spot errors such as faulty features.
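As a rough sketch of how such a plot is produced with the shap library (the dataset helper and XGBoost model mirror the example in the SHAP documentation; adapt to your own model and data):

```python
# Beeswarm summary plot: each dot is one instance, color encodes the feature
# value, and the X-axis shows whether it pushed the prediction up or down.
import shap
import xgboost

X, y = shap.datasets.adult()          # census income dataset (">50k" label)
model = xgboost.XGBClassifier().fit(X, y)

explainer = shap.Explainer(model, X)  # a TreeExplainer is selected automatically
shap_values = explainer(X[:1000])     # subsample to keep the example fast

shap.plots.beeswarm(shap_values)
```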

Understand value-dependent impact for key features

Understanding the rough correlation between features and the model output is useful, but knowing which values to watch out for is even more important. This matters especially in nonlinear models such as gradient-boosted trees or neural networks, where relationships between features and the score may be highly nonlinear and counterintuitive.

Attackers usually learn these relationships through trial and error and they will eventually discover not only which features affect the model score, but also which values of those features move the score up or down.

To stay ahead of attackers, you need to understand which feature values are particularly “dangerous”. Then you can build defenses (such as special monitoring) focused on those values or even decide to remove the feature from the model altogether.

A partial dependence (PD) plot shows, on the Y-axis, the impact of a feature on the model output over different values of that feature, shown on the X-axis.

Partial Dependence Plots show the effect of different values of a given feature on the final model score

For example, Figure 6 shows the impact of the “age” feature in a model trained to predict heart disease. Now suppose a health insurance provider uses this model to give financial benefits to people deemed at a high risk of having heart disease, and that “age” is self-reported by customers. Ill-intentioned actors trying to receive undue benefits will eventually learn that misreporting their age – especially as being over 54 – significantly increases their chances of being awarded those benefits.

Figure 6: For a simple tree model trained to estimate the risk of heart disease, the “age” feature (in days) drives the risk up sharply starting at around 54 years of age. This feature is vulnerable to adversarial attacks if there exists an incentive to manipulate the model output. Adapted from Zito Relova via Towards Data Science


Figure 6 above uses Python’s PDPBox library, but PD plots are also available in frameworks such as scikit-learn.
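For instance, a minimal sketch using scikit-learn’s PartialDependenceDisplay (the CSV file and column names here are hypothetical placeholders):

```python
# Partial dependence of the predicted risk on a single feature ("age").
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

df = pd.read_csv("heart_disease.csv")          # hypothetical training data
X, y = df.drop(columns=["disease"]), df["disease"]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Y-axis: average effect on the predicted risk; X-axis: values of "age".
PartialDependenceDisplay.from_estimator(model, X, features=["age"])
```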

Prerequisite 2: Have at least a basic monitoring setup

Model monitoring is crucial for observing how the model is working and, most importantly, how it’s impacting business decisions. Without at least some basic model monitoring, you can’t even detect when (or if) adversarial attacks are going on, let alone do anything about them.

You should at the very least have feature monitoring and decision layer monitoring.

Feature monitoring

By feature monitoring we mean charts showing the feature values over time, for every feature. For example, Figure 7 below shows what a monitoring asset for a given feature in a model could look like. Although basic, charts such as this are useful for assessing the overall health of your model and help detect sudden changes in the distribution, which could be indicative of exploit attempts.

Figure 7: A typical chart showing the values of a categorical feature over time. Charts such as this can be used for feature monitoring. Large deviations can be easily spotted. (source)
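As a rough illustration, a chart like Figure 7 can be produced from inference logs with a few lines of pandas (the file and column names are hypothetical):

```python
# Feature monitoring for a categorical feature: daily share of each value.
import pandas as pd

logs = pd.read_parquet("inference_logs.parquet")   # hypothetical log export
logs["day"] = pd.to_datetime(logs["timestamp"]).dt.floor("D")

counts = logs.groupby(["day", "device_type"]).size()
shares = counts / counts.groupby(level="day").transform("sum")  # per-day shares

# One line per category value; sudden distribution shifts stand out visually.
shares.unstack("device_type").plot()
```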

Decision layer monitoring

Decision layer monitoring could also be called “business monitoring”; this is where one monitors the business decisions that are made as a result of the model output. If we go back to Examples 1 and 2 in the introduction of this blog post, it would mean monitoring the rate of loans denied over time and the rate of CC purchases denied over time, respectively.

Decision layer monitoring answers the question: “How are business decisions being impacted by model predictions?” This is crucial because it is where the actual impact of attacks shows up. Attackers exploit models as a means to an end – the end is always to force the system into a decision that is favorable to them.

Decision layer monitoring answers the question: How are business decisions being impacted by model predictions?

This is just the bare minimum monitoring setup. Other dimensions to consider include operational metrics (CPU, RAM, latency), performance metrics (F1, AUC, etc.), and train-serve skew; you should also add alerts on top of these.

For more monitoring considerations, see the last section in this post.

With the prerequisites out of the way, let’s dive into the tips for making ML models more robust against adversarial attacks.

Tip: Simple alerts help you detect attacks

While the goal is to prevent adversarial attacks, detection is crucial when prevention fails. For real-time ML, you need real-time detection systems that trigger alerts when certain conditions are met. Tools like Splunk, New Relic, and Datadog can be used to create alerts based on log queries.

Attackers know that they have to be quick; once they find a vulnerability they need to exploit it quickly before the targeted organization can defend itself against them. Therefore, alerts aimed at detecting adversarial attacks should try to detect sudden changes that could indicate an exploit is going on. 

Also, you want to avoid alert fatigue – i.e. a large number of false positives – so you should base alerts on actual business decisions rather than modeling artifacts: instead of alerts based on raw model scores or features, prefer alerts based on the downstream actions that are impacted by the models, such as loans being issued, fraud incidents detected, customer messages routed incorrectly, etc.

Here is one example of an alert that matches these criteria: “Raise an alert if the rate of denied CC purchases in the last 6 hours is more than 20% below the same figure last week”. This alert detects a sudden change (within 6 hours) in a business metric (denied CC transactions) and it uses a comparison with the previous week to account for seasonality.
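In code, such a rule could look like the minimal sketch below, where denied_rate is a hypothetical helper that queries your decision logs:

```python
# Sketch of the alert rule above; denied_rate is a hypothetical helper that
# returns the share of CC purchases denied in a given time window.
from datetime import datetime, timedelta, timezone

def denied_rate(start: datetime, end: datetime) -> float:
    """Hypothetical query against the decision logs (e.g. a warehouse table)."""
    ...

now = datetime.now(timezone.utc)
current = denied_rate(now - timedelta(hours=6), now)
week_ago = denied_rate(now - timedelta(days=7, hours=6), now - timedelta(days=7))

# Compare against the same 6-hour window one week earlier to account for seasonality.
if current < 0.8 * week_ago:
    print("ALERT: denied CC purchase rate dropped more than 20% vs. last week")
```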

One example of what to do once a true-positive alert goes off is to turn off a feature flag, as explained below. For more on alerting best practices, see Best Practices for Real-time Machine Learning: Alerting.

Tip: Frequent model retraining is the only sustainable solution at scale

User-facing ML-enabled systems will always be targets for attackers, especially in cases where there’s a possibility of financial gain. At the end of the day, attackers run a cost-benefit analysis when planning attacks, and your job as a practitioner is to increase the cost so much that it stops being worthwhile for most people to exploit your system; that, or hope attackers turn away from a life of crime and become upstanding members of the community instead. Right.

However, trying to fight each attack vector individually with handcrafted conditions eventually makes large-scale systems brittle and hard to maintain, defeating the whole purpose of letting data drive business decisions. It can turn into a game of whack-a-mole: you are constantly trying to defend systems against new exploits, attackers soon find new ways to circumvent your defenses, and the cycle never ends.

A few guardrails and a robust retraining schedule are a good strategy to defend against adversarial attacks at scale 

After you have some guardrails to prevent disasters, it’s probably better to accept that some exploits will take place — then use those as training data. Retraining models frequently allows you to incorporate exploit attempts into the training set for the next model iteration, making the model learn from these cases and grow more robust over time.

One especially useful practice is to automate model retraining, as it reduces the likelihood of human error and frees up resources to enhance the system.

Tip: Make trial-and-error cycles costlier for attackers

Attackers learn how to game ML-enabled systems through trial and error; they interact with the system repeatedly to learn its “quirks” and idiosyncrasies. Making it harder for them to learn increases the effort needed, and therefore decreases the relative incentive to engage in adversarial attacks.

Making it harder for would-be attackers to collect information about your models is a great way to mitigate adversarial risk; but make sure legitimate users aren’t harmed in the process. 

We will now list examples of how to make these learning cycles costlier; you should, however, keep in mind that these strategies can have unintended consequences, hurting the experience of legitimate users. Your specific use-case will dictate whether or not the tradeoff is worthwhile. 

Example 1: Limit the frequency of interactions

You can limit the number of times users can interact with your system in a given time period. For example, customers who want to request an insurance policy may only make one attempt per month. This would force ill-intentioned users to wait much longer to learn the system’s behavior, farm multiple accounts, and so on.

It’s also why operating systems force you to wait a few seconds before trying again after you type in the wrong password: this reduces the chance of brute-force attacks.
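A minimal sketch of such a frequency limit, using an in-memory dictionary as a stand-in for a real key-value store (Redis, DynamoDB, etc.); the function and cooldown are hypothetical:

```python
# Per-user frequency limit for a sensitive flow.
from datetime import datetime, timedelta, timezone

COOLDOWN = timedelta(days=30)                 # e.g. one policy request per month
_last_attempt: dict[str, datetime] = {}       # stand-in for a real key-value store

def can_request_policy(user_id: str) -> bool:
    now = datetime.now(timezone.utc)
    previous = _last_attempt.get(user_id)
    if previous is not None and now - previous < COOLDOWN:
        return False                          # deny attempts inside the cooldown window
    _last_attempt[user_id] = now
    return True
```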

Example 2: Delay the model decision 

The closer in time the user input is to the model response, the clearer the causation links will be, and the easier it will be for people to find patterns. Mitigate this by delaying the model feedback, instead of providing an immediate response.

For example, when customers apply for a mortgage via a mobile application, you can choose to inform them about the outcome only after a few hours (even if the outcome is immediately available).

Tip: You may need to trade off robustness for performance

Sometimes the potential impact of adversarial attacks is so high that the only available solution is to purposefully “dumb down” a model by actively reducing its sensitivity — effectively trading off some performance for enhanced robustness.

The simplest (not necessarily the best) solution to address exploitable features is to just drop them from the model.

Here are some ways to achieve this:

  • Remove vulnerable features altogether: The simplest way to eliminate risk is to drop the feature from the model.
  • Use proxies instead of real features: Sometimes we can replace a problematic feature with a proxy, i.e. some other piece of information that is highly correlated with the original feature but doesn’t expose the same attack surface. Examples include using coarser-grained variations of the actual values, or using lagged information obtained from an alternative, trusted source instead of relying on user-provided information.
  • Replace problematic feature values with neutral ones: During EDA and model training it may be possible to identify neutral feature values — that is, values for which the average impact on the model output is close to zero. You can use PD plots to identify which feature values are especially prone to exploits and then replace them at inference time with neutral values (see the sketch after this list).
  • Increase model regularization: Applying regularization, explained in detail in the next section, is another way to trade off performance for robustness, as regularized models often need to give up raw predictive power for better out-of-sample (OOS) generalizability.
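As an illustration of the third option above, here is a minimal sketch of neutralizing exploitable values at inference time; the feature name, neutral value, and exploitable range are hypothetical and would come from your own PD plots:

```python
# Replace exploitable values of a self-reported feature with a neutral value
# (one whose average impact on the model output is close to zero).
NEUTRAL_AGE = 45                    # hypothetical neutral value read off a PD plot
EXPLOITABLE_AGE_RANGE = (54, 120)   # hypothetical range attackers learned to abuse

def sanitize_features(features: dict) -> dict:
    age = features.get("age")
    if age is not None and EXPLOITABLE_AGE_RANGE[0] <= age <= EXPLOITABLE_AGE_RANGE[1]:
        return {**features, "age": NEUTRAL_AGE}
    return features
```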

Tip: Regularization is still your friend

Regularization is the process of making models less sensitive to feature changes. This is usually done to prevent overfitting, enhance generalization, and extend the shelf life of ML models, as regularized models tend to decay more slowly. But increasing regularization also helps make your model more robust against adversarial attacks.

There are at least two approaches to regularization with adversarial robustness in mind: the usual model-specific approaches (i.e. hyperparameters available for tuning in ML algorithms) and data-centric approaches (e.g. adding noise to your training dataset to make the model less sensitive to specific features).

Model-specific regularization

By model-specific regularization, we refer to techniques that can only be used for specific ML algorithms. Most often these are model hyperparameters that decrease the model capacity and therefore its ability to overfit. In Table 1 below we list some examples of hyperparameters we have found to be particularly useful:

Algorithm | Parameters
Gradient-boosted Trees | Column Sampling, Gamma, Learning Rate, Maximum Tree Depth, Minimum Child Weight, Number of Trees
Random Forest | Max Tree Depth, Number of Trees
Linear/Logistic Regression | L1/L2 Regularization
Neural Nets | Batch Normalization, Dropout, Early Stopping, L1/L2 Weight Regularization, Learning Rate
Table 1: Non-exhaustive list of hyperparameters used for regularization
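As an illustration, a more heavily regularized gradient-boosted tree model might be configured like the sketch below (XGBoost’s scikit-learn API; the values are illustrative starting points, not tuned settings):

```python
# Dialing up regularization on a gradient-boosted tree model.
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,       # number of trees
    learning_rate=0.05,     # smaller steps make the ensemble less overfit-prone
    max_depth=3,            # shallow trees limit model capacity
    min_child_weight=10,    # require more evidence before splitting
    gamma=1.0,              # minimum loss reduction required for a split
    colsample_bytree=0.7,   # column sampling per tree
    random_state=0,
)
```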

Data-centric regularization

There are also regularization techniques that involve data preprocessing of some kind. The main advantage of these is that they can be shared across multiple models and they can be made feature-specific: you can choose to regularize only some features, while keeping others untouched. This (together with the hyperparameters seen above) should force the model to trust those features less, without outright dropping them.

One simple way to achieve data-centric regularization is to add uncorrelated noise to your training set – particularly to the vulnerable features that you want the model to grow more robust against.

There are also “domain-specific” ways one can restrict the values of a feature. For example, if “age” is a feature in your model, you can enforce a valid range for the values during preprocessing time by using floor and ceiling functions. Even if attackers find out how to modify these values, the damage will be somewhat limited because they will be within the predefined range.
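A minimal sketch combining both ideas, adding noise to a vulnerable feature at training time and clipping another to a valid domain range (column names, noise scale, and bounds are hypothetical):

```python
# Data-centric regularization: noise on a vulnerable feature, plus a
# domain-specific floor/ceiling enforced during preprocessing.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def regularize_training_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Uncorrelated Gaussian noise on the feature we want the model to trust less.
    df["reported_income"] += rng.normal(0, df["reported_income"].std() * 0.1, len(df))
    return df

def enforce_valid_ranges(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Ages outside [18, 100] are clipped to the boundaries.
    df["age"] = df["age"].clip(lower=18, upper=100)
    return df
```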

Don’t forget to test this approach, however, to see if it works for you.

Tip: Use feature flags to quickly handle attacks

Even though we ultimately want to avoid adversarial attacks, one must be prepared to act quickly to limit the damage when and if they happen. Feature flags (sometimes called feature toggles) are one way to handle attacks quickly.

Feature flags are simple IF/ELSE conditions in the code that control which path runtime execution takes, essentially turning parts of the code on or off. This is shown in Figure 8 below:

Figure 8: An example of a 3-way feature flag.

Feature flags are usually controlled from outside the normal CI/CD flow — so that one can update them without the need for a full service deployment, which may take time and/or fail. A common strategy is to store feature flags as files in a remote storage system such as S3.

Here are some examples of how to use feature flags to react to an ongoing adversarial attack:

  • Make decision thresholds more strict: if a risky decision (e.g. issue a loan depending on a risk score) is made directly from a model prediction, you can make that decision stricter or more conservative temporarily. For example, require a much lower risk score than usual to approve a loan.
  • Disable the business flow entirely: depending on the importance of the business flow impacted by the attack, you may want to turn it off altogether even at the cost of some UX problems. 
  • Replace gameable features with neutral values: Hardcoding neutral feature values instead of using the actual user-provided information is another way to fend off attacks. You can find out what these values are using PD plots, as explained in the “Understand value-dependent impact for key features” section.
  • Use a fallback version of the model, without any gameable feature: Another alternative is to have a “fallback” model version which doesn’t use any potentially exploitable feature. Using a simpler model version may cause a performance hit, but it will likely be better than taking the impact of an attack head-on. This strategy is shown below as pseudocode in Listing 1.
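A minimal sketch of what such pseudocode might look like, assuming a JSON flag file in a hypothetical S3 bucket and two pre-loaded model objects (primary_model and fallback_model):

```python
# Sketch: an S3-hosted feature flag decides whether a fallback model (trained
# without gameable features) should be used. Bucket, key, and the model objects
# are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

def load_flags() -> dict:
    obj = s3.get_object(Bucket="ml-feature-flags", Key="credit_model_flags.json")
    return json.loads(obj["Body"].read())

def predict(features: dict, primary_model, fallback_model) -> float:
    flags = load_flags()
    if flags.get("use_fallback_model", False):
        # The fallback version ignores the potentially exploitable features.
        return fallback_model.predict(features)
    return primary_model.predict(features)
```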
Listing 1: Python pseudocode showing how one would employ S3-based feature flags to control whether a fallback model version should be used.

Feature flags are very useful in all sorts of other situations unrelated to ML too. They are an interesting topic in and of themselves.

Tip: Favor coarse-grained encoding for especially sensitive features

Sometimes you just cannot afford to drop vulnerable features from a model. This may happen due to regulatory requirements or with features that are so crucial to the model that removing them would render it useless.

One way to keep potentially exploitable features in a model (while reducing the attack surface a little bit) is to decrease their granularity. This usually makes the model slightly less accurate but provides a level of protection against attacks. 

For numerical features, this usually means grouping values into bins, or buckets. Several software packages support binning or “bucketization” as preprocessing steps in a feature engineering pipeline.

A feature’s granularity refers to how detailed it is. When there’s risk of attacks, it may help to favor low-granularity (coarse-grained) over high-granularity (fine-grained) features

With categorical features, decreasing granularity means reducing their dimensionality: either by grouping values together or by using methods such as PCA after encoding. For continuous and/or numeric features, on the other hand, you can use buckets (also called bins) instead of raw values.
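A minimal sketch of both kinds of coarse-graining with pandas (column names, bin edges, and the number of categories to keep are hypothetical):

```python
# Coarse-graining: bin a numeric feature and collapse rare categorical values.
import pandas as pd

def coarsen_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Numeric feature -> ordered buckets instead of raw values.
    df["income_bucket"] = pd.cut(
        df["monthly_income"],
        bins=[0, 2_000, 5_000, 10_000, float("inf")],
        labels=["low", "medium", "high", "very_high"],
    )
    # Categorical feature -> keep the most frequent values, group the rest.
    top_channels = df["acquisition_channel"].value_counts().nlargest(5).index
    df["acquisition_channel"] = df["acquisition_channel"].where(
        df["acquisition_channel"].isin(top_channels), other="other"
    )
    return df.drop(columns=["monthly_income"])
```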

Decreasing feature granularity helps in two ways:

  • It reduces the model’s dependence on that feature, in practice acting as regularization.
  • It reduces the chance that small perturbations to the feature value will change the model score – assuming the group boundaries aren’t crossed.

It’s important, however, to verify (with PD Plots for instance) that this approach works for your specific scenario. In some cases, the modified features still present an attack surface, particularly if ill-intentioned users can infer where the boundaries are.

Monitoring considerations

Monitoring is essential for detecting adversarial attacks in real-time ML, as explained above. Here are some practical tips to enhance your monitoring setup.

Monitor percentiles for feature values

Most monitoring tools focus on monitoring average values of numerical features. Though useful, averages hide information, particularly when decisions are made using predictions in the extremes, as is the case when working with risk predictions.

To avoid potential blind spots, you need to monitor feature percentiles too, especially those related to the extremes (percentiles 1, 10, 90 and 99), so you can visually detect weird deviations, which may indicate adversarial attacks are taking place.

Figure 9: A typical monitoring chart, showing aggregate values per month for a real-valued feature that ranges from 0 to 1. Something happened in July 2024, but it wouldn’t be very obvious if one only looked at the avg value line. When looking at the P90 and P99 percentiles, it’s clear there was a clean break in the value distribution but it only affected the upper percentiles.
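A minimal sketch of how such percentile series can be computed from inference logs with pandas (file and column names are hypothetical):

```python
# Percentile monitoring for a numeric feature, by month.
import pandas as pd

logs = pd.read_parquet("inference_logs.parquet")   # hypothetical log export
logs["month"] = pd.to_datetime(logs["timestamp"]).dt.to_period("M")

monthly = (
    logs.groupby("month")["risk_score"]
    .quantile([0.01, 0.10, 0.90, 0.99])
    .unstack()                                      # one column per percentile
)
monthly["mean"] = logs.groupby("month")["risk_score"].mean()

# Breaks that only show up in P90/P99 would be invisible in the mean line alone.
monthly.plot()
```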

Monitor business decisions

Although attackers work by trying to influence model features, their ultimate goal is to impact business decisions, so these must be monitored separately. Very often, attacks are subtle, but they are just strong enough to move the score over the decision boundary, affecting business decisions. In other words, you want to monitor the distribution of the decisions driven by the model at the end of the funnel. This is what we called “decision layer monitoring” in Prerequisite 2.

Figure 10 shows the claim approval rates for a fictitious car insurance company, where the decision to approve or deny an insurance claim is made using an ML model. The peak in April 2024 is reason for concern: it could be the result of attackers trying to engage in insurance fraud. However, the average prediction of the model driving the decisions doesn’t seem out of the ordinary; without decision layer monitoring, we would be blind to a potential attack.

Figure 10: On the left, a chart showing the rate of policy claims accepted by an insurance company. The peak in April 2024 could be indicative of an adversarial attack, but note that the corresponding score (on the right) didn’t budge. Without dedicated decision monitoring, this potential attack could have been missed.

Monitor by segment

Segmentation refers to dividing instances scored by a model into meaningful groups, for monitoring purposes. The main idea is that viewing aggregate metrics for all instances as a combined group hides important details: since the sample size is larger, metrics tend to be smoother, and it’s harder to detect issues that only affect a subset of the full population, especially if those are less common.

Examples of segments include:

  • Credit card purchase types: physical purchases, contactless purchases, online purchases
  • Customer types: low income, medium income, high income
  • Visitors on a website: those using browsers, mobile phones, tablets, etc.

Viewing metrics by segment is also useful for detecting adversarial attacks, as they may be targeted at specific subgroups rather than the full instance space as a whole. Figure 11 below shows how visualizing metrics by segment allows one to detect issues that would otherwise go unnoticed if only looking at the global aggregate values.

Figure 11: On this chart we can see that the global average of “Age” is mostly driven by Mobile Phone users, who are the most frequent users in the customer base. There was, however, a strange peak in the value of this feature, but only for Browser users; were it not for the segment-level lines, we would not be able to detect it – and a potential attack could go unnoticed.

Input attributes vs features

Many model pipelines include feature engineering or pre-processing on top of raw user-provided inputs before feeding them to the learning algorithm. It may be hard to detect targeted attacks if one only monitors the features, because there may be multiple processing steps between an input attribute and the feature it eventually becomes.

You should have separate monitoring assets for the input attribute values — in addition to monitoring features and predictions over time. 

Building effective, resilient real-time ML models

Real-time ML models are powerful but vulnerable to adversarial attacks, where malicious actors manipulate features to influence outcomes. To protect your models, focus on understanding feature importance with tools like SHAP and Partial Dependence Plots, and set up active monitoring to detect unusual patterns. Use real-time alerts based on business metrics to catch attacks early.

Frequent model retraining helps your system learn from adversarial attempts, while strategies like limiting interactions, delaying decisions, or adding noise make exploitation harder. In some cases, trading off performance for robustness—by removing vulnerable features or increasing regularization—can significantly reduce risks. Tools like feature flags let you respond quickly, and reducing feature granularity minimizes the attack surface.

By combining these strategies, you can build real-time ML models that are both effective and resilient. The goal isn’t to stop all attacks but to make them so costly that attackers move on.

If you want to strengthen your ML systems, keep reading our blog for more insights. And if you’re passionate about building cutting-edge ML solutions, we’re hiring! Join us in building the Purple Future.

Check our job opportunities