Author: Taylor Foust

The work described here is a collaborative effort by many engineers at Nubank (alphabetical): Austin McEver, Brayan Garzon, Daniel Braithwait, Fábio Souza, Gabriel Gandour, Gustavo Vieira, Helder Dias, Hiroto Udagawa, José Mora, Lucas Costa, Marcelo Buga, Matheus Ramos, Neriton Tolentino and Stelios Karvanis. We also thank Rohan Ramanath, Daniel Silva, and Guilherme Tanure for their support.


Introduction

In our previous blog posts, we introduced how Nubank uses foundation models to learn optimal feature representations for predictive tasks. Specifically, we demonstrated how bank transaction data can be represented as a sequence of tokens constructed from transaction attributes (e.g., date, amount, description). Transactions are combined with other events and background information to form a user representation (or narrative), which we then tokenize and pass to our foundation models. The foundation models then convert the sequence of tokens into feature representations optimized for their learning tasks. These user narratives are not fixed – we need to choose what information to include and how to represent it in a way that is maximally useful for our foundation models.

In the remainder of this post, we show that constructing these user narratives can be thought of as a hyperparameter search across two primary dimensions: which sources of information to include and how to represent the events that comprise each source. We present our experimental approach to exploring this hyperparameter space, and show that this process dramatically reduces the effort required to incorporate new data sources, eliminates guesswork, and improves model performance.


Including a Transaction in a User Narrative: How do we represent a transaction?

Let’s introduce the process with a simple example: a user makes a credit card purchase. In its simplest form, we could represent this event with a single token, <credit-card-purchase>, but there is a lot more to the story.

Figure 1

When did the purchase occur? With what merchant was the purchase made? Was the purchase made in-person or online? How much money was spent? Did the purchase succeed? If the purchase failed, for what reason? What is the account balance after the purchase?

Intuitively, one might think that using as much information as possible in a transaction representation would lead to optimal model performance – the model can just learn to ignore the irrelevant information. However, we have found through experimentation that adding tokens (even ones that seem like they would be informative) does not always improve model performance.

Our models, like all transformer architectures, have limited context length due to the quadratic scaling properties of attention, even with the use of efficient algorithms like FlashAttention. This means that when the context window is full, any additional information you pass will result in other information being forced out. So we need to ensure that each token that occupies space in the context window is earning its spot – providing information that is useful to the modeling task and not repeated in other tokens. In a sense, we can think of this process as editing a user document – ensuring wording is concise and that we only keep the necessary information.
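To make the trade-off concrete, here is a back-of-the-envelope sketch of the context-window budget. The numbers (context length, tokens per transaction) are hypothetical, not Nubank's actual figures, but they illustrate how a few extra tokens per event crowds out whole transactions:

```python
# Hypothetical illustration of the context-window trade-off: with a fixed
# token budget, every extra token per transaction crowds out whole events.
CONTEXT_LENGTH = 8192  # assumed context window, in tokens


def transactions_that_fit(tokens_per_transaction: int) -> int:
    """How many transactions fit in the context window at a given cost each."""
    return CONTEXT_LENGTH // tokens_per_transaction


lean = transactions_that_fit(16)  # e.g. amount + date + short description
rich = transactions_that_fit(24)  # same, plus several extra attribute tokens

print(lean, rich)  # 512 vs 341: eight extra tokens per event drop ~170 events
```

In other words, eight extra tokens per transaction costs roughly a third of the user's history at these assumed sizes, so each added attribute must earn back that loss in predictive value.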

Figure 2

Optimizing Transaction Representations

Our objective in creating user documents is twofold: include as much useful information as possible and represent that information in as few tokens as possible. So, how do we know if a given token is worth including? We determine the effectiveness of token selection empirically, by testing candidate representations against offline evaluation metrics.

To help illustrate the process, take the following example user: abc-123. They have four transactions: a transfer in and three credit card transactions. These are depicted in Figure 3 below:

Figure 3: example transactions for user abc-123

For each of these transactions, there are a number of attributes that may be useful to our models: the amount of the transaction, the date of the transaction (year, day, weekday, month, etc.), the status of the transaction, the source of the transaction, a description of the transaction, the denial reason (if applicable) and the entry mode (how the transaction was made). For each of these attributes, we create pre-processing modules that add the required new tokens to the transaction representation. Many of these modules add special tokens (amount ranges, whether the amount was paid or received, date attributes, transaction status), while others (like description) are passed to the tokenizer as raw text and tokenized with the BPE algorithm.
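A minimal sketch of what one such pre-processing module might look like (the class, function, and token names here are illustrative assumptions, not Nubank's actual API): the module maps a single transaction attribute to a short list of special tokens.

```python
# Sketch of a pre-processing module (all names are illustrative): each
# module converts one transaction attribute into special tokens.
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float  # negative = paid, positive = received
    date: str      # ISO date, e.g. "2024-03-15"
    description: str


def amount_preprocessor(tx: Transaction) -> list[str]:
    """Emit special tokens for payment direction and a coarse amount bucket."""
    direction = "<received>" if tx.amount >= 0 else "<paid>"
    magnitude = abs(tx.amount)
    if magnitude < 10:
        bucket = "<amount-0-10>"
    elif magnitude < 100:
        bucket = "<amount-10-100>"
    else:
        bucket = "<amount-100+>"
    return [direction, bucket]


tx = Transaction(amount=-42.50, date="2024-03-15", description="coffee shop")
print(amount_preprocessor(tx))  # ['<paid>', '<amount-10-100>']
```

Free-text attributes like the description would skip this special-token path and go straight to the BPE tokenizer, as described above.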

For this experiment, we generate a number of different pre-processing modules: amount, date, description, source, status, denial_reason and entry_mode, which, when selected, will alter the representation of transactions feeding into the model. We can then select any combination of these pre-processing modules in model initialization, train a model, and generate evaluation metrics on a test set.
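The trial enumeration described above can be sketched as follows. The training and evaluation steps are stubbed out, and the module names mirror the list in the text; everything else is an assumption for illustration:

```python
# Sketch of the hyperparameter search over pre-processor combinations
# (train/evaluate are stubbed; names are illustrative).
from itertools import combinations

BASE = ["amount", "date", "description"]
OPTIONAL = ["source", "status", "denial_reason", "entry_mode"]


def run_trial(pre_processors):
    # In the real pipeline this would tokenize transactions with the chosen
    # modules, train a foundation model, and score it on a holdout set.
    ...


# Every subset of the optional modules, always on top of the base set.
trials = [BASE + list(extra)
          for k in range(len(OPTIONAL) + 1)
          for extra in combinations(OPTIONAL, k)]

print(len(trials))  # 16 candidate pre-processor sets for 4 optional modules
```

Even with a handful of optional modules the space grows quickly (2^n subsets), which is why we treat this as a hyperparameter search rather than trying every combination by hand.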

For trial 1, we initialize our models with pre_processors = [amount, date, description]. A tokenized version of transaction 4 is depicted in Figure 4. For illustration purposes, we show the text that each token represents (between the angle bracket pairs).

Figure 4

For trial 2, pre_processors = [amount, date, description, source]. The representation becomes:

Figure 5

For trial 3, pre_processors = [amount, date, description, status]:

Figure 6

For trial 4, pre_processors = [amount, date, description, entry_mode]:

Figure 7

For trial 5, pre_processors = [amount, date, description, denial_reason]:

Figure 8

For trial 6, pre_processors = [amount, date, description, status, denial_reason]:

Figure 9

We train a model for each trial and evaluate performance on our holdout set. In practice there may be many more trials (combinations of pre-processors) than we listed above. Sample results are depicted in Figure 10 below:

Figure 10

Though limited in scope for purposes of illustration, this experiment found an optimized way of representing transactions, a key component of the user narrative. We first created re-usable pre-processors that convert transaction attributes into tokens, and then performed a hyperparameter search to determine the optimal mix of these pre-processors. In this case we found that more information did not always lead to better performance: sometimes the inclusion of certain pre-processors (or combinations thereof) degraded performance, often due to the finite context window.

What Sources of Information to Include?

In the examples above, we only discussed transfer and credit-card data sources, but Nubank has many more transaction sources, as well as others that are non-transactional. When assessing new data sources, we employ the same experimental approach to verify that the added data enhances our user representations and improves our benchmarks. For some sources, however, certain tokens may not be useful, and others may require the addition of new tokens, so we built our framework to enable each source to use separate pre-processing logic.

Nubank’s Money Boxes are investment accounts in which users can allocate funds to investments of their choice – often with a particular savings goal in mind. The structure of these transactions is similar to what we explored earlier with transfers and credit card transactions. We see amounts moving in and out, and typically, the description is the user-specified savings goal. A sample of this data is depicted in Figure 11 below.

Figure 11

Using the same pre-processing modules that we used on the previous transaction sources will capture a good amount of information (amount, date, description, source and status). But there are a few other fields that are unique, too, and potentially useful: transaction_type, investment_type and account_balance. We can create more modules that process this information into special tokens, and test the inclusion of the Money Box source both with the standard pre-processors and with the new pre-processors. Running these experiments is nearly identical to the tests we demonstrated previously, but with a dataset containing the Money Box transactions specified, and with our new pre-processor modules activated for that source. This gives us the following additional trials:

Trial 7: pre_processors = [amount, date, description, status, denial_reason], dataset="transactions_with_money_box"

Trial 8: pre_processors = [amount, date, description, status, denial_reason], dataset="transactions_with_money_box", pre_processor_overrides={"MONEY BOX": [amount, date, description, status, transaction_type, investment_type, account_balance]}

For trial 8, we would add to our user document an entry like the following:

Figure 12
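The per-source override mechanism used in trial 8 can be sketched as a simple dispatch on the transaction's source. The function and variable names here are illustrative assumptions, not Nubank's actual framework:

```python
# Sketch of per-source pre-processor overrides (illustrative names): most
# sources share the default module list, while sources like Money Box
# swap in their own attribute modules.
DEFAULT_PRE_PROCESSORS = ["amount", "date", "description", "status",
                          "denial_reason"]

PRE_PROCESSOR_OVERRIDES = {
    "MONEY BOX": ["amount", "date", "description", "status",
                  "transaction_type", "investment_type", "account_balance"],
}


def pre_processors_for(source: str) -> list[str]:
    """Pick the pre-processing module list for a transaction's source."""
    return PRE_PROCESSOR_OVERRIDES.get(source, DEFAULT_PRE_PROCESSORS)


print(pre_processors_for("MONEY BOX")[-1])   # account_balance
print(pre_processors_for("CREDIT CARD")[0])  # amount
```

This keeps the default path untouched while letting each new source declare exactly the attribute modules it needs, which is what makes adding a source a configuration change rather than a pipeline rewrite.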

Similarly, for loan-related transactions, there are a number of additional attributes that we pass to the model that are not relevant to the other transaction sources we've mentioned so far. Loans have associated interest rates, remaining balances, number of payments made, etc. In order to leverage this important information without filling the context window with tokens that are irrelevant for other sources, we allow each source to extract only the tokens that it needs.

To represent a loan payment with the inclusion of interest_rate, number_of_payments_made and remaining_balance pre-processors, our transaction might look like the following:

Figure 13

We include a variety of other data sources that are even more divergent from typical transactions. For instance, in our Brazilian foundation models, we incorporate Credit Information System (SCR) data. This data is structured quite differently from transactions, taking the form of monthly updates to credit-related metrics for users. It can provide valuable information on the history of users before they joined Nubank, and can offer a glimpse into the financial lives of our customers outside of Nubank. This helps to ensure that our models work robustly for new customers and those that use Nubank services more sparingly. After pre-processing, this data might take a form similar to the following:

Figure 14

Having the flexibility to easily accommodate different data formats is crucial for the speed of experimentation and enables us to fill in the gaps in our user understanding. As before, adding these new sources into our model and testing their impact on model performance simply involves creating the pre-processing module, selecting the dataset with the new source available and activating the pre-processing module. By optimizing these hyperparameters we improve user representations and realize gains in model performance. We depict the observed model performance in Figure 15 below.

Figure 15

Writing User Documents

Putting together the data sources we talked about, the document for our example user takes a form similar to the following: 

Figure 16

Although it’s not a great read for a person, we have validated through experimentation that our models like it this way. This representation allows our models to understand the journey of this user from before they became a Nubank customer and through the adoption of a number of Nubank products. The model can understand the user’s financial situation, how it has changed over time, their spending and savings habits, their financial goals and setbacks, and the Nubank products they use to enhance their lives. Crucially, our role is not to come up with the final representation of the data for the model, but rather to present the data in a way that lets the model optimally extract the information it needs and build the features that help it solve its objectives.

The selection and representation of data are critically important to every machine learning application, and among the most time-consuming parts of the process. It can take machine learning practitioners weeks to evaluate a new data source, figure out how to derive tabular features from it, clean and pre-process these features, and finally train a new model on this data and run it through evaluation.

With the foundation models at Nubank, we have simplified this process into a matter of augmenting the ways we write user documents. As we have shown, there is still nuance in this process and thinking is required, but the process is much simpler, faster, more flexible and comes with quantifiable feedback in terms of model evaluation performance. We are able to go from a raw data source to test results in just a couple of days. And the representations of data do not have to conform to the typical constraints of tabular machine learning. As researchers at Nubank, we just have to consider: what information could possibly enrich the narrative we have for this user, and how can we present this information to our models in an efficient and useful way? We then present a variety of options and allow the model to select the representation it likes best.

