Authors: Daniel Braithwaite and Hiroto Udagawa

The work described here is a collaborative effort by many engineers at Nubank (alphabetical): Abhishek Shivanna, Arissa Yoshida, Austin McEver, Brian Zanfelice, Cristiano Breuel, Evan Wingert, Fabio Souza, Felipe Meneses, Helder Dias, Henrique Fernandes, Liam O’Neill, Marcelo Buga, Matheus Ramos, and Misael Cavalcanti. We also thank Rohan Ramanath, Daniel Silva, and Guilherme Tanure for their support.


This is the second part of a series of blog posts about modeling customer finances through foundation models. For an introduction to the problem, read our first blog post.

Our previous post, “Understanding Our Customers’ Finances Through Foundation Models,” introduced how Nubank leverages foundation models to build user representations from transaction data. We discussed how these models, trained on vast amounts of unlabeled data using self-supervised learning, can automatically discover general features and create informative embeddings of customer financial behavior. This approach moves beyond traditional tabular feature engineering from transaction data sources, aiming to improve our understanding of customer needs.

We use transformers [1] for our sequence-to-sequence backbone because they are currently the dominant architecture for sequence modeling problems. In addition, they capture long-range relationships in the input sequence better than RNNs [2, 3]. This may be especially useful for transaction data, given the seasonal and cyclical variations in spending habits (e.g., people might spend more money during the holiday season).

Transformers operate on sequences of embeddings. Hence, we need to define a process for converting a user’s transactions into embeddings that a transformer can process. Here, we introduce one way to formulate this interface, which, in essence, converts our transaction modeling problem into something we can solve with standard natural language techniques.

For this post, we will assume a transaction consists of three attributes: the amount represented as a floating point number, the date represented as a timestamp, and a description represented as a string. While this setup is simplistic, in practice, we can build representations and encoders for the many unique attributes of transactions, such as merchant ID, location, category, transaction denial reason, etc.

One straightforward option for building this interface is to assign an ID to each unique transaction, similar to the sequential recommendation literature, e.g., SASRec [4]. For example, we can map each unique combination of description, date, and amount to a different ID and then convert these IDs into embeddings using an embedding lookup table (learned simultaneously with the transformer). However, this has two key disadvantages. Firstly, the number of possible transaction configurations is very large, meaning we would need a massive ID space. It’s possible to reduce this space by using something like the merchant ID in place of the description, but this removes potentially important information and relies on additional pre-processing dependencies, making the models more brittle. Secondly, this approach also suffers from the cold start problem, where the model cannot handle transactions not seen during training.
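To make the two disadvantages concrete, here is a minimal sketch of an ID-based interface. The class name `TransactionIdVocab`, the reserved `UNK_ID`, and the sample transactions are all hypothetical and are not taken from our production system; in a real model, each ID would index a row of a learned embedding table.

```python
from typing import Dict, Iterable, Tuple

# (description, date, amount); a simplified, hypothetical transaction record.
Transaction = Tuple[str, str, float]


class TransactionIdVocab:
    """Maps each unique (description, date, amount) combination to an integer ID."""

    UNK_ID = 0  # assumed fallback ID for transactions never seen during training

    def __init__(self) -> None:
        self._ids: Dict[Transaction, int] = {}

    def fit(self, transactions: Iterable[Transaction]) -> None:
        for tx in transactions:
            if tx not in self._ids:
                # IDs start at 1 because 0 is reserved for unknown transactions.
                self._ids[tx] = len(self._ids) + 1

    def encode(self, tx: Transaction) -> int:
        return self._ids.get(tx, self.UNK_ID)

    def __len__(self) -> int:
        # Number of rows the embedding table would need (including UNK).
        return len(self._ids) + 1


train = [
    ("NETFLIX.com", "2023-12-05", 32.40),
    ("NETFLIX.com", "2024-01-05", 32.40),  # same merchant, new date: a new ID
    ("UBER TRIP", "2023-12-05", 11.90),
]
vocab = TransactionIdVocab()
vocab.fit(train)

# A transaction that was not in the training data collapses to UNK_ID.
unseen_id = vocab.encode(("NETFLIX.com", "2024-02-05", 32.40))
```

Even in this toy example, the same subscription charged on two dates consumes two IDs, and next month's charge maps to the unknown ID, which illustrates both the ID-space growth and the cold start problem.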

Another potential interface between transactions and transformers is to follow the text-is-all-you-need [5] approach, which, in the case of product recommendation, constructs item strings by concatenating each attribute name (e.g., description, amount, date) with its attribute value (e.g., “NETFLIX.com”, $32.40, 12/05/2023, respectively). The same process can be applied to transaction data. These strings can then be treated as natural language and converted to embeddings using a tokenizer and an embedding table. This allows the model to generalize to unseen transactions and even leverage existing LLMs without any modifications. However, it also means each transaction is represented by many tokens, which is a problem because attention compute scales quadratically with the context length.
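As a rough illustration, the snippet below builds such a string and shows why the token count matters. The exact "name value" layout and the helper names `transaction_to_text` and `relative_attention_cost` are assumptions made for this sketch; the cost function only counts pairwise attention scores and ignores constant factors.

```python
def transaction_to_text(description: str, amount: float, date: str) -> str:
    """Flatten a transaction into alternating attribute names and values."""
    attributes = [
        ("description", description),
        ("amount", f"${amount:.2f}"),
        ("date", date),
    ]
    return " ".join(f"{name} {value}" for name, value in attributes)


def relative_attention_cost(tokens_per_transaction: int, num_transactions: int) -> int:
    """Number of pairwise attention scores for a full (dense) attention matrix."""
    context_length = tokens_per_transaction * num_transactions
    return context_length ** 2


text = transaction_to_text("NETFLIX.com", 32.40, "12/05/2023")
# text == "description NETFLIX.com amount $32.40 date 12/05/2023"

# Doubling the tokens spent per transaction quadruples the attention cost
# for the same number of transactions.
cost_ratio = relative_attention_cost(20, 100) / relative_attention_cost(10, 100)
```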

The figure above gives an example of the two aforementioned transaction-to-embedding interfaces. For our foundation models, however, we chose a modified version of text-is-all-you-need that represents numerical and categorical features using special tokens (numerical features are first quantized into bins to make them categorical). For the experiments in this blog post, we represent a transaction as follows:

  1. Amount Sign: A token representing whether the transaction amount is positive or negative.
  2. Amount Bucket: The absolute amounts are binned, and a separate token is assigned to each bin.
  3. Month, Day, and Weekday: Each is represented with its own token.
  4. Description: The text is tokenized as natural language using a standard tokenizer (e.g., BPE).
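The four components above can be sketched as follows. This is only an illustration: the bin edges, the special-token spellings (such as `<AMT_BIN_1>`), and the whitespace-based `tokenize_description` stand-in for a real BPE tokenizer are all assumptions, not the values used in our models.

```python
import bisect
from datetime import datetime
from typing import List

# Hypothetical bin edges for absolute amounts; the real edges are not given here.
AMOUNT_BIN_EDGES = [10.0, 50.0, 100.0, 500.0, 1000.0]


def tokenize_description(text: str) -> List[str]:
    """Stand-in for a subword tokenizer such as BPE (lowercase + whitespace split)."""
    return text.lower().split()


def transaction_to_tokens(amount: float, timestamp: datetime, description: str) -> List[str]:
    # 1. Amount sign: one token for positive, one for negative amounts.
    sign = "<AMT_POS>" if amount >= 0 else "<AMT_NEG>"
    # 2. Amount bucket: index of the bin that contains the absolute amount.
    bucket = bisect.bisect_right(AMOUNT_BIN_EDGES, abs(amount))
    return [
        sign,
        f"<AMT_BIN_{bucket}>",
        # 3. Month, day, and weekday each get their own special token.
        f"<MONTH_{timestamp.month}>",
        f"<DAY_{timestamp.day}>",
        f"<WEEKDAY_{timestamp.weekday()}>",  # Monday is 0 in Python
        # 4. Description is tokenized as ordinary text.
        *tokenize_description(description),
    ]


tokens = transaction_to_tokens(-32.40, datetime(2023, 12, 5), "NETFLIX.com")
# ['<AMT_NEG>', '<AMT_BIN_1>', '<MONTH_12>', '<DAY_5>', '<WEEKDAY_1>', 'netflix.com']
```

In this sketch, the amount and date cost five tokens in total regardless of their values, whereas spelling them out as text would typically cost more once split into subwords.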

This representation also makes it straightforward to include additional categorical or numerical features. Moreover, using special tokens requires fewer tokens per transaction than a fully text-based text-is-all-you-need approach. A complete example is shown in the following figure:

Now that we have a procedure for representing a transaction as a string and tokenizing it, we can apply this to a member’s account by simply concatenating the transaction strings with intermediate separator tokens. This string is truncated once we reach a predefined context length limit.
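A minimal sketch of this concatenation step is shown below. The separator spelling `<SEP>` and the function name `build_account_sequence` are hypothetical, and the choice to keep the first `max_len` tokens is an assumption made for simplicity; one could equally keep the most recent tokens instead.

```python
from typing import List

SEP = "<SEP>"  # assumed spelling of the separator token


def build_account_sequence(transactions: List[List[str]], max_len: int) -> List[str]:
    """Join per-transaction token lists with separators, then truncate."""
    sequence: List[str] = []
    for i, tx_tokens in enumerate(transactions):
        if i > 0:
            sequence.append(SEP)  # separator goes between transactions only
        sequence.extend(tx_tokens)
    # Truncate once the predefined context length limit is reached.
    return sequence[:max_len]


account = [["a", "b"], ["c"], ["d", "e"]]
full = build_account_sequence(account, max_len=100)
truncated = build_account_sequence(account, max_len=4)
```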

Finally, we pre-train transformers using standard language modeling tasks. For example, as with transformers trained on natural language, transaction tokens are embedded using a lookup table, and we train using a standard self-supervised task like next token prediction (NTP) or masked language modeling (MLM).
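To illustrate the two objectives, the sketch below shows how training inputs and targets could be derived from a token sequence. The `<MASK>` spelling, the 15% masking rate (a common convention from BERT-style training), and the function names are assumptions for this example; it also omits refinements such as random-token replacement, and it does not include the model or the loss computation.

```python
import random
from typing import List, Optional, Tuple

MASK = "<MASK>"  # assumed spelling of the mask token


def next_token_example(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """NTP: the target at each position is the token that follows it."""
    return tokens[:-1], tokens[1:]


def masked_lm_example(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """MLM: hide a random subset of tokens and predict only the hidden ones."""
    rng = random.Random(seed)
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(token)  # the model must recover the original token
        else:
            inputs.append(token)
            targets.append(None)  # position is not scored by the loss
    return inputs, targets


sequence = ["<AMT_NEG>", "<AMT_BIN_1>", "<MONTH_12>", "<DAY_5>", "<WEEKDAY_1>", "netflix.com"]
ntp_inputs, ntp_targets = next_token_example(sequence)
mlm_inputs, mlm_targets = masked_lm_example(sequence, mask_prob=0.5)
```

NTP pairs naturally with causal attention, since each position only sees earlier tokens, while MLM pairs with bidirectional attention, since the model can use context on both sides of a masked position.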

Of course, the class of transformer models is very large and requires a great deal of exploration. The figure below shows the average performance of the unsupervised embeddings (i.e., embeddings from the pre-trained model) across four benchmark recommender system tasks as we adjusted the model configuration. In the interest of experimentation velocity (and hardware availability), we chose a short context length of 1024, denoted CL.

  • Model 1 (Baseline): Bidirectional attention with the MLM task. 
  • Chunked Model 1: Artificially extends the context length by running the same model over multiple independent blocks and averaging the embeddings (8 * CL token chunks). This produces an improvement of 1.64 points over the baseline.
  • Model 2: Uses a sparse attention mechanism to extend the context length further to 2 * CL without getting out-of-memory errors. This yields an improvement of 1.8 points over the baseline.
  • Chunked Model 2: Applies the same chunking procedure to Model 2. This yields an improvement of 2.78 points over the baseline.
  • Model 3 SM: Switches to a causal attention setup and adds NoPE [6] (i.e., removes positional embeddings because causal models learn their own form of position information) and FlashAttention [7] (a more efficient implementation of the attention operation with better scaling properties), facilitating a context length of 8 * CL. This yields an improvement in performance of 3.93 points over the baseline.
  • Model 3 LG: Increases the parameter count by a factor of 4 and yields an improvement in performance of 7.2 points over the baseline.
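The chunking idea used by the chunked variants can be sketched as follows. This is an illustration under stated assumptions: `embed_fn` stands for any model that maps a block of tokens to a fixed-size vector, the toy embedding function exists only to make the example runnable, and an unweighted mean over chunks is assumed.

```python
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[int], chunk_len: int) -> List[Sequence[int]]:
    """Split a long sequence into consecutive, non-overlapping blocks."""
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]


def chunked_embedding(
    tokens: Sequence[int],
    chunk_len: int,
    embed_fn: Callable[[Sequence[int]], List[float]],
) -> List[float]:
    """Embed each block independently, then average the block embeddings."""
    if not tokens:
        raise ValueError("cannot embed an empty sequence")
    embeddings = [embed_fn(chunk) for chunk in split_into_chunks(tokens, chunk_len)]
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


def toy_embed(chunk: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for the transformer: two simple summary features."""
    return [float(len(chunk)), float(sum(chunk))]


token_ids = list(range(10))
# Chunks: [0..3], [4..7], [8, 9] -> embeddings [4, 6], [4, 22], [2, 17]
user_embedding = chunked_embedding(token_ids, chunk_len=4, embed_fn=toy_embed)
```

Because each block is processed independently, the model never attends across block boundaries, which is the trade-off compared with a model that natively supports a longer context.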

Finally, Fine-tuned LG corresponds to applying an additional fine-tuning procedure to Model 3 LG, where the transformer is trained to predict the label directly. The fine-tuned model improves by 9 points over the baseline.

For context, a relative improvement of 1 to 1.25 points on these tasks is significant enough to release a new model.

In this blog post, we have reduced the problem of modeling sequences of user transactions to standard natural language methods. Despite its simplicity, this approach generates informative embeddings that add value to numerous downstream tasks across Nubank. We also saw a preview of the advantages of fine-tuning. 

This blog post is the second in a series that focuses on Nubank’s use of transaction foundation models. These models allow us to understand our members’ finances and meet their needs at the right time. In the next post, we will discuss our approach to joint fusion, which allows us to fine-tune these transaction foundation models for specific tasks and incorporate existing tabular feature solutions. We also plan to explore other aspects of our foundation modeling setup in later blog posts.

Series Summary

If you’ve made it this far, take a moment to check out the rest of the blog series for more context and technical depth.

  • In the first blog post, we evaluated the potential of foundation models for transaction data, showing how self-supervised learning can generate general-purpose embeddings that capture customer behavior without relying on labeled data.
  • In the second blog post, we went deeper into the technical formulation of our foundation models, detailing the causal transformer architecture and how these general embeddings can be applied across multiple downstream tasks.
  • In the third blog post, we explored how to improve task-specific performance through supervised fine-tuning and introduced joint fusion—a method for combining sequential transaction data with tabular features in a single, end-to-end training process.

Check our job opportunities

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[2] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985, September). Learning internal representations by error propagation (Tech. Rep. ICS 8504). Institute for Cognitive Science, University of California, San Diego.

[3] Jordan, M. I. (1986, May). Serial order: A parallel distributed processing approach (Tech. Rep. ICS 8604). Institute for Cognitive Science, University of California, San Diego.

[4] Kang, W. C., & McAuley, J. (2018, November). Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM) (pp. 197-206). IEEE.

[5] Li, J., Wang, M., Li, J., Fu, J., Shen, X., Shang, J., & McAuley, J. (2023, August). Text is all you need: Learning language representations for sequential recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 1258-1267).

[6] Kazemnejad, A., Padhi, I., Natesan Ramamurthy, K., Das, P., & Reddy, S. (2023). The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 24892-24928.

[7] Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35, 16344-16359.
