Authors: Daniel Braithwaite and Hiroto Udagawa
The work described here is a collaborative effort by many engineers at Nubank (alphabetical): Abhishek Shivanna, Arissa Yoshida, Austin McEver, Cristiano Breuel, Brian Zanfelice, Evan Wingert, Fabio Souza, Felipe Meneses, Helder Dias, Henrique Fernandes, Liam O’Neill, Marcelo Buga, Matheus Ramos, and Misael Cavalcanti. We also thank Rohan Ramanath, Daniel Silva, and Guilherme Tanure for their support.
Translation Reviewers: Felipe Almeida and Kevin Rossell
Predictive models are the underpinnings of many critical systems within financial institutions, e.g., risk prediction, fraud detection, and product recommendations. To power these predictions, most digital banking platforms have access to large amounts of user data, ranging from bank transactions and in-app events to chat logs with customer support. Combined, these sources can give us rich insights into what our customers need from their trusted financial institution. Historically, however, these data sources have been used to extract useful but relatively simple features for the aforementioned predictive tasks. In this post, we propose developing foundation models for financial data, specifically transactions. These foundation models automate the discovery of general features from transactions, and these general features are useful for solving many tasks across Nubank.
Traditional ML models have been built for tabular features (e.g., numerical, categorical), which have become the norm for industry machine learning systems due to their simplicity, interpretability, and robustness. While these approaches work well in practice, designing the tabular features is labor-intensive and requires much trial and error. In other fields, however, ML has advanced towards learning representations directly from the raw data for supervised learning tasks. One common example is that convolutional neural networks automatically learn features like edges, textures, and shapes from raw images [1, 2]. Such a setup automatically learns the features needed to solve the supervised learning task, hence avoiding the need for manual feature engineering. Despite the existence of these more advanced modeling techniques in other domains (language, vision, sequential recommender systems, etc.), most financial industry applications of ML have lagged behind.
One of the most significant recent trends in machine learning is the notion of foundation models, which learn generic embedding representations from raw data such as text [4], images [3], and events [5]. These models are trained on massive amounts of unlabeled data and leverage self-supervised learning (SSL), which involves implicitly constructing pseudo labels from the data, e.g., predicting the next words in a sentence. Using SSL allows foundation models to learn informative representations of the inputs without explicit labels. These representations can then solve diverse downstream tasks with greater accuracy, all relying on the same base model. This is in contrast to manually engineered features or those learned with supervised techniques, both of which are often problem-dependent.
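To make the idea of pseudo labels concrete, here is a minimal sketch (an illustration only, not our training code) of how next-token targets can be constructed: the label at each position is simply the element that follows it, so no human annotation is required.

```python
import torch

def make_next_token_targets(token_ids: torch.Tensor):
    """Build self-supervised (input, target) pairs by shifting a sequence.

    token_ids: (batch, seq_len) integer ids of words or events. The target
    at position t is the token at position t + 1, so the "labels" come for
    free from the data itself.
    """
    inputs = token_ids[:, :-1]   # the model sees tokens 0 .. T-2
    targets = token_ids[:, 1:]   # and must predict tokens 1 .. T-1
    return inputs, targets

# Example: a batch with two short sequences
batch = torch.tensor([[5, 17, 42, 8], [3, 9, 11, 27]])
x, y = make_next_token_targets(batch)
# x = [[5, 17, 42], [3, 9, 11]];  y = [[17, 42, 8], [9, 11, 27]]
```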
Interestingly, scaling foundation models can also result in emergent properties. For example, large language models learn to perform tasks like question answering or text summarization simply by observing natural language [6]. As a result, we hypothesize that by building foundation models from bank transactions and other data sources within a financial institution, we can understand our customers beyond the capabilities of existing methods.
At Nubank, we are developing foundation models from scratch to allow teams to unlock the signals from the vast amounts of financial data that customers produce daily. Furthermore, we have developed an in-house AI platform to extend these models beyond transactions, considering all user interactions (e.g. app events) and new transaction streams. Teams across Nubank can leverage a central repository of foundational models and fine-tuning pipelines to solve their specific tasks.
In this post, we explore foundation models, specifically within the context of transaction data. Despite the success of foundation models in other fields, we have found limited publicly available work in our domain of interest. Moreover, in the available literature [7, 8], the scale is not close to the volume of data we have available at Nubank. For example, [8] uses billions of transactions, whereas we have access to trillions of transactions and events across Nubank’s 100M+ customers. As mentioned, this is important because data volume is essential for discovering the emergent properties from large foundation models.
Our goal is to ingest a customer’s time-ordered transactions and represent their financial behavior as an embedding. Each transaction is represented by text along with numerical and categorical attributes. As is common in other domains like natural language, images, and audio, we find it possible to efficiently summarize customer behavior by learning to predict their future transactions. The general structure of our foundation model is shown in the figure below. In the remainder of this post, we introduce some key components of this model, the details of which are left for follow-up posts.
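As a rough illustration of the input (the field names below are hypothetical, not our actual schema), each transaction can be viewed as a small structured record with text, numerical, and categorical attributes, and a customer as a time-ordered sequence of such records:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transaction:
    timestamp: int     # epoch seconds; used to order the sequence
    description: str   # free-text attribute, e.g. the merchant descriptor
    amount: float      # numerical attribute
    category: str      # categorical attribute, e.g. "groceries"

@dataclass
class CustomerHistory:
    customer_id: str
    transactions: List[Transaction]

    def ordered(self) -> List[Transaction]:
        # Time-ordered view of the customer's behavior: the model's input
        return sorted(self.transactions, key=lambda t: t.timestamp)
```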
The transformer [9] backbone operates on sequences of embeddings. Hence, we must define an interface between the transactions and these sequence-to-sequence models. This allows us to build our own foundational transformer models by pretraining (from scratch) on Nubank’s corpus of user transactions. As discussed, a key advantage of these user embedding models is that they alleviate the need for manually engineering features from this data. Furthermore, we observe promising scaling laws, where these user representations become more powerful across tasks as we increase the data, compute, and model size.
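A minimal sketch of that interface, assuming an illustrative architecture rather than our actual one: each transaction's categorical, numerical, and text attributes are projected into a shared embedding space, and the resulting sequence is consumed by a standard transformer whose final hidden state serves as the user embedding. Positional encoding, masking, and the pretraining loss are omitted for brevity.

```python
import torch
import torch.nn as nn

class TransactionEncoder(nn.Module):
    """Maps one transaction (categorical id + numeric amount + text vector)
    to a single d_model-sized embedding the transformer can consume."""
    def __init__(self, n_categories: int, text_dim: int, d_model: int = 128):
        super().__init__()
        self.category_emb = nn.Embedding(n_categories, d_model)
        self.amount_proj = nn.Linear(1, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)

    def forward(self, category_ids, amounts, text_vecs):
        # All inputs are shaped (batch, seq_len, ...); sum the three projections.
        return (
            self.category_emb(category_ids)
            + self.amount_proj(amounts.unsqueeze(-1))
            + self.text_proj(text_vecs)
        )

class UserSequenceModel(nn.Module):
    """Runs per-transaction embeddings through a transformer and keeps the
    last position's hidden state as the user embedding."""
    def __init__(self, encoder: TransactionEncoder, d_model: int = 128):
        super().__init__()
        self.encoder = encoder
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, category_ids, amounts, text_vecs):
        x = self.encoder(category_ids, amounts, text_vecs)
        h = self.transformer(x)          # (batch, seq_len, d_model)
        return h[:, -1, :]               # user embedding
```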
However, for many downstream tasks, teams have existing solutions based on tabular features derived from both transaction and non-transaction sources. It is important that any foundation model-based solution can blend its embeddings with these existing tabular features. This speeds adoption, as we can rapidly demonstrate lift on top of any existing model. To facilitate this blending of embeddings and features, we developed an end-to-end fine-tuning process that trains a DNN to blend embeddings and tabular features while jointly fine-tuning the foundation model. This approach optimizes the foundation model for any specific downstream task and achieves state-of-the-art performance. We also hypothesize that joint fusion facilitates learning an embedding that contains signals orthogonal to what is already captured by the tabular features.
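The sketch below illustrates the fusion idea under simplified assumptions (layer sizes and the task head are placeholders, not our production design): the foundation model's user embedding is concatenated with the existing tabular features and passed through a small DNN head, and because the foundation model sits in the same computation graph, its weights are updated jointly with the head.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Blends the foundation model's user embedding with existing tabular
    features for a single downstream task (e.g. a binary risk label)."""
    def __init__(self, foundation_model: nn.Module, emb_dim: int, n_tabular: int):
        super().__init__()
        # Kept as a trainable submodule so its weights are fine-tuned jointly.
        self.foundation_model = foundation_model
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + n_tabular, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, transaction_batch, tabular_features):
        user_emb = self.foundation_model(transaction_batch)        # (batch, emb_dim)
        blended = torch.cat([user_emb, tabular_features], dim=-1)  # embeddings + features
        return self.mlp(blended)                                   # task logits

# Backpropagating a task loss through this module updates both the DNN head
# and the foundation model, i.e. end-to-end joint fine-tuning:
#   loss = nn.BCEWithLogitsLoss()(head(txns, tab_feats), labels)
#   loss.backward(); optimizer.step()
```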
This blog post gave a high-level introduction to Nubank’s approach to leveraging foundation models for financial data, transforming raw transactions into actionable insights. While these foundation models build on standard data sources used throughout the industry, they facilitate automatically learning informative features that may not be obvious to data scientists. Lastly, and most importantly, the features generated by these foundation models improve Nubank’s ability to understand its customers so we can help them meet their financial needs at the right time. In future blog posts, we will explain the key aspects of this model in more detail.
Series Summary
If you’ve made it this far, take a moment to check out the rest of the blog series for more context and technical depth.
References
[1] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
[2] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[3] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
[4] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
[5] Pancha, N., Zhai, A., Leskovec, J., & Rosenberg, C. (2022, August). PinnerFormer: Sequence modeling for user representation at Pinterest. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining (pp. 3702-3712).
[6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
[7] Babaev, D., Ovsov, N., Kireev, I., Ivanova, M., Gusev, G., Nazarov, I., & Tuzhilin, A. (2022, June). CoLES: Contrastive learning for event sequences with self-supervision. In Proceedings of the 2022 International Conference on Management of Data (pp. 1190-1199).
[8] Skalski, P., Sutton, D., Burrell, S., Perez, I., & Wong, J. (2023, November). Towards a Foundation Purchasing Model: Pretrained Generative Autoregression on Transaction Sequences. In Proceedings of the Fourth ACM International Conference on AI in Finance (pp. 141-149).
[9] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Check our job opportunities