Authors: Daniel Braithwaite, Arissa Yoshida, Rafael Celente, and Aman Gupta

In previous blog posts [1,2,3], we introduced Nubank’s approach to using foundation models built on transaction data to solve predictive problems [4]. These posts described how we formulate our transaction data for foundation models [2], pre-train these models, and finally fine-tune them (via joint fusion) for specific downstream tasks [3]. Importantly, we saw large improvements on tasks that are critical to Nubank. The most significant result was that these improvements came not from additional data sources, but from learning transaction features directly rather than relying on handcrafted ones.

While powerful, these foundation models are computationally costly to train. At Nubank, we are always looking for ways to improve data efficiency, both to reduce costs and to build better-performing models. In this post, we explore how a novel optimizer, Muon [5], is helping us achieve these goals. Muon has recently attracted significant interest from the LLM research community, particularly because it reaches a given pre-training quality with fewer samples than AdamW, the de facto optimizer for most pre-training workloads.

The quality of our foundation models increases as a function of the amount of data used, up to and beyond 203M rows. For example, in Figure 1, we show how the test set AUC for one of our smaller models (24M parameters) scales with the number of joint fusion data points. Even slight improvements, such as a 0.05% increase in AUC, are highly valuable because they can translate into millions of dollars in savings for Nubank. However, the AUC gains come with rising training costs: joint fusion [3] with 5M rows takes around 12 hours on 8 NVIDIA A100 GPUs, whereas 40M rows takes around 95 hours on the same hardware.

Figure 1 – Model quality improves as a function of the dataset size

The computational cost of training these models makes it important to adopt methods that improve data efficiency: a more data-efficient model either reaches the same quality at lower cost or reaches higher quality within the same number of training steps. There are various ways to improve data efficiency; in this blog post, we explore using the Muon [5] optimizer to make our foundation model pre-training more data efficient. In turn, these improved foundation models lead to cost savings and better product performance for Nubank’s customers.

The Muon optimizer [5] represents a significant shift from long-dominant, heuristic-based approaches like AdamW, introducing a simple second-order optimization method derived from first principles. Specifically designed for the dense linear layers of neural networks, Muon’s core mechanism can be described as matrix-structured steepest descent with spectral norm regularization. Its fundamental operation involves “orthogonalizing” the gradient matrix for each weight layer by pushing all of its singular values close to 1. This preserves the directional information of the gradient while normalizing its magnitude across all directions, preventing the optimization from being dominated by a few noisy or less useful gradient components. This theoretically elegant concept is made practical through the efficient Newton-Schulz iteration [6], which approximates the orthogonalization without the prohibitive cost of a full SVD.
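
To make this concrete, below is a minimal sketch in PyTorch of the Newton-Schulz orthogonalization, closely following the quintic iteration published with the reference Muon implementation [5] (the coefficients and five-step default come from that post). The `muon_step` function is a simplified illustration of how the orthogonalized momentum is applied; the reference version adds details such as Nesterov momentum and a shape-dependent scale factor.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient (or momentum) matrix, pushing
    its singular values toward 1 while preserving its singular vectors.
    Follows the quintic Newton-Schulz iteration from the Muon reference post [5]."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # iteration coefficients from the reference post
    X = G.float()                        # the reference implementation uses bfloat16 here for speed
    transposed = G.size(0) > G.size(1)
    if transposed:                       # work in the "wide" orientation so that
        X = X.T                          # X @ X.T is the smaller Gram matrix
    X = X / (X.norm() + 1e-7)            # ensure the spectral norm is at most 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

def muon_step(weight: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """Simplified Muon update for one dense weight matrix: accumulate momentum,
    orthogonalize it, and take a step of size lr in that direction."""
    with torch.no_grad():
        momentum_buf.mul_(beta).add_(grad)
        update = newton_schulz_orthogonalize(momentum_buf)
        weight.add_(update, alpha=-lr)
```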

This principled design translates into substantial gains in both data and computational efficiency. Muon’s orthogonalized momentum updates allow for more stable and direct steps toward the loss minimum, enabling the model to learn more from each token it processes. Scaling law experiments consistently show that Muon can match the quality of an AdamW-trained counterpart while consuming only about half (~52%) of the training FLOPs, which corresponds to an approximate 2x improvement in computational efficiency [7,8].

To test our hypothesis that Muon can lead to better foundation models for Nubank, we pre-trained several 330M-parameter models on a 20M-sample dataset, comparing the Muon optimizer against the widely used AdamW optimizer across four learning rates: 1e-4, 2e-4, 1e-3, and 2e-3. The figure below shows these results. Importantly, Muon converges significantly faster than AdamW and reaches lower validation losses at every learning rate.
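
As a rough sketch of how such a comparison can be set up, the snippet below partitions parameters the way Muon expects (dense 2D hidden-layer weights go to Muon, while embeddings, output layers, norms, and biases stay on AdamW, as recommended in [5]) and sweeps the four learning rates. The `Muon` import, its constructor arguments, `model_factory`, and the training loop are illustrative assumptions, not our production pipeline.

```python
import torch
from torch import nn
# Assumption: the reference Muon implementation [5] is copied into the project as muon.py;
# its exact constructor arguments may differ between versions.
from muon import Muon

def build_optimizers(model: nn.Module, lr: float):
    # Route dense 2D hidden-layer weights to Muon; keep embedding/output layers,
    # norms, and biases on AdamW. The name filter here is illustrative only.
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name and "head" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return [
        Muon(muon_params, lr=lr, momentum=0.95),
        torch.optim.AdamW(adamw_params, lr=lr, weight_decay=0.01),
    ]

def run_sweep(model_factory, train_loader, lrs=(1e-4, 2e-4, 1e-3, 2e-3)):
    """Train one fresh model per learning rate, mirroring the comparison above."""
    for lr in lrs:
        model = model_factory()                  # a fresh 330M-parameter model per run
        optimizers = build_optimizers(model, lr)
        for batch in train_loader:               # the 20M-sample pre-training dataset
            loss = model(batch)                  # assumes the model returns a scalar loss
            loss.backward()
            for opt in optimizers:
                opt.step()
                opt.zero_grad()
```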

We can also combine all these runs into a single comparison, which shows that the three best-performing models are Muon 1e-3, Muon 2e-3, and AdamW 1e-3. It is worth reiterating that the Muon runs converge faster than the best-performing AdamW run. These results confirm our hypothesis that Muon lets us train better foundation models. An important side note is that the next-token prediction losses are unusually low for language modeling because the specialized tokens used in our foundation models draw from a much smaller vocabulary.

In this blog post, we demonstrated the advantages of integrating the Muon optimizer into Nubank’s foundation model pre-training pipeline. By adopting Muon, we achieved faster convergence and better model quality than with the widely used AdamW optimizer, unlocking improvements in data and computational efficiency. These advancements translate directly into tangible benefits for Nubank: reduced training costs and enhanced product performance, ultimately delivering a better experience for our customers. Our findings confirm that sophisticated optimization techniques like Muon are crucial for pushing the boundaries of what is possible with large-scale foundation models, ensuring we continue to innovate efficiently and effectively.

References

[1] Braithwaite, D., & Udagawa, H. (2025, March 24). Understanding our customers’ finances through foundation models. Building Nubank. https://building.nubank.com/understanding-our-customers-finances-through-foundation-models/

[2] Braithwaite, D., & Udagawa, H. (2025, April 22). Defining an interface between transaction data and foundation models. Building Nubank. https://building.nubank.com/defining-an-interface-between-transaction-data-and-foundation-models/

[3] Braithwaite, D., Cavalcanti, M., & Udagawa, H. (2025, May 14). Fine-tuning transaction user models. Building Nubank. https://building.nubank.com/fine-tuning-transaction-user-models/

[4] Braithwaite, D. T., Cavalcanti, M., McEver, R. A., et al. (2025). Your Spending Needs Attention: Modeling Financial Habits with Transformers. arXiv preprint arXiv:2507.23267.

[5] Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/

[6] Bernstein, J., & Newhouse, L. (2024). Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325.

[7] Shah, I., Polloreno, A. M., Stratos, K., Monk, P., Chaluvaraju, A., Hojel, A., … & Vaswani, A. (2025). Practical efficiency of muon for pretraining. arXiv preprint arXiv:2505.02222.

[8] Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., … & Yang, Z. (2025). Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982.
