The digital transformation of the financial sector has placed data processing at its core, and companies like Nubank rely on platforms such as Apache Spark for their operations. Spark, an open-source distributed computing framework, offers a robust solution for handling large data volumes.
Yet mastering its intricacies is no simple task. Read on to learn about the mechanics of Spark, the data processing challenges Nubank encounters, and the strategies we employ to navigate them.
A brief introduction to Spark
Spark is essentially a distributed computing framework: it executes code in parallel across many machines. Primarily designed for Big Data processing, it is completely open source.
How does Spark work?
The architecture of Spark comprises various components. The fundamental ones are the driver, which plans the job and coordinates the work; the executors, which process partitions of the data in parallel; and a cluster manager, which allocates resources between them.
Nubank’s data processing challenges with Spark
With an expanding product suite and rapid growth, Nubank identified a need for better communication orchestration to prevent customers from receiving simultaneous communications about various products.
Setting the stage: communication orchestrator
To address this, an infrastructure was set up within our ETL (Extract, Transform, Load) process, composed of several modules.
Users added new customer segments and campaigns to this structure, and the campaign details were persisted into datasets. To pinpoint which customer segment would receive which campaign, a join operation was performed between segments and campaigns. This combination was termed the “pipeline”.
The problem at hand
Initially, users added only a handful of segments and campaigns per product. In certain instances, however, the numbers escalated, leading to a ten-fold increase in campaign datasets, coupled with a significant rise in the creation of segments, which were not persisted as datasets.
This unforeseen surge led to error messages cropping up. On Databricks, a popular analytics tool based on Spark, the “driver stopped unexpectedly” error appeared. On the ETL side, a “timeout exception” error was flagged.
Potential culprits: broadcast join or out-of-memory-on-driver
Two primary suspects emerged: a broadcast join overloading the driver, or the driver simply running out of memory.
Using the Spark UI on Databricks, we found that the joins between segments and campaigns were not broadcast joins but sort-merge joins, which eliminated the broadcast join hypothesis.
Further investigation showed that the process halted during the union of several tables. Interestingly, when shifted to a general cluster with more memory, the query ran smoothly, hinting at a memory issue.
When merging a multitude of datasets, there is a risk of overburdening the driver. To preemptively mitigate this, we instituted tests ensuring users don't add more than 40 segments, a number that leaves a buffer before the system breaks down.
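A minimal sketch of such a guardrail (the cap and the function name are illustrative, not Nubank's actual test code):

```python
MAX_SEGMENTS = 40  # hypothetical cap, kept below the observed breaking point

def validate_segment_count(segments: list) -> None:
    """Fail fast before the pipeline union can overload the driver."""
    if len(segments) > MAX_SEGMENTS:
        raise ValueError(
            f"{len(segments)} segments exceed the limit of {MAX_SEGMENTS}"
        )

# Example: a 41st segment should be rejected at build time, not at runtime.
try:
    validate_segment_count([f"seg-{i}" for i in range(41)])
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

Failing the test at build time is deliberate: it surfaces the problem to the user adding the segment, instead of letting the nightly ETL die with an opaque driver error.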
Data streams
At the heart of our challenges are the colossal data streams coming from the app’s events: with over 59 million users, we register enormous event volumes.
Some primary challenges regarding app events are:
The solutions to these problems include:
Handling smaller datasets
Conversely, we also need to analyze smaller datasets, like customer satisfaction surveys, where the challenge is different. For modest data sizes, such as 20 MB, Spark might seem like overkill, but centralizing data for company-wide analysis justifies the approach.
CSV challenges:
The solutions include:
Diving deeper: specific Spark challenges
When dealing with varying data sizes, Spark brings its own set of complications:
To solve these problems, we use:
Embracing Spark 3
The recent migration to Spark 3 introduced a plethora of optimization parameters. Adaptive Query Execution (AQE) handles skewed joins automatically and can switch a sort-merge join to a broadcast join at runtime. Spark 3 also automates partition configuration, coalescing shuffle partitions to keep their sizes optimal during processing.
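The AQE features mentioned above map onto a handful of configuration keys; the sketch below shows Spark's defaults (since 3.2), not Nubank's production configuration:

```properties
# Enable Adaptive Query Execution (on by default since Spark 3.2).
spark.sql.adaptive.enabled=true
# Coalesce small shuffle partitions into fewer, well-sized ones.
spark.sql.adaptive.coalescePartitions.enabled=true
# Split skewed join partitions so one oversized key does not stall the job.
spark.sql.adaptive.skewJoin.enabled=true
# Use local shuffle readers when AQE replans a sort-merge join as a
# broadcast join at runtime.
spark.sql.adaptive.localShuffleReader.enabled=true
```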
Nubank’s journey in processing massive amounts of data using Spark offers an illuminating look into the practical challenges and potential solutions for enterprises navigating the Big Data era. It underscores the need for agility, foresight, and a proactive approach to managing and optimizing data processing workflows.
With Spark continuously evolving, as evident from the transition to Spark 3, organizations can look forward to even more refined tools and features. By learning from pioneers like Nubank, businesses can better position themselves to capitalize on the benefits of Spark and drive their data strategies to new heights.
We also shared what we learned about this topic at a Meetup.