The digital transformation of the financial sector has placed data processing at its core, with companies like Nubank relying on Apache Spark for their operations. Spark, an open-source distributed computing framework, offers a robust solution for handling large data volumes.

Yet, mastering its intricacies is no simple task. Read on to delve into the mechanics of Spark, the unique data processing challenges Nubank encounters, and the strategies we employ to navigate these complexities.

A brief introduction to Spark

Spark is essentially a distributed computing framework, executing code in parallel across many machines. Primarily designed for Big Data processing, it is completely open source.

How does Spark work?

The architecture of Spark comprises various components, of which the following are fundamental:

  • Cluster: A group of machines, each running one or more JVMs, over which Spark distributes work.
  • Driver: Responsible for interpreting the user’s program (query), analyzing, distributing, and scheduling the tasks across executors. Think of the driver as the brain behind the operations.
  • Executor: These do the heavy lifting. They process specific data segments, known as partitions, executing the code designated by the driver.
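To make the driver/executor split concrete, here is an illustrative job submission showing where each component's resources are configured. The flag values and the `my_job.py` filename are hypothetical, not taken from Nubank's setup:

```shell
# Illustrative spark-submit: the driver gets its own memory budget,
# while each executor gets memory and cores for processing partitions.
spark-submit \
  --driver-memory 8g \
  --num-executors 10 \
  --executor-memory 16g \
  --executor-cores 4 \
  my_job.py
```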

Nubank’s data processing challenges with Spark

With an expanding product suite and rapid growth, Nubank identified a need for better communication orchestration to prevent customers from receiving simultaneous communications about various products.

Setting the stage: communication orchestrator

To address this, an infrastructure was set up within our ETL (Extract, Transform, Load) process. This infrastructure was composed of modules like:

  • Product module: Allowing users to add new products they wanted to communicate about.
  • Customer segment module: Facilitating communication based on various criteria, such as age or income.
  • Campaign module: Specifying the communication type – be it an email or a notification.

Users added new customer segments and campaigns to this structure, and the campaign details were then persisted into datasets. To pinpoint which customer segment would receive which campaign, a “join” operation was performed between segments and campaigns. This combination was termed the “pipeline”.

The problem at hand

Initially, users would add a handful of segments and campaigns for the same product. In certain instances, however, the number of added segments and campaigns escalated, leading to a ten-fold increase in campaign datasets. This was coupled with a significant rise in the creation of segments, which were not persisted as datasets.

This unforeseen surge led to error messages cropping up. On Databricks, a popular analytics tool based on Spark, the “driver stopped unexpectedly” error appeared. On the ETL side, a “timeout exception” error was flagged.

Potential culprits: broadcast join or out-of-memory-on-driver

Two primary suspicions emerged:

  1. Broadcast join: Spark chooses the type of join operation based on data size. When there’s a significant disparity between dataset sizes, it may opt for a “broadcast join.” If Spark mistakenly opts for this join type, it might exceed the preset broadcast timeout, causing the error.
  2. Out-of-memory-on-driver: This can occur when an action in a query sends results to the driver that are too large for its memory, leading to memory exhaustion.
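Spark's choice between these join strategies is governed by configuration. The following fragment shows the two relevant settings with their stock defaults (illustrative values, not Nubank's production configuration):

```properties
# spark-defaults.conf (illustrative values)
# Tables smaller than this threshold (bytes) are broadcast-joined;
# setting it to -1 disables broadcast joins entirely.
spark.sql.autoBroadcastJoinThreshold   10485760
# Seconds the driver waits for a broadcast before raising a timeout error.
spark.sql.broadcastTimeout             300
```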

Using the Spark UI on Databricks, we found that the joins between segments and campaigns were not broadcast joins but rather “sort-merge” joins. This observation eliminated the broadcast join hypothesis.

Further investigation showed that the process halted during the union of several tables. Interestingly, when shifted to a general cluster with more memory, the query ran smoothly, hinting at a memory issue.

When merging a multitude of datasets, there’s a risk of overburdening the driver. To mitigate this preemptively, we instituted tests that ensure users don’t add more than a certain number of segments – specifically, 40. This limit leaves headroom before the driver runs out of memory.
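A guard like this can be sketched in a few lines. The function and constant names below are hypothetical; only the limit of 40 comes from the text:

```python
# Illustrative guard: fail fast when a product accumulates more
# segments than the driver can safely union.
MAX_SEGMENTS_PER_PRODUCT = 40  # empirical limit mentioned above

def validate_segment_count(segments):
    """Raise before the pipeline runs, instead of letting the driver crash."""
    if len(segments) > MAX_SEGMENTS_PER_PRODUCT:
        raise ValueError(
            f"{len(segments)} segments exceed the limit of "
            f"{MAX_SEGMENTS_PER_PRODUCT}; split the product into batches."
        )
    return segments
```

Running such a check at submission time turns a hard-to-debug driver crash into an immediate, actionable error for the user.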

Data streams

At the heart of our challenges are the colossal data streams coming from the app’s events. With over 59 million users, we register:

  • Over 1 billion triggered events daily.
  • An aggregate of over 100 terabytes of event data that necessitates daily processing.

Some primary challenges regarding app events are:

  • Deduplication of events: Often, clicks are registered multiple times and need deduplication within Spark.
  • Daily processing of data: Ensuring that the latest data is always available to our analysts.
  • Data exploration: The vastness of data makes exploration intricate on our platforms.

The solutions to these problems include:

  • Deduplication: By ensuring events are produced and delivered only once (both in-app and at the Message Broker level), the onus on Spark for deduplication is minimized.
  • Incremental processing: Instead of processing the entire database daily, process data incrementally – one day at a time.
  • Data filtering: Offering analysts filtered datasets can expedite data processing on platforms like Databricks or BigQuery.
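The deduplication idea can be sketched independently of Spark: keep the first occurrence of each event identifier and drop the rest. This is a minimal illustration, assuming each event carries a unique id under a hypothetical `event_id` field:

```python
def deduplicate_events(events, key="event_id"):
    """Keep the first occurrence of each event id, preserving order.

    Assumes every event dict carries a unique identifier under `key`
    (a hypothetical field name); duplicate clicks share the same id.
    """
    seen = set()
    unique = []
    for event in events:
        event_key = event[key]
        if event_key not in seen:
            seen.add(event_key)
            unique.append(event)
    return unique
```

In Spark itself the same effect is typically achieved with `dropDuplicates` on the id column, but pushing uniqueness upstream (in-app and at the Message Broker) keeps this work off the cluster entirely.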

Handling smaller datasets

Conversely, we also need to analyze smaller datasets, such as customer satisfaction surveys. The challenge here is different: for modest data sizes, say 20 MB, Spark might seem like overkill. But centralizing data for company-wide analysis justifies the approach.

CSV challenges:

  1. Schema guarantee: Ensuring the structure of the CSV remains consistent.
  2. Data lineage and versioning: Keeping track of data versions, especially when received ad-hoc.
  3. Spark overhead: The overhead of using Spark for smaller datasets.

The solutions include:

  • Guaranteed schema in JSON: If a CSV doesn’t conform to the guaranteed schema, it won’t be processed.
  • Automated versioning: All versions of the data are stored in the Data Lake automatically. Alongside data, schemas are versioned too.
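A schema guarantee of this kind can be sketched as a header check against JSON-style column metadata. The schema contents and function names here are hypothetical placeholders, not Nubank's actual schema format:

```python
import csv
import io

# Hypothetical guaranteed schema, expressed as JSON-style metadata:
# the exact column names a survey CSV must carry.
GUARANTEED_SCHEMA = {
    "columns": ["customer_id", "score", "comment"],
}

def csv_matches_schema(csv_text, schema=GUARANTEED_SCHEMA):
    """Return True only when the CSV header matches the guaranteed columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return header == schema["columns"]
```

A file that fails this check is rejected before ingestion, which is what “it won’t be processed” amounts to in practice.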

Diving deeper: specific Spark challenges

When dealing with varying data sizes, Spark brings its own set of complications:

  1. Spill: Data is moved to disk when it can’t fit into RAM. If the disk runs out of space, processing can terminate.
  2. Skew: This occurs when one data partition is significantly larger than others. In extreme cases, while all other partitions have concluded, one remains processing.
  3. Shuffle: Moving data between partitions, especially during operations like join or groupby, can be time-consuming.
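One common mitigation for skew (not necessarily the one used at Nubank) is key salting: a hot key is split into several sub-keys so its rows spread across partitions, while the smaller side of the join replicates each key once per salt. A minimal sketch:

```python
import random

def salt_key(key, num_salts=8, rng=random):
    """Spread a hot key across `num_salts` sub-keys so a single
    partition does not receive all of its rows."""
    return f"{key}#{rng.randrange(num_salts)}"

def explode_salts(key, num_salts=8):
    """On the small side of a join, replicate the key once per salt
    so every salted row on the large side still finds its match."""
    return [f"{key}#{i}" for i in range(num_salts)]
```

After joining on the salted keys, the salt suffix is stripped to recover the original key.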

To solve these problems, we use:

  • Parameter optimization in Spark: Default configurations might not suffice, so optimization based on data origin is essential.
  • Shuffle partitions and maxPartitionBytes: These parameters can be tweaked to manage the data partitions better.
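Concretely, these two knobs look like this in configuration (values are illustrative defaults, tuned per workload rather than fixed):

```properties
# spark-defaults.conf (illustrative values, tune per workload)
# Number of partitions produced by shuffles from joins/aggregations.
spark.sql.shuffle.partitions        200
# Maximum bytes packed into a single partition when reading files.
spark.sql.files.maxPartitionBytes   134217728
```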

Embracing Spark 3

The recent migration to Spark 3 has introduced a plethora of parameters for optimization. Adaptive Query Execution (AQE) handles skewed joins automatically and can switch from a sort-merge join to a broadcast join at runtime. Spark 3 also automates partition configuration, coalescing shuffle partitions to optimal sizes during processing.
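The adaptive behaviors described above map to a handful of Spark 3 settings (shown with illustrative values; production settings will differ):

```properties
# Spark 3 adaptive execution (illustrative)
spark.sql.adaptive.enabled                      true
# Split oversized partitions detected at runtime (skew-join handling).
spark.sql.adaptive.skewJoin.enabled             true
# Merge tiny shuffle partitions into right-sized ones automatically.
spark.sql.adaptive.coalescePartitions.enabled   true
```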

Nubank’s journey in processing massive amounts of data using Spark offers an illuminating look into the practical challenges and potential solutions for enterprises navigating the Big Data era. It underscores the need for agility, foresight, and a proactive approach to managing and optimizing data processing workflows.

With Spark continuously evolving, as evident from the transition to Spark 3, organizations can look forward to even more refined tools and features. By learning from pioneers like Nubank, businesses can better position themselves to capitalize on the benefits of Spark and drive their data strategies to new heights.

Check out what we shared about this topic on Meetup below:

Check our job opportunities