In Data Science, effective query management is pivotal. Dive into Nubank’s strategy, where the power of Scala and Spark is leveraged to ensure efficient data transformations. 

Discover how Nubank’s intricate data architecture, which includes extraction, transformation, and loading, works seamlessly with Scala to maintain data consistency and governance.

This guide offers a deep dive into the coding nuances and real-world examples, showcasing the immense capabilities of Scala and Spark in handling complex queries.

A brief introduction to Nubank’s data structure

Before delving into the programming part, it’s crucial to understand Nubank’s data structure. In a nutshell, our data architecture comprises three standard parts:

  1. Extraction: Data is sourced from microservices or other data sources, like Brazilian Statistics Institute (IBGE), Central Bank, etc. The data is then extracted daily and stored in Amazon’s S3.
  2. Transformation and loading: This process happens once a day. Herein, we manage our transformations using a repository of queries. Each ‘block’ in our data structure represents a dataset or model. To give you some context, we manage over 60,000 datasets with contributions from over a thousand people every month.
  3. Availability and usage: After processing, the data is loaded into a new S3 and Google Cloud. This data can be accessed through Databricks, BigQuery, etc., offering a democratized data environment.

Check our job opportunities

The role of Scala and Spark

Scala and Spark come into play during the transformation stage. With the vast amount of transformations happening, it’s imperative to manage our queries effectively, ensuring data consistency and governance. This is where Scala and Spark shine.

Diving into the coding aspect, it’s important to say that we use Databricks, which is similar to Jupyter. It’s an execution environment styled as a notebook.

A practical example with Scala and Spark

Let’s demonstrate a simple query constructed using a ‘Base Table’. This table consists of transaction details, such as the request date and client information.

  1. Filtering data: Just like SQL, Spark’s language enables filtering data efficiently. For example, by using the ‘where’ clause, transactions with a ‘completed’ status can be filtered out.
  2. Adding columns: We can add a ‘requestDate’ column to our data using the ‘withColumn’ function in Spark. This column derives data from the ‘requestTimestamp’, converting timestamps into a date format.

While the initial code consists of only a few lines, the power of ScalaSpark allows us to rewrite transformations as chains based on pure functions. This structured approach helps in managing complex queries more efficiently.

MetaData: an essential ingredient and harnessing the power of Scala

When handling vast amounts of data, metadata becomes crucial. It provides information about the query, its author, the description, and more.

While it’s possible to maintain a separate metadata repository, integrating the query with its metadata is more efficient, streamlining the process of data transformation and management.

Scala’s unique selling proposition is its ability to combine object-oriented programming with functional programming seamlessly.

To structure our metadata better, we introduce ‘Spark Query’—a Trait in Scala (akin to an interface in Java or abstract class in other languages). This allows us to define a query with its associated metadata.

For instance, a ‘Spark Query’ can have:

  • tableName: Defines where the query result will be stored.
  • description: Offers a brief about the query’s purpose.

Setting the stage

Imagine the challenge: you have a query, and you need to slot it into your pre-defined structure. This requires you to create an object for the previously defined query. For the sake of this tutorial, we can name it CompletedPixTransactions.

Creating our object

To implement a thread in Scala, we use the keyword extends. After extending our thread, we’ll define all necessary elements inside. Since Scala boasts impressive type inference capabilities, there’s no need to explicitly specify the type for everything. For instance, if tableName is a string, we simply define our string.

For this exercise, our tableName will be named CompletedPixMovements. This is where our data will reside. To provide a brief description or what we’d call an ExampleDataset, consider it as a snapshot of our ongoing project. The owner team (let’s call it OwnerTeam) doesn’t need to be understood in depth for this context. Our data pertains to Brazil, so naturally, all our content is in Portuguese.

After evaluating our query, we concluded its QualityAssurance level is high, indicating a well-written and precise business concept. As for our inputs, we’ve kept it straightforward using just one table: PixMovements.

Building metadata and the query

Now comes the crucial part. How do we draft our query? We’ll retrieve our previously written query and embed it within our encapsulation structure. With Scala, when constructing an object in this manner, there’s freedom to add supplementary details. If desired, a frequency metadata can be appended to indicate the query’s execution frequency.

Unique functions exclusive to this Spark query can also be nested within. This enhances clarity, as when the query is written, these functions can be referenced directly.

The next step is to extract our input data from the dataframe, apply the necessary transformations as before, and avoid any display functions as our primary concern is the execution of the query, not its display or saving mechanism.

On successful execution, this approach helps us encapsulate both our query and its associated metadata within a single, standardized object.

Execution and results

The process to execute this encapsulated query is quite intuitive. First, specify your query’s input, which is typically a map, then access the data within the table using the map’s name. This object can then call the defined query and input data into it. If executed correctly, you should see similar results as before.

For saving results into a table一a standard procedure in pipelines一you can utilize Spark’s commands. The table name is predefined within our object as TableName. By executing, the query results will be stored in this specified table.

In summary, we’ve successfully encapsulated a basic query and its metadata into a standardized object, offering a clear and robust method for managing Spark queries with Scala.

A step further: multiple queries

Real-world production environments don’t just deal with one query but potentially hundreds. Handling such volumes requires a more advanced approach.

In our initial segment, we learned how to craft a Spark query using Scala’s API, emphasizing pure functions and enhanced governance layers. As we move forward, we’ll delve into harnessing the combined power of object-oriented programming and functional language with Scala.

Within our Spark Query, we possess an attribute named QueryInputs.This attribute contains a list detailing the inputs for a specific query.

For example, if we want to determine the number of inputs for a given query, we could easily do so by calling the .size method on our list.

Creating the getNumberOfInputs function

To demonstrate, we can craft a function named getNumberOfInputs that fetches the number of inputs. When this function is applied across a list of queries, it would yield a list of integers, representing the number of inputs for each query.

This transformation is made possible using the map operator. The map operator applies a given function to each item of an input list, creating a new list that contains items returned by the function.

Understanding the reduce operation:

Another crucial operation in functional programming is reduce. The objective of reduce is to take a list of elements and condense it into a single value.

For instance, given a list of numbers, one could use reduce to compute the average or the sum of the list. The function accepted by the reduce method typically takes in two arguments and returns one output.

To visualize, imagine applying a function sequentially to pairs of elements in a list, gradually narrowing down to a single accumulated result.

Summing the inputs

To put this into perspective, consider a scenario where we have a list of integers representing query inputs. We could define a function, sumNumberOfInputs, that computes the sum of two integers.

When applied over our list using reduce, this function would accumulate the total number of inputs across all queries.

Identifying queries from the PIX engineers team

To isolate queries crafted by the PIX Engineers team, we employ the filter operation. This operation processes a list, retaining only elements that match a given condition.

For our use case, we define a function, isFromTeamPixEngineers, which checks if a given query originates from this specific team.

Isolating low-quality queries

From our filtered list of PIX Engineers’ queries, we might want to discern which of them are of low quality. We can introduce another filtering condition, checking the QualityAssurance attribute of each query against the value ‘low’.

A common requirement in ETL environments is identifying dependencies. Suppose we want to pinpoint which datasets rely on the Meetup PixMovements dataset. We can craft another filter function to sift through our list and find queries dependent on this specific dataset.

This exercise is pivotal, especially when considering the ripple effects of altering a foundational query.

Challenging the norms: asking more from Scala

After establishing a robust query system, what next? Here are some provocations:

Implementing stringent tests

  1. Ensuring high validity: For instance, all datasets under the PixEngineers team should maintain high validity. We can then only accept transformations adhering to this rule.
  2. Limiting query inputs: No query should depend on more than 10 inputs. The idea is to prevent exceedingly complex dependencies that could be hard to debug or maintain.
  3. Avoiding cyclic dependencies: For instance, if dataset A depends on dataset B, and B relies on C, then C should not depend on A. Such cyclic dependencies can introduce subtle bugs and performance issues.

These are practices we’ve ingrained into our data governance at Nubank, ensuring control and consistency.

Advanced Scala capabilities

Scala’s flexibility offers even more advanced possibilities:

  1. Visualizing dependencies: Given a list of all queries, Scala allows us to interpret the dependency graph and visualize it. Each dataset is represented as a block, with chains showing their dependencies. This visualization can be enriched by color-coding datasets based on their quality, enabling quick insights into potential areas of concern.
  2. Metadata as data: Another intriguing capability involves treating metadata as actual data. By capturing the transformation metadata, we can create a table and integrate it into our ETL processes. This data can then be consumed and monitored, offering real-time insights.

For instance, with just a few lines of code, we can create a dataframe, which then lets us analyze data distribution, such as the number of datasets per country or owner team, in real-time.

Mastering Scala and Spark proves instrumental in optimizing data transformations and management. As illustrated by Nubank’s methodology, integrating both tools offers a comprehensive solution to the challenges of data governance.

With features like encapsulating query and metadata, visualizing dependencies, and treating metadata as actual data, Scala emerges as an invaluable asset for data professionals. As data ecosystems continue to grow and become more complex, adopting such strategies becomes crucial for businesses aiming for efficiency, clarity, and robust data governance.

Check out what we shared about this topic on Meetup below:

Check our job opportunities