At Nubank DS&ML Meetup #97, André Midea, an Engineering Manager at Nubank, guided participants through the details of streaming data infrastructure.

With a 15-year trajectory that evolved from distributed systems engineering to pioneering streaming platforms, André is passionate about making complex streaming technologies accessible and practical. 

At Nubank, he played a key role in scaling data platforms and developing innovative architectures such as Avalanche, which won recognition as Nubank’s most innovative internal product in 2024.

Understanding Streaming vs. Batch Processing

Stream processing handles continuous, unbounded datasets, whereas batch processing handles discrete, bounded ones. Think of streaming as highway traffic, continuous and unpredictable, and of batch processing as shipping containers that move at scheduled intervals with clearly defined contents.

Batch processing handles large, well-defined datasets efficiently and is easy to optimize, because the full input is known in advance. Streaming processes each event as it arrives, which real-time applications require. A stream is a sequence of events over time, and a table is a static snapshot of that stream at a specific moment. Seen this way, a bounded dataset is simply a stream with a known end, so batch processing can be treated as a special case of streaming.
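
To make the stream and table relationship concrete, here is a minimal sketch in plain Python. It is not Avalanche, Flink, or Pinot code. The event layout (timestamp, account id, balance) and the function name `snapshot` are assumptions made up for this illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical changelog event: (timestamp, account_id, new_balance).
Event = Tuple[int, str, int]


def snapshot(events: Iterable[Event], as_of: int) -> Dict[str, int]:
    """Fold a stream into a table: the latest value per key as of `as_of`."""
    table: Dict[str, int] = {}
    for ts, key, value in events:
        if ts > as_of:
            break  # this sketch assumes events are ordered by timestamp
        table[key] = value  # a newer event overwrites the older row
    return table


events: List[Event] = [
    (1, "acc-1", 100),
    (2, "acc-2", 50),
    (3, "acc-1", 70),
    (4, "acc-3", 10),
]

# Two tables derived from the same stream at different moments.
table_at_2 = snapshot(events, as_of=2)
table_at_4 = snapshot(events, as_of=4)
print(table_at_2)
print(table_at_4)
```

The same stream yields a different table depending on the moment you stop reading it, which is why a table can be seen as a snapshot of a stream.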

Avalanche Platform: Core components and capabilities

Avalanche is Nubank’s robust streaming data platform designed explicitly to simplify real-time analytics. It leverages two key technologies:

  • Flink: An advanced streaming engine specifically designed to manage and process unbounded datasets efficiently, providing low-latency performance.
  • Pinot: A real-time analytics database developed at LinkedIn, optimized for high-performance OLAP (Online Analytical Processing). Pinot excels at ingesting and analyzing massive amounts of streaming data with minimal latency.

Avalanche integrates these technologies, enabling rapid processing, analysis, and data delivery critical for timely decision-making and real-time insights.

Technical foundations: Event time, watermarks, and windows

Effective streaming relies on mastering key technical concepts:

  • Event time vs. processing time:
    • Event time marks when an event actually occurs.
    • Processing time records when the event is handled by the streaming system.
  • Watermarks:
    • Markers of how far event time has progressed. A watermark tells the system that events older than a given timestamp are no longer expected, so it knows when enough data has arrived to compute a result. This keeps results consistent even when events arrive late or out of order.
  • Windows:
    • Structures dividing continuous data streams into intervals (e.g., one-minute segments), facilitating organized, accurate data computation and analysis. Windows are essential for performing precise analytics and joining multiple data streams.
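
The three concepts above can be sketched together in a few lines of plain Python. This is a simplified illustration, not Flink's implementation. It assumes events arrive at most `MAX_OUT_OF_ORDERNESS` seconds out of order, closes a window once the watermark reaches the window's end, and drops anything later than that. All names and values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE = 60           # seconds: one-minute tumbling windows
MAX_OUT_OF_ORDERNESS = 10  # seconds: assumed bound on how late events arrive


def window_start(event_time: int) -> int:
    """Map an event-time timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SIZE)


def count_per_window(event_times: Iterable[int]) -> List[Tuple[int, int]]:
    """Count events per window.

    `event_times` are event-time timestamps listed in arrival
    (processing-time) order. Returns (window_start, count) pairs in the
    order in which the windows are closed by the watermark.
    """
    open_windows: Dict[int, int] = defaultdict(int)
    closed: List[Tuple[int, int]] = []
    watermark: float = float("-inf")
    for event_time in event_times:
        start = window_start(event_time)
        if start + WINDOW_SIZE <= watermark:
            continue  # window already closed: this sketch drops late events
        open_windows[start] += 1
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - MAX_OUT_OF_ORDERNESS)
        for s in sorted(open_windows):
            if s + WINDOW_SIZE <= watermark:
                closed.append((s, open_windows.pop(s)))
    return closed


# Event 55 arrives after 61 but is still counted in window 0.
# Event 30 arrives after window 0 has closed and is dropped.
arrivals = [5, 42, 58, 61, 55, 75, 30, 130]
results = count_per_window(arrivals)
print(results)
```

The window starting at 120 never appears in the output because the watermark has not yet passed its end. In an unbounded stream, a result is only emitted once the system has reason to believe the window is complete.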

Practical use cases

The meetup provided detailed examinations of real-world applications of streaming at Nubank:

  • Fraud detection:
    • Streaming enables real-time analysis of transactions, which is crucial for identifying and addressing fraud quickly. Responding immediately is considerably more effective than waiting for the next batch run.
  • Real-time feature engineering:
    • Using Avalanche and Pinot, Nubank turns streams into real-time features for machine learning models. Features are generated dynamically through queries, which simplifies traditionally complex engineering work and allows rapid iteration and deployment.
  • Clickstream analytics:
    • Shifting analytics from batch processing to streaming resulted in substantial performance improvements, providing instant user-experience insights and dramatically reducing operational costs. Pinot’s optimized real-time query capabilities proved crucial for managing vast clickstream data.
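
As a rough illustration of what a real-time feature for fraud detection can look like, the sketch below computes a "transaction velocity" per customer: the number of transactions in the last 60 seconds of event time. It is hand-written Python, not the query-based approach on Pinot described in the talk. The class name, the threshold, and the sample data are assumptions for this example, and events are assumed to arrive in order per customer.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class VelocityFeature:
    """Transactions per customer within a sliding event-time window."""

    def __init__(self, window_seconds: int = 60) -> None:
        self.window_seconds = window_seconds
        self._recent: Dict[str, Deque[int]] = defaultdict(deque)

    def update(self, customer_id: str, event_time: int) -> int:
        """Record one transaction and return the updated feature value."""
        timestamps = self._recent[customer_id]
        timestamps.append(event_time)
        # Evict timestamps that have fallen out of the window.
        while timestamps and timestamps[0] <= event_time - self.window_seconds:
            timestamps.popleft()
        return len(timestamps)


SUSPICIOUS_VELOCITY = 3  # hypothetical threshold, for illustration only

transactions: List[Tuple[str, int]] = [
    ("c1", 0), ("c1", 10), ("c2", 15), ("c1", 20), ("c1", 90),
]

feature = VelocityFeature(window_seconds=60)
counts = [feature.update(customer, ts) for customer, ts in transactions]
flagged = [
    tx for tx, n in zip(transactions, counts) if n >= SUSPICIOUS_VELOCITY
]
print(counts)
print(flagged)
```

The feature value is available the moment each transaction is seen, which is the property that makes streaming attractive for this kind of check.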

Overcoming real-world streaming challenges

While offering many benefits, streaming infrastructure presents distinct challenges:

  • State management:
    • Maintaining accurate state across distributed systems is complex, demanding sophisticated management strategies and robust frameworks.
  • Zero downtime deployments:
    • Ensuring uninterrupted service during updates or scaling events requires advanced operational planning and infrastructure support.
  • Accessibility and adoption:
    • Increasing accessibility for non-technical users through intuitive interfaces and clear documentation remains essential to wider adoption and effective use.
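
One common strategy for the state management challenge is to checkpoint operator state together with the input position, then restore and replay after a failure. The sketch below shows the idea in plain Python under simplifying assumptions: a single process, an in-memory log, and a hypothetical `RunningTotals` operator. It is not how Flink or Avalanche implement checkpointing.

```python
import copy
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # hypothetical record: (key, amount)


class RunningTotals:
    """Keyed state (a sum per key) plus the log offset already applied."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = {}
        self.offset = 0  # number of log records already applied

    def process(self, log: List[Record], upto: int) -> None:
        """Apply log records from the current offset up to `upto`."""
        end = min(upto, len(log))
        for key, amount in log[self.offset:end]:
            self.totals[key] = self.totals.get(key, 0) + amount
        self.offset = max(self.offset, end)

    def checkpoint(self) -> dict:
        """Snapshot the state and the offset together, as one unit."""
        return {"totals": copy.deepcopy(self.totals), "offset": self.offset}

    @classmethod
    def restore(cls, snap: dict) -> "RunningTotals":
        op = cls()
        op.totals = copy.deepcopy(snap["totals"])
        op.offset = snap["offset"]
        return op


log: List[Record] = [("a", 5), ("b", 3), ("a", 2), ("b", 1)]

op = RunningTotals()
op.process(log, upto=2)
snap = op.checkpoint()
op.process(log, upto=3)  # work done after the checkpoint is lost on a crash

# Recovery: restore the snapshot and replay the log from the saved offset.
recovered = RunningTotals.restore(snap)
recovered.process(log, upto=len(log))

reference = RunningTotals()
reference.process(log, upto=len(log))
print(recovered.totals, reference.totals)
```

Because the offset is saved with the state, replay neither skips nor double-counts records, so the recovered operator ends up with the same totals as one that never failed.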

Developments and innovations

Participants learned about recent innovations reshaping streaming:

  • Unified batch and streaming architectures:
    • Tools like Flink enable the combination of batch and streaming processing within a single infrastructure, streamlining data management, reducing costs, and increasing efficiency.
  • Advanced dataflow strategies:
    • Modern dataflow approaches, such as continuous changelog integration, enhance responsiveness and reliability, eliminating complex batch-job dependencies.
  • Lakehouse architecture:
    • Emerging Lakehouse systems integrate streaming and batch processing within unified storage architectures, improving flexibility and simplifying data management.
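
The idea behind a unified batch and streaming model can be shown with a small Python sketch: one transformation, written once, applied to both a bounded list and an unbounded generator. This only illustrates the concept. It does not use Flink's API, and the function names and sample data are assumptions.

```python
from itertools import cycle, islice
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, int]  # hypothetical record: (key, amount)


def running_totals(events: Iterable[Record]) -> Iterator[Record]:
    """Emit a changelog: the updated total for a key after each event."""
    totals: Dict[str, int] = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
        yield key, totals[key]


def batch_totals(events: Iterable[Record]) -> Dict[str, int]:
    """Batch mode: run the same logic to the end and keep the last row per key."""
    return dict(running_totals(events))


# Bounded input: the stream ends, so a final table exists.
final_table = batch_totals([("a", 1), ("b", 2), ("a", 2)])

# Unbounded input: the source never ends, so we consume results incrementally.
endless_source = cycle([("a", 1), ("b", 2)])
first_five = list(islice(running_totals(endless_source), 5))

print(final_table)
print(first_five)
```

The batch result is just the streaming changelog collapsed to its latest entry per key, which is one way to read the claim that downstream consumers can follow a continuous changelog instead of waiting on batch jobs.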

Conclusion

Nubank’s DS&ML Meetup #97 delivered comprehensive insights into streaming data infrastructure, from fundamental concepts and advanced techniques to practical applications and emerging innovations. The detailed exploration of Avalanche and related technologies underscores Nubank’s commitment to technological leadership, driving innovation and efficiency in digital banking.

Follow the Building Nubank blog for more deep dives into technology and innovation, and check out our job openings! Let’s build the Purple Future together!
