The first edition of the 2025 Nubank Engineering Meetup kicked off with a core topic for those working with distributed architectures, microservices, and system reliability: observability. 

The event, held in February, opened the year’s technical meetup calendar and featured Guto (an engineer at Nu and the night’s host), AWS Solution Architects Lucas Vieira Souza da Silva and Luis Tiani, as well as Nubank’s engineering team. Our representatives, Caio (Engineering Manager) and Otávio (Lead Engineer), shared the behind-the-scenes evolution of our log stack and the creation of the Observability Stream and Alexandria platforms.

Their talk focused on how to integrate open source tools with AWS managed services to build scalable, efficient observability pipelines. The session covered the foundations of the three pillars of observability—metrics, logs, and traces—and included practical demos using OpenTelemetry, Prometheus, Grafana, and OpenSearch.

What is observability, really?

The opening question was simple but essential: what does it mean to make a system observable? In short, it means being able to answer, with concrete data, questions about the internal behavior of applications in production. This is done using three types of signals:

  • Metrics: numeric time-series data used to evaluate latencies, resource usage, and error rates.
  • Logs: structured textual records that provide detailed insights into events and application flows.
  • Traces: distributed records that show how services interact during a request, helping identify bottlenecks and dependencies.

These signals complement one another and form the foundation for building dashboards, setting up alerts, and conducting deep system behavior analysis.
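To make the complementarity concrete, here is a minimal, stdlib-only sketch (the field names and values are illustrative, not from any specific SDK) of the three signal types sharing a trace ID — the attribute that lets dashboards correlate a metric spike with the logs and spans behind it:

```python
import json
import time
import uuid

# Illustrative only: one trace ID threads through the log and the span,
# which is what makes cross-signal correlation possible later.
trace_id = uuid.uuid4().hex

metric = {"name": "http.request.duration_ms", "value": 42.7,
          "timestamp": time.time(), "labels": {"service": "checkout"}}

log = {"level": "ERROR", "message": "payment gateway timeout",
       "service": "checkout", "trace_id": trace_id}

span = {"trace_id": trace_id, "span_id": uuid.uuid4().hex[:16],
        "name": "POST /checkout", "duration_ms": 42.7}

# A structured (JSON) log line is what typically lands in a log backend:
print(json.dumps(log))
```

Given a latency alert on the metric, an engineer can pivot to the error log, then to the full trace, because all three describe the same request.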


The role of OpenTelemetry

One of the most relevant open source tools today is OpenTelemetry. Maintained by the CNCF, it provides:

  • Instrumentation libraries for multiple languages, with both manual and automatic options;
  • Collectors that act as agents—receiving, enriching, and exporting observability signals;
  • The OTLP protocol, which has become a de facto industry standard for telemetry data transfer.

With OpenTelemetry, it’s possible to instrument applications, collect telemetry in various formats (including Prometheus), and send it to different backends like OpenSearch, Prometheus, and more.
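As a rough illustration of that fan-out, a minimal OpenTelemetry Collector configuration along these lines receives OTLP signals, batches them, and exports metrics to Prometheus and logs to an OTLP-compatible endpoint (the endpoint URLs below are placeholders, not real addresses):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/remote_write  # placeholder
  otlphttp:
    endpoint: https://logs-pipeline.example.com:4318              # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```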

Open source means freedom, but also complexity

The CNCF maintains a rich ecosystem of open source observability tools, from data ingestion to visualization. But building and operating a fully open source stack requires time, expertise, and responsibility—covering infrastructure, upgrades, scalability, and security.

That’s where managed services come in. AWS aims to simplify operations while preserving the benefits of open technologies. Instead of managing your own Prometheus or Grafana instances, you can use fully managed versions—streamlining integration and scaling.

OpenSearch: from Elasticsearch to vector search

A key highlight was OpenSearch, a fork of Elasticsearch created in 2021 and now maintained under the Linux Foundation. It’s widely used for:

  • Log and time-series analysis;
  • Search solutions for e-commerce and complex platforms;
  • Vector search for generative AI applications.

AWS offers OpenSearch in two modes:

  • Provisioned: with customizable instances, VPC integration, and cluster sizing.
  • Serverless: automatically scales compute and storage resources independently.

OpenSearch also includes OpenSearch Ingestion, built on Data Prepper, for transforming and sending JSON-formatted data to the cluster.
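Whatever tool does the delivery, what ultimately reaches the cluster is newline-delimited JSON in the shape OpenSearch’s `_bulk` API expects: one action line followed by one document line per record. A small sketch of that format (the index name and documents are made up for illustration):

```python
import json

def to_bulk_ndjson(docs, index):
    """Format JSON documents as an OpenSearch _bulk request body:
    for each document, an action metadata line, then the document itself."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

body = to_bulk_ndjson(
    [{"level": "INFO", "msg": "user logged in"},
     {"level": "WARN", "msg": "slow query"}],
    index="app-logs",
)
print(body)
```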

Building a managed observability stack

The session also explored how AWS services can be integrated to build a comprehensive observability stack:

  • Amazon Managed Service for Prometheus: collects and stores metrics, with serverless pricing and Alertmanager support.
  • Amazon Managed Grafana: visualizes data from multiple sources (including CloudWatch, RDS, and OpenSearch), with authentication via AWS IAM Identity Center.
  • AWS Distro for OpenTelemetry (ADOT): AWS’s curated distribution with easy integration into EKS, ECS, and Lambda.

Real-world demo: OpenTelemetry in an EKS cluster

To make everything more tangible, Lucas presented a live demo of the “OpenTelemetry Demo” application running on EKS. The app, supported by a traffic generator, emitted telemetry signals processed by an OpenTelemetry Collector and sent to Prometheus and OpenSearch.

Using Grafana, the team correlated metrics, logs, and traces into unified dashboards, enabling:

  • Selecting a microservice and viewing its latency and error history;
  • Exploring specific traces with waterfall views and associated logs;
  • Using datalinks to jump between dashboards or open OpenSearch directly from Grafana.

All of this was achieved using Grafana variables to cross-reference data from Prometheus and OpenSearch, making incident investigation and data correlation faster and easier.

Rebuilding Nubank’s logging stack: Scaling and cost challenges

In the second half of the meetup, Caio and Otávio shared insights about the evolution of Nubank’s internal observability platform for logs — a journey shaped by rapid growth, limitations with external vendors, and strategic decisions to ensure cost efficiency and control over data.

The problem: Log volume growth and external vendor costs

With over 3,000 microservices and a growing customer base, Nubank began handling daily log volumes reaching half a petabyte. The original strategy — relying on an external SaaS vendor — began to fall short in two key areas:

  • Cost: Large-scale observability became one of the biggest infrastructure expenses.
  • Reliability: Ingestion failures and limited visibility into critical data directly impacted incident resolution.

The solution? Build a fully internal, highly scalable, resilient, and cost-effective platform.

A new platform for data ingestion

The first step in this restructuring was the creation of Observability Stream, our internal platform for collecting and processing telemetry data — starting with logs and later expanding to traces.

Technical requirements

The team established four core requirements:

  • Low ingestion latency (logs must be available in under 3 minutes).
  • Elastic scalability to handle traffic peaks like Black Friday.
  • Fault tolerance, ensuring no data loss under any circumstances.
  • Cost efficiency, aligned with Nubank’s culture of financial responsibility.

Micro-batching architecture

To balance performance and technical feasibility, the team implemented a micro-batching model with decoupled processing stages connected via queues (SQS). The flow includes:

  • Log collection using Fluent Bit.
  • Accumulation and transformation through internal services.
  • Storage in Amazon S3.
  • Auto-scaling based on queue lag.

This architecture brought robustness and modularity, paving the way for the next step: log querying.

Alexandria: Our internal log search platform

Once all logs were processed and stored, the next step was building Alexandria — Nubank’s internal platform for log querying, used by engineers across the company.

Scalable search with Trino and Parquet

The architecture relies on:

  • Log data stored in Parquet format on S3, with up to 95% compression.
  • A Trino-based query engine, optimized for massive volumes and high concurrency.
  • A continuous compaction service that aggregates millions of small files for improved performance and storage efficiency.
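The compaction idea itself is simple to sketch. The function below is an illustrative greedy grouping, not the actual service: it plans how many small files merge into outputs of roughly a target size (the real service rewrites small Parquet objects on S3; this only plans the grouping):

```python
def plan_compaction(files, target_size):
    """Greedy compaction plan: group many small files into fewer outputs
    of roughly target_size bytes. `files` maps file name -> size in bytes."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(files.items(), key=lambda kv: kv[1]):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

small = {f"part-{i}.parquet": 10 for i in range(10)}  # ten small files
plan = plan_compaction(small, target_size=40)
print(len(plan))  # 10 files become 3 compacted outputs
```

Fewer, larger files mean fewer object reads per query, which is why compaction pays off for both Trino performance and S3 request costs.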

Impact and results

  • 0.7 trillion log lines processed every day.
  • Over 600 TB ingested daily.
  • 14,000 queries per day scanning 150 TB of data.
  • 50% cost reduction compared to SaaS vendors.
  • Entire platform maintained by a lean team of five engineers.

Efficient observability with open source and the cloud

Nubank Engineering Meetup #11 offered a deep and practical dive into the world of observability with Open Source and AWS. It reinforced the importance of metrics, logs, and traces, while showing how to build a modern observability stack that blends the openness of community-driven tools with the convenience of managed cloud services.

With real-world examples, detailed architecture, and live demonstrations, the event was a valuable resource for engineers and platform teams looking to improve visibility and reliability across their systems.

Stay tuned for the next editions of the Nubank Engineering Meetup for more in-depth technical content on the challenges and solutions involved in building simple, secure, and innovative financial products.

Check our job opportunities