Written by: Elton Peixoto, Diogo Dantas.
Nubank Engineering Meetup #10 shined a spotlight on one of the most critical yet often under-discussed areas of modern cloud infrastructure: managing cloud limits for sustainable growth.
Held in São Paulo, this event drew tech enthusiasts, cloud specialists, and software engineers of all seniority levels, eager to discover how Nubank scales its global financial platform without sacrificing performance or reliability.
From AWS capacity management to Kubernetes orchestration and sharding best practices, the meetup offered a deep dive into the strategies and frameworks that keep Nubank’s systems both agile and cost-effective.
If you’re looking to future-proof your cloud architecture, harness multi-tenant deployments, or simply enhance your observability and cost optimization techniques, read on to explore the core lessons and cutting-edge solutions presented during the session.
The scale of Nubank
Nubank now serves over 110 million customers across Brazil, Colombia, and Mexico, which puts us on a global stage for digital financial services. To support this enormous user base, our infrastructure must handle tens of thousands of pods, billions of Kafka messages, and a constantly evolving web of microservices.
This level of scale does not happen overnight. Early on, our engineering teams leaned heavily on AWS for compute, storage, and networking. As our customer base grew, so did our reliance on scalable cloud infrastructure to accommodate surges in user transactions.
Today, we orchestrate more than 4,000 microservices with Kubernetes, process 72 billion daily events via Kafka, and routinely handle millions of requests per second. Each day, we keep a careful watch on resource usage to make sure we’re not only prepared for the next wave of growth but also optimizing costs for the long run.
From “Pangeia” to multi-account architectures
When Nubank was smaller, our infrastructure was grouped into a few massive AWS accounts. Internally, we called this “Pangeia” (Portuguese for Pangaea), evoking the image of a single supercontinent. Over time, however, we ran into several problems.
A minor incident could trigger a major outage, since everything was tied to a limited set of accounts. Isolating environments like staging and production also became harder, which in turn slowed down deployments.
In response, we shifted to a multi-account strategy, which we refer to as “continental drift.” Business domains and teams can now have their own AWS account, making it easier to isolate and manage services at a more granular level.
This change dramatically reduced what we call “blast radius”—the risk that one small issue can escalate into a large-scale failure. It also gave each domain the freedom to evolve independently, significantly improving reliability and speed to market.
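To make the layout concrete, here is a minimal sketch of how such a multi-account structure can be inspected with boto3 and AWS Organizations. The one-OU-per-business-domain layout it assumes is illustrative, not a description of Nubank’s actual account tree.

```python
# A minimal sketch of inspecting a multi-account layout with boto3 and
# AWS Organizations. The one-OU-per-business-domain layout assumed here
# is hypothetical; pagination is omitted for brevity.
import boto3

org = boto3.client("organizations")

# The organization root, under which organizational units (OUs) hang.
root_id = org.list_roots()["Roots"][0]["Id"]

# Walk each OU (e.g. one per business domain) and list its accounts.
for ou in org.list_organizational_units_for_parent(ParentId=root_id)["OrganizationalUnits"]:
    accounts = org.list_accounts_for_parent(ParentId=ou["Id"])["Accounts"]
    print(f"{ou['Name']}: {len(accounts)} account(s)")
    for account in accounts:
        print(f"  - {account['Name']} ({account['Id']})")
```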
Sharding all the things: Datomic, Kafka, and beyond
With separate AWS accounts providing a foundational layer, we also implemented a sharding model across our data and services. Sharding is often associated with databases, where you split large data sets into smaller, more manageable pieces.
Nubank took this idea a step further by sharding not just databases, but also key services and entire microservice clusters.
Our primary database, Datomic, runs multiple transactors in Kubernetes, each handling a subset of data. The same pattern applies to Kafka: we maintain multiple Kafka clusters for different shards to spread out messaging loads.
This approach serves two key goals: first, it prevents localized problems from spreading too broadly, since each shard is relatively independent; second, it allows us to forecast capacity in smaller, more predictable units. By analyzing usage in one shard, we can better project how others will behave.
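As a simplified illustration of the pattern, the sketch below assumes hash-based routing, a hypothetical shard count, and made-up naming conventions for per-shard Kafka and Datomic resources; our real assignment logic was not covered in that level of detail.

```python
# A minimal sketch of hash-based shard routing: a customer is pinned
# deterministically to one shard, and every shard-aware resource (Kafka
# cluster, Datomic database) is addressed by that shard index. The shard
# count and naming conventions here are hypothetical.
import hashlib

NUM_SHARDS = 8  # hypothetical

def shard_for(customer_id: str) -> int:
    """Deterministically map a customer to a shard index."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

customer = "customer-123"
shard = shard_for(customer)
kafka_bootstrap = f"kafka-shard-{shard}.internal:9092"  # hypothetical naming
datomic_db = f"transactions-shard-{shard}"              # hypothetical naming
print(f"{customer} -> shard {shard} via {kafka_bootstrap}, {datomic_db}")
```

Because the mapping is deterministic, a customer’s traffic always lands on the same shard, which is what makes per-shard capacity forecasting meaningful.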
Capacity management and AWS limits
Despite the popular notion that “the cloud is infinite,” every cloud provider imposes limits or quotas—ranging from load balancers and EC2 instances to overall storage capacity. At Nubank’s scale, those quotas can become a bottleneck if not carefully monitored.
Managing capacity is a shared responsibility across our engineering teams. Anyone spinning up a new microservice must consider how many CPU cores, how much memory, and which AWS resource types the service will require.
While the Infrastructure team provides best practices and automated monitoring, individual domain teams remain closely involved in cost and capacity analysis. This prevents unexpected spikes in resource usage and ensures that each service is right-sized for its workload.
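As an illustration of what explicit right-sizing looks like in practice, here is a minimal sketch using the official Kubernetes Python client; the service name, image, replica count, and CPU/memory figures are hypothetical.

```python
# A minimal sketch of declaring explicit CPU and memory requests/limits
# for a new microservice with the official Kubernetes Python client.
# The service name, image, replica count, and sizes are hypothetical.
from kubernetes import client, config

config.load_kube_config()

container = client.V1Container(
    name="example-service",
    image="example-service:1.0",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "512Mi"},  # guaranteed baseline
        limits={"cpu": "1", "memory": "1Gi"},         # hard ceiling
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="example-service"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "example-service"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "example-service"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

Declaring requests and limits up front is what makes cluster-level capacity forecasting tractable: schedulers and capacity dashboards can only plan around what services actually ask for.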
Detecting problems early: The AWS limit smoke detector
One of the most valuable tools we have built in-house is the AWS Limit Smoke Detector. Essentially, it’s an ETL pipeline that collects usage, quota, and cost data from every Nubank AWS account, then aggregates that information in BigQuery. By continuously comparing actual usage against our assigned quotas, this tool spots potential trouble long before we hit a critical threshold.
The pipeline references multiple AWS sources, including Trusted Advisor, Service Quotas, and CloudWatch, as well as direct API calls where necessary. It generates a historical view of usage, allowing us to forecast how soon we might reach a limit.
When an account’s usage trends upward at a rapid pace, the Smoke Detector issues alerts via dashboards and automatic messages, prompting us to take action—either by increasing quotas or optimizing the offending service’s resource consumption.
This proactive approach helps ensure that hitting an AWS cap never results in a major product incident. It also provides our Cost Management team with valuable insights into where budgets might need adjusting. Rather than waiting for a surprise AWS bill or a last-minute quota shortfall, we tackle potential overruns well in advance.
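To make the core check concrete, here is a minimal sketch using boto3, the Service Quotas API, and the usage metrics AWS publishes to CloudWatch. The quota code, threshold, and print-based alert are illustrative stand-ins for the real pipeline, which aggregates this data across all accounts in BigQuery.

```python
# A minimal sketch of the core smoke-detector check: compare current usage
# (from the AWS/Usage CloudWatch namespace) against the applied quota (from
# Service Quotas) and flag anything above a threshold. The quota code,
# threshold, and print-based "alert" are illustrative stand-ins.
from datetime import datetime, timedelta, timezone

import boto3

ALERT_THRESHOLD = 0.8  # alert at 80% utilization (hypothetical)

quotas = boto3.client("service-quotas")
cloudwatch = boto3.client("cloudwatch")

# Example quota: running on-demand Standard instances (EC2 vCPU limit).
quota = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
limit = quota["Quota"]["Value"]

# Current vCPU usage, as reported by AWS in the AWS/Usage namespace.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Usage",
    MetricName="ResourceCount",
    Dimensions=[
        {"Name": "Service", "Value": "EC2"},
        {"Name": "Type", "Value": "Resource"},
        {"Name": "Resource", "Value": "vCPU"},
        {"Name": "Class", "Value": "Standard/OnDemand"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)

datapoints = stats["Datapoints"]
if datapoints:
    usage = max(dp["Maximum"] for dp in datapoints)
    utilization = usage / limit
    if utilization >= ALERT_THRESHOLD:
        print(f"ALERT: EC2 vCPU usage at {utilization:.0%} of quota ({usage:.0f}/{limit:.0f})")
```

Run on a schedule, per account and per quota, a loop like this yields the historical series that the forecasting and alerting build on.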
Multi-tenant vs. single-tenant: A balanced approach
Throughout the meetup, we discussed how Nubank’s architecture mixes multi-tenant and single-tenant models. Multi-tenant platforms allow many in-house teams to share the same infrastructure—for example, multiple microservices can reuse the same Kubernetes clusters or the same Kafka broker pools. This pooling leads to more efficient resource utilization and often cuts down on operational overhead.
In certain scenarios, however, a single-tenant approach is chosen. Some business-critical workloads, particularly those with extremely specialized performance or security requirements, may need their own cluster or dedicated environment.
While single-tenant can be more expensive, it can also give teams the tight control and isolation they need for data-sensitive or high-traffic systems. Balancing these two models—and being strategic about which services go where—is an ongoing process that keeps our platform both flexible and cost-effective.
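One common way to get shared-cluster efficiency without giving up isolation is a namespace-per-tenant model with ResourceQuotas. The sketch below shows the general pattern with the official Kubernetes Python client; it is not a description of our actual setup, and the tenant name and quota values are hypothetical.

```python
# A minimal sketch of one common multi-tenant isolation pattern: each
# tenant team gets a namespace capped by a ResourceQuota, so a shared
# cluster stays fairly divided. The tenant name and values are hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

tenant = "payments-team"  # hypothetical tenant/domain name

core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=tenant))
)

core.create_namespaced_resource_quota(
    namespace=tenant,
    body=client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name=f"{tenant}-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={"requests.cpu": "40", "requests.memory": "80Gi", "pods": "200"}
        ),
    ),
)
```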
Looking ahead
Nubank’s approach to managing cloud limits underscores the importance of proactive capacity planning, shard-based isolation, and multi-account strategies in any large-scale infrastructure. As demand continues to rise, ensuring stable performance while containing costs is more than just a technical challenge—it’s a competitive advantage.
By combining Kubernetes orchestration, AWS quota monitoring, and a shared responsibility model across teams, Nubank consistently delivers a resilient and frictionless experience to millions of customers.
Whether you’re a budding fintech startup or an established enterprise, the principles and real-world insights from Nubank Engineering Meetup #10 can guide you toward more robust, scalable, and cost-conscious cloud operations. Stay connected for upcoming meetups, where we’ll delve deeper into the technologies and practices driving innovation at Nubank and beyond.