At Nubank, the reliability squad is continuously seeking to improve our incident management procedure by providing tools, better processes, and much more. Our goal is to support our engineers on the journey to mitigate operational issues in a healthy environment, based on a blameless culture and in compliance with all the regulatory rules that apply to financial companies.
Any kind of issue affecting our systems that, in some way, impacts our customers can be considered a technical incident; it’s identified by our monitoring systems and must be fixed as soon as possible by our engineering team.
An incident can be divided into two parts: the first is handling the incident itself, and the second is what happens afterwards, such as action plans. Let’s take a tour and see how we deal with these situations, which we try to avoid but which occasionally happen.
Identifying an Incident
Our alerting system is a subject for another post but, in short, squads can create custom alerts for their services, and each service also has a set of default alerts, such as “service down”. Squads are notified in their Slack channel, and the on-call engineer from the squad responsible for the system is paged by OpsGenie. If an incident is identified, they have to start working on it immediately.
Opening a Crash
We follow a simple framework in which the first step is to “open a crash”. This means notifying the entire company that we are facing an incident and that Nubankers are already dealing with it.
Identified incidents are reported through a bot on Slack (our main internal communication tool). This automation centralizes all the management of the incident: people use it to create, edit, and close crashes. The main benefits of using it are organizing the situation, triggering the other stakeholders (such as the risk and compliance team), and giving the company proper visibility. Besides that, we are also able to get data about incidents to extract key metrics, like our MTTR (one of the Accelerate metrics).
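To make the MTTR metric mentioned above concrete, here is a minimal sketch of how a mean time to restore could be computed from incident open and close timestamps. The function name, the input shape, and the sample data are assumptions made for illustration; this is not the bot’s actual implementation.

```python
from datetime import datetime, timedelta
from typing import Optional


def mean_time_to_restore(
    incidents: list[tuple[datetime, Optional[datetime]]],
) -> timedelta:
    """Average duration of closed incidents.

    Each incident is a hypothetical (opened_at, closed_at) pair.
    Incidents that are still open (closed_at is None) are skipped.
    """
    durations = [closed - opened for opened, closed in incidents if closed is not None]
    if not durations:
        raise ValueError("no closed incidents to average")
    # Summing timedeltas needs an explicit timedelta() start value.
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data: two closed incidents (30 and 90 minutes) and one still open.
sample = [
    (datetime(2021, 1, 5, 10, 0), datetime(2021, 1, 5, 10, 30)),
    (datetime(2021, 1, 12, 14, 0), datetime(2021, 1, 12, 15, 30)),
    (datetime(2021, 1, 20, 9, 0), None),
]

print(mean_time_to_restore(sample))  # 1:00:00
```

Skipping open incidents keeps the average from being skewed by crashes that are still being worked on; a real pipeline might instead report them separately.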
Before opening a crash, the engineer involved needs to understand its severity level, classifying it between 1 (critical incident) and 5 (cosmetic issue). These classifications include criteria regarding availability, number of customers affected, product affected, regulatory matters, and others.
The main information needed to open a crash is:
After the crash is submitted, a summary of the incident is posted in Slack, notifying the appropriate teams about the crash while engineers work on fixing it.
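As a rough illustration of the severity scale and the posted summary, the sketch below models a crash record with a 1-to-5 severity. Only the endpoints of the scale (1 is critical, 5 is cosmetic) come from the text; the class names, the field names, the labels for levels 2 to 4, and the summary format are all hypothetical and are not the real form fields or the bot’s output.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Severity scale: 1 is a critical incident, 5 is a cosmetic issue."""
    CRITICAL = 1
    LEVEL_2 = 2   # intermediate labels are placeholders, not real names
    LEVEL_3 = 3
    LEVEL_4 = 4
    COSMETIC = 5


@dataclass
class Crash:
    title: str            # hypothetical field
    severity: Severity
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    closed_at: Optional[datetime] = None

    def summary(self) -> str:
        """One-line summary in a made-up format, e.g. for a chat message."""
        status = "closed" if self.closed_at else "open"
        return f"[SEV{int(self.severity)}] {self.title} ({status})"

    def close(self) -> None:
        """Mark the crash as closed at the current time."""
        self.closed_at = datetime.now(timezone.utc)


crash = Crash(title="Example service is down", severity=Severity(1))
print(crash.summary())  # [SEV1] Example service is down (open)
crash.close()
print(crash.summary())  # [SEV1] Example service is down (closed)
```

Using an `IntEnum` means a value outside the scale, such as `Severity(6)`, raises a `ValueError` before a crash can be created.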
Working on it
In this step, as you may imagine, anything can happen. People usually open a voice call and start debugging and fixing the issue, operations teams start preparing clear explanations for our customers, and the engineering team focuses on mitigating the impact and restoring the system to its proper state.
At this point, it’s important that everyone who is able to help gets involved (especially in high-severity incidents), and that the Nubanker in charge of comms keeps updating the incident thread with news, so everyone in the company can follow it in real time.
After the crash is completely fixed, and nothing unusual is happening, the crash can be closed using our bot and everything is fine again!
Blameless culture and Postmortem
A postmortem is essential in incident management. Its main objective is to ensure that the company learns from crashes, records them, and shares knowledge about them.
At Nubank, we write a postmortem for every high-severity crash, and we recommend one for all severities. After the crash is closed, engineers should write a document about it, following a specific template, with these topics:
After this document is published, it’s available for the entire company to read and learn from, and engineers start to work on the action plan to prevent the crash from happening again.
We wouldn’t have a healthy environment to deal with crashes and postmortems if we didn’t live in a blameless culture: we don’t try to find a culprit, but rather try to understand what happened and what needs to be done so that it doesn’t happen again.
As a celebration of our blameless and postmortem culture, we have a monthly meeting with the entire company, where people involved in some of that month’s crashes share the lessons learned and the actions to be taken.
A common way of reacting to incidents at Nubank is to say “fascinante” (“fascinating” in English) while putting your hands above your head (it’s a Slack reaction now that we work from home). This truly symbolizes the way we deal with incidents here: they sometimes happen, but when they do, we consider them fascinating, and we love to learn from them.
This is a picture of this meeting before the pandemic with everybody reacting with “fascinante”:
Final thought
Our incident management process is constantly being updated so that it always works in the simplest and most effective way. Future changes will happen (they’re always happening), but more important than the process is the culture: people acting blamelessly, helping each other, and always trying to improve and to provide our customers with the best experience possible.
Blameless culture is the most important aspect of our incident management.