Uber incident management meetup (Site Reliability Engineering — SRE)
Uber hosted a meetup to talk about incident management. It was an introduction to a company that lives on microservices, and to how it survives when the person who wrote a service has disappeared and a meteor has destroyed a data centre.
Sumbry kicked us off with an overview of what he does and why he does it. He wanted a mission with interesting challenges and growth. Uber doesn’t want you to have cars. Uber is growing to be ubiquitous across the globe. Tick. Tick.
He went on to talk about the cycle of feelings at Uber: Excitement, Doubt (of yourself), Justification (of why you are trying so hard — ‘changing the world’), Impostor (the inevitable syndrome of engineering) and Acceptance (it is hard but that is why they have you). Then repeat.
Next came Uber and its affinity for microservices. Thousands of engineers. Thousands of microservices. Dozens of product teams. Dozens of infrastructure teams. Small teams are everything. They have freedom: no commitments, no rigid roadmaps. Go and Java are the tier-one technologies that are encouraged; after these come Python, Node.js, Cassandra, Kafka, etc. A side note: Uber typically uses Docker with Mesos, plus an in-house solution similar to Istio as a service mesh.
Daniel Simmons and Nick Lee
We were shown a dependency graph of all the microservices used across Uber. It’s huge. That causes a problem: the repos are decentralised, and so are the knowledge and expertise. Each service is a black box. That may be fine during development, but it isn’t when you are supporting the service.
There is usually an incident in which one simple change, which shouldn’t be able to break everything, does.
To help, there is Ring0: a volunteer group inside Uber. The team focuses on stopping failures as quickly as possible. To do this, volunteers need training as well as the power to carry out the various roles (shown in the image).
Ring0 members have root-level privileges. They have big levers: the power to degrade or shut down features, or to move load to another data centre.
They are building tooling to reduce cognitive load through two commands stemming from a CLI package called omg to respond to a multitude of incidents.
Rule 1: They always keep enough spare capacity (at least 1x) to move load over to another data centre. In fact, an entire data centre went down because of a power cut. In jumped Ring0 with the CLI tool, and users barely noticed anything because load was moved quickly.
Rule 2: Test this ability, regularly.
Rule 3: Use OODA loops - observe, orientate, decide and act. What’s broken? What can we do? Is that okay? Did it work? What is broken now?...
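Rule 1 lends itself to a mechanical check that can be run regularly, in the spirit of Rule 2. The following is a minimal sketch, assuming each data centre is described by a capacity and a current load in the same unit. The `DataCentre` type, the function names and the figures are illustrative assumptions, not Uber’s tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCentre:
    name: str
    capacity: float  # maximum load the site can serve (e.g. requests/second)
    load: float      # load it is serving right now


def can_survive_loss(sites, failed):
    """True if the remaining sites have enough headroom to absorb the failed site's load."""
    lost_load = sum(s.load for s in sites if s.name == failed)
    headroom = sum(s.capacity - s.load for s in sites if s.name != failed)
    return headroom >= lost_load


def weakest_links(sites):
    """Names of the sites whose loss could not be absorbed by the others."""
    return [s.name for s in sites if not can_survive_loss(sites, s.name)]


# Hypothetical figures: either site can take over for the other.
healthy = [DataCentre("dc1", 100, 40), DataCentre("dc2", 100, 50)]
print(weakest_links(healthy))  # []
```

A real check would also have to account for how load is routed between regions; this sketch only compares totals.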
So what can we do? Organise so that the right response and the right tools are in place.
Assess (confirm the issue), mitigate (take tasks off the people doing the fix), delegate and communicate.
omg (outage mitigator) is a CLI tool built in-house by Uber. What are your supporting tools?
Note: It has a plugin-based architecture, so it can be extended to cover new services.
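The talk did not show how omg is built, so the following is only a sketch of what a plugin-based mitigation CLI could look like. Every name here (`plugin`, `run`, the `failover` and `degrade` commands) is a hypothetical stand-in, not omg’s real interface.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a command name to the function that performs it.
_PLUGINS: Dict[str, Callable[[str], str]] = {}


def plugin(name: str):
    """Decorator that registers a mitigation under a command name."""
    def register(func):
        _PLUGINS[name] = func
        return func
    return register


@plugin("failover")
def failover(target: str) -> str:
    # A real plugin would call traffic-management systems; this only reports intent.
    return f"moving load away from {target}"


@plugin("degrade")
def degrade(target: str) -> str:
    return f"degrading feature {target}"


def run(command: str, target: str) -> str:
    """Dispatch a command to its plugin, failing loudly on unknown commands."""
    if command not in _PLUGINS:
        raise ValueError(f"unknown command {command!r}; available: {sorted(_PLUGINS)}")
    return _PLUGINS[command](target)


print(run("failover", "dc1"))  # moving load away from dc1
```

Supporting a new service then means adding one decorated function, without touching the dispatcher.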
Other tools: Black box metrics, white box metrics, tracing and then logs.
Black box: a probe that mimics customer behaviour in production. These tests may give you an early warning you wouldn’t otherwise get. Has sign-in broken? That is hard to find out without a heartbeat test. You could go one better by making the probes location- or action-focused. Simulate your customer.
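A heartbeat test of this kind can be very small. The sketch below assumes the customer action is passed in as a plain function, so the probe itself stays independent of any HTTP client. `run_probe`, `ProbeResult` and the two stand-in sign-in functions are illustrative names only.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_s: float
    detail: str = ""


def run_probe(name: str, action: Callable[[], None], max_latency_s: float = 2.0) -> ProbeResult:
    """Run one simulated customer action and record whether it succeeded in time."""
    start = time.monotonic()
    try:
        action()
    except Exception as exc:  # any failure of the customer journey counts as "down"
        return ProbeResult(name, False, time.monotonic() - start, repr(exc))
    latency = time.monotonic() - start
    if latency > max_latency_s:
        return ProbeResult(name, False, latency, "too slow")
    return ProbeResult(name, True, latency)


def fake_sign_in() -> None:
    """Stand-in for a real sign-in request against production."""


def broken_sign_in() -> None:
    """Stand-in for a sign-in request that cannot reach the service."""
    raise ConnectionError("sign-in endpoint unreachable")


print(run_probe("sign-in", fake_sign_in).ok)    # True
print(run_probe("sign-in", broken_sign_in).ok)  # False
```

Running the same probe from several locations, or with different actions, gives the location- and action-focused variant.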
White box: business/flow metrics (the number of x happening over time: is it down? are we healthy?). If you can quantify damage in dollar terms, it can add urgency to the response. Design your dashboards in layers, so that each click dives further into the metrics.
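Putting a dollar figure on an incident is simple arithmetic once a flow metric has a known baseline. This is a rough sketch, assuming a steady baseline rate and an average value per event; the function name and all the numbers are made up for illustration.

```python
def estimated_loss(baseline_per_min, observed_per_min, minutes, value_per_event):
    """Rough revenue impact: missing events multiplied by their average value."""
    missing_per_min = max(baseline_per_min - observed_per_min, 0)
    return missing_per_min * minutes * value_per_event


# Hypothetical incident: 1000 events/min normally, 400 during a 15-minute outage,
# each event worth 12.50 on average.
print(estimated_loss(1000, 400, 15, 12.5))  # 112500.0
```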
Start with the golden signals: traffic (throughput), errors, latency and saturation. Make it understandable to everyone. Expect no prior knowledge. Put descriptions on everything in Kibana.
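To make the golden signals concrete, here is a minimal sketch that summarises one time window of requests. It assumes saturation can be expressed as in-flight requests over a fixed limit, which is a simplification, and the `Request` type and field names are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_s, in_flight, max_in_flight):
    """Summarise one time window into the four golden signals."""
    saturation = in_flight / max_in_flight
    if not requests:
        return {"traffic_rps": 0.0, "error_rate": 0.0,
                "p95_latency_ms": 0.0, "saturation": saturation}
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(latencies))
    return {
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        "p95_latency_ms": latencies[rank - 1],
        "saturation": saturation,
    }


# Ten hypothetical requests over five seconds, the slowest of which failed.
sample = [Request(10.0 * i, i != 10) for i in range(1, 11)]
print(golden_signals(sample, window_s=5, in_flight=30, max_in_flight=120))
```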
Tracing: every issue with microservices is a murder mystery. You need to know the dependencies, the latencies, and where the errors are.
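A trace turns the murder mystery into a tree of timed spans. The sketch below assumes a simplified span record (an ID, a parent ID, a service name, a duration and an error flag) and that child spans run one after another. It is not tied to any particular tracing system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float
    error: bool = False


def self_times(spans):
    """Time spent in each span excluding its direct children (assumes sequential children)."""
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def prime_suspects(spans):
    """Return (service with the most self time, services where an error originated)."""
    own = self_times(spans)
    slowest = max(spans, key=lambda s: own[s.span_id]).service
    # An erroring span with no erroring child is where the failure started.
    failed_parents = {s.parent_id for s in spans if s.error}
    origin = [s.service for s in spans if s.error and s.span_id not in failed_parents]
    return slowest, origin


# Hypothetical trace: the error bubbles up from pricing through trips to the gateway.
trace = [
    Span("a", None, "gateway", 500, error=True),
    Span("b", "a", "trips", 450, error=True),
    Span("c", "b", "pricing", 400, error=True),
    Span("d", "b", "maps", 20),
]
print(prime_suspects(trace))  # ('pricing', ['pricing'])
```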
Logs: this really is the trenches, as the other tools should have solved the problem before you get here. Sentry can aggregate errors.
Note: Event or audit logs are useful (mini-reports on changes after merging or deploying, for example). The majority of issues stem from somebody changing something, and these logs are a way to find out what. Deploys and configuration changes should be monitored here.
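Since most incidents follow a change, a useful first query against an audit log is "what changed just before this started?". This is a minimal sketch, assuming each change is recorded with a timestamp, a kind, a service and an author; the `ChangeEvent` type, the 30-minute window and the sample data are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    kind: str      # e.g. "deploy" or "config"
    service: str
    author: str


def recent_changes(events, incident_start, lookback=timedelta(minutes=30)):
    """Changes that landed shortly before the incident, most recent first."""
    window_start = incident_start - lookback
    hits = [e for e in events if window_start <= e.at <= incident_start]
    return sorted(hits, key=lambda e: e.at, reverse=True)


# Hypothetical audit log around an incident that started at 12:00.
incident = datetime(2020, 1, 1, 12, 0)
audit_log = [
    ChangeEvent(datetime(2020, 1, 1, 11, 50), "deploy", "pricing", "alice"),
    ChangeEvent(datetime(2020, 1, 1, 11, 58), "config", "maps", "bob"),
    ChangeEvent(datetime(2020, 1, 1, 9, 0), "deploy", "trips", "carol"),
    ChangeEvent(datetime(2020, 1, 1, 12, 5), "deploy", "gateway", "dan"),
]
for change in recent_changes(audit_log, incident):
    print(change.at.time(), change.kind, change.service, change.author)
```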
Looking at logs: Uber has built an internal tool that digests information on system errors and creates a visualisation linking system dependencies. It also provides comparisons of good behaviour against bad. Red boxes are placed on erroring items. It also recognises what should have been called but wasn’t, colouring those items grey.
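The internal tool was not shown in detail, but the red/grey idea can be sketched as a comparison between a known-good set of calls and the calls observed during the incident. This assumes calls are represented as (caller, callee) pairs; the function name and the "ok" label for healthy calls are assumptions added for illustration.

```python
def colour_calls(baseline, observed, erroring):
    """Label every call edge: red if erroring, grey if expected but missing, ok otherwise."""
    colours = {}
    for edge in baseline | observed:
        if edge in erroring:
            colours[edge] = "red"
        elif edge not in observed:
            colours[edge] = "grey"
        else:
            colours[edge] = "ok"
    return colours


# Hypothetical graphs: pricing is failing, so the call it should make to tax never happens.
good_run = {("gateway", "trips"), ("trips", "pricing"), ("pricing", "tax")}
bad_run = {("gateway", "trips"), ("trips", "pricing")}
errors = {("trips", "pricing")}
for edge, colour in sorted(colour_calls(good_run, bad_run, errors).items()):
    print(edge, colour)
```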
When microservices are created they must use standardised libraries so that logging is compatible. These libraries must be kept up to date. A monorepo was mentioned as a way to help with this. SRE engages with all flows and service owners so that it is involved from conception.
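What a standardised logging library buys you is that every service emits the same fields in the same shape. This is a minimal sketch using Python’s standard `logging` module; the field set and the `get_logger` helper are assumptions for illustration, not Uber’s library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record with the same machine-readable fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }, sort_keys=True)


def get_logger(service: str) -> logging.Logger:
    """Shared entry point that every service would call instead of configuring logging itself."""
    logger = logging.getLogger(service)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


get_logger("trips").error("payment %s failed", "p1")
```

Because the format lives in one shared place, changing it means updating the library rather than every service, which is where keeping the library current (or using a monorepo) comes in.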
Security is completely separate to failure logging.
- Mitigate first.
- Make protocols now, not afterwards.
- Automate all tests, tracking and actions.
- Invest in tooling (though do judge business value)
- Go was best on performance, and it demands consistency in form, so it is easier to jump into another person’s code.
- There is a postmortem procedure document, which reflects on process and is completely blameless towards individuals.