🤖 Rosie the Robot: ChatOps for Incident Response

Automating timeline generation for blameless postmortems

Richard Li
Ambassador Labs
May 14, 2021

Here at Ambassador Labs, we take operational excellence very seriously. Thousands of organizations rely on our software and cloud systems, so we’re constantly engineering for availability.

Case in point: while our API Gateway is deployed as a container inside your Kubernetes cluster, you need to get your images from our container registry. We’ve found that all container registries have downtime. And if your Kubernetes cluster doesn’t have our container images cached, registry downtime can cause an outage in your cluster. So, as part of our release process, we now push images to multiple registries: Docker Hub as our primary, but also Google Container Registry as a secondary source.
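To make the multi-registry idea concrete, here is a minimal sketch of a release step that builds the `docker tag` and `docker push` commands for every registry. The registry prefixes, the image name, and the `push_commands` helper are assumptions for illustration only; the article does not show our actual release tooling.

```python
from typing import List, Sequence

# Hypothetical registry prefixes; the real repository paths are not given here.
REGISTRIES = (
    "docker.io/example-org",   # primary (Docker Hub)
    "gcr.io/example-project",  # secondary (Google Container Registry)
)


def push_commands(image: str, tag: str,
                  registries: Sequence[str] = REGISTRIES) -> List[List[str]]:
    """Return the docker commands that publish one local image to every registry."""
    local_ref = f"{image}:{tag}"
    commands: List[List[str]] = []
    for registry in registries:
        remote_ref = f"{registry}/{image}:{tag}"
        # Re-tag the locally built image for this registry, then push it.
        commands.append(["docker", "tag", local_ref, remote_ref])
        commands.append(["docker", "push", remote_ref])
    return commands


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch has no side effects.
    for cmd in push_commands("gateway", "1.0.0"):
        print(" ".join(cmd))
```

A real pipeline would execute these commands (for example with `subprocess.run`) and fail the release if any push fails, so that both registries always hold the same tags.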

Nonetheless, engineering for availability is not a panacea. Incidents are a fact of life. So as important as it is to engineer for availability, it’s also important to build robust incident response and postmortem processes so you continue to improve.

🤖 Rosie the Robot

Written by our SVP Engineering, Bjorn Freeman-Benson, Rosie the Robot is a Slack bot for incident response and postmortems.

Rosie the Robot APP 7:59 PM
Usage:
/rosie create channel <incident description>
/rosie list runbooks
/rosie open memory
/rosie open runbook
/rosie recommend runbook
/rosie start runbook [name]
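For readers curious how a bot might interpret the usage text above, here is a rough sketch of a parser for those commands. It is not Rosie's actual code: the `parse_command` function and its return shape are hypothetical, and only the command names come from the usage message.

```python
from typing import Optional, Tuple

# Command names taken from the usage message above.
COMMANDS = (
    "create channel",
    "list runbooks",
    "open memory",
    "open runbook",
    "recommend runbook",
    "start runbook",
)


def parse_command(text: str) -> Optional[Tuple[str, str]]:
    """Split '/rosie <command> [argument]' into (command, argument).

    Returns None when the text is not a recognized /rosie command.
    """
    words = text.split()
    if not words or words[0] != "/rosie":
        return None
    rest = " ".join(words[1:])
    for command in COMMANDS:
        # Match the whole command, optionally followed by a free-text argument.
        if rest == command or rest.startswith(command + " "):
            return command, rest[len(command):].strip()
    return None


if __name__ == "__main__":
    print(parse_command("/rosie create channel web outage"))
    print(parse_command("/rosie start runbook"))
```

The argument is returned as a plain string, so the same parser covers both the required incident description and the optional runbook name.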

When an incident is declared, our incident first responder uses Rosie to create a Slack channel to discuss the incident. The entire incident response team hops into this channel to troubleshoot the issue.

Once an issue is identified, the incident response team uses Rosie to start the appropriate runbook. Our runbooks are structured as a series of checklists, e.g., acknowledge incident, assess impact, notify users, and so forth. The incident response team methodically goes through each step of the runbook, checking off each item as it is completed.
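A runbook structured as a checklist can be modeled quite simply. The sketch below is an illustration under our own assumptions, not Rosie's internals: the `Runbook` class and its method names are hypothetical, and the step names are the examples mentioned above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Runbook:
    """A named, ordered checklist that remembers when each step was checked off."""
    name: str
    steps: List[str]
    completed: Dict[str, datetime] = field(default_factory=dict)

    def complete(self, step: str, when: Optional[datetime] = None) -> None:
        """Check off a step, keeping the first completion time if repeated."""
        if step not in self.steps:
            raise ValueError(f"unknown step: {step!r}")
        self.completed.setdefault(step, when or datetime.now(timezone.utc))

    def remaining(self) -> List[str]:
        """Steps not yet checked off, in runbook order."""
        return [s for s in self.steps if s not in self.completed]

    def is_done(self) -> bool:
        return not self.remaining()


if __name__ == "__main__":
    runbook = Runbook("web-outage", ["Acknowledge incident", "Assess impact", "Notify users"])
    runbook.complete("Acknowledge incident")
    print(runbook.remaining())
```

Storing a completion time per step is what lets a bot turn the checklist into timeline entries later.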

One of our runbooks

Blameless Postmortems

Behind the scenes, Rosie records a detailed timeline of events. When a step of the runbook is completed and checked off, Rosie records the time and event. In addition, any person responding to the incident can add a 🤖 emoji to any Slack message, which will also prompt Rosie to record the particular message in her timeline. Once the incident is resolved, Rosie will produce a full timeline of the entire incident. Here’s an example:

UTC         | Who   | What
------------+-------+------------------------------------------
04-28 01:44 | Bjorn | Created channel #incident-2021-web-outage
04-28 01:45 | Bjorn | Set Sev 1
04-28 01:45 | Bjorn | Set Code Orange
04-28 01:46 | Bjorn | Completed checklist "Assess Impact"
04-28 01:48 | Alex  | I think the issue is the CDN.
...
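The timeline recording described above can be sketched as an append-only list of events that is sorted and rendered once the incident is resolved. This is an illustrative sketch, not Rosie's implementation: the `Event` and `Timeline` names are hypothetical, and the Slack wiring (checklist completions and 🤖 reactions calling `record`) is assumed rather than shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Event:
    when: datetime  # timezone-aware timestamp
    who: str
    what: str


class Timeline:
    """Collects incident events and renders them as a plain-text table."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def record(self, who: str, what: str, when: Optional[datetime] = None) -> None:
        # Assumed to be called when a checklist item is completed or a
        # message is flagged with the robot emoji.
        self.events.append(Event(when or datetime.now(timezone.utc), who, what))

    def render(self) -> str:
        ordered = sorted(self.events, key=lambda e: e.when)
        who_width = max([len("Who")] + [len(e.who) for e in ordered])
        lines = [f"{'UTC':<11} | {'Who':<{who_width}} | What"]
        lines.append("-" * 12 + "+" + "-" * (who_width + 2) + "+" + "-" * 20)
        for e in ordered:
            stamp = e.when.astimezone(timezone.utc).strftime("%m-%d %H:%M")
            lines.append(f"{stamp} | {e.who:<{who_width}} | {e.what}")
        return "\n".join(lines)


if __name__ == "__main__":
    timeline = Timeline()
    timeline.record("Bjorn", "Set Sev 1", datetime(2021, 4, 28, 1, 45, tzinfo=timezone.utc))
    timeline.record("Alex", "I think the issue is the CDN.",
                    datetime(2021, 4, 28, 1, 48, tzinfo=timezone.utc))
    print(timeline.render())
```

Sorting at render time means events can be recorded in any order, which matters when someone flags an older message partway through the incident.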

After every incident, the response team will conduct a postmortem. The detailed timeline provided by Rosie gives a clear view of all communications and decisions made. During the postmortem, opportunities for corrective action and process improvements are identified. These actions are then assigned to engineers.

ChatOps for operational excellence

Importantly, Rosie is not just a way to improve our incident response. Rosie helps us implement ChatOps by integrating communications into our incident response workflow. In other words, Rosie codifies our social conventions for incident response. These conventions are easy to forget in the heat of an incident, but keeping them top of mind is critical to making sure nothing is missed.

Looking to work on Kubernetes?

Rosie and ChatOps are just one small part of our focus on operational excellence. We drink our own champagne 🍾, using Telepresence, Argo, and Edge Stack in production. We hold regular game days, simulating incidents. And much more.

If you’re looking to work on mission-critical Kubernetes infrastructure, we’re hiring engineers (and all sorts of other roles).


