Crisis Engineering in 5 Steps

Nov 14, 2023

We’ve written about the structure of a typical Layer Aleph project. Below is what we are trying to accomplish inside that structure. Organizational research calls this sensemaking or an OODA loop.

1. Talk to everyone. Make a map. You want the systems as they work in practice, with all the undocumented / shortcutted / unofficial connections. When you’re done you should be able to trace the critical flow of work all the way through, and know what actions happen along the way. Consensus and breadth are more important than perfect accuracy. Keep asking questions until answers start to converge. Get real metrics from live systems whenever possible. Numbers in reports are lies.
2. Get all the experts and an authorized decision maker in one place, sharing a consensus reality. As you make your map, collect names of key technicians, administrators, and decision makers for each step. They need to know how to make changes or take other action on their part of the map, and understand any direct interconnections. Try to get a small set of people with the broadest knowledge, skill, and authority in one place, looking at the same map and data sources. Ideally this is in person, but a chat channel or videoconference can work as well. It must be possible to adjudicate almost any decision without escalating outside this group.
3. Understand the operational timeline. What is the deadline for mitigation or recovery? What does success look like? Often this deadline is externally imposed, but it must be clear to your crisis engineering team. Broadcast the deadline and goals; doubly broadcast when they change.
4. Try something. This step is critical. It is impossible to understand a system without trying to change it. In many projects, our first mitigation is subtractive: turning something off, eliminating a workflow, blocking certain kinds of access. Did it work? Did it kind of work? Did a new problem appear somewhere else? Unless you’ve made things a lot worse (see 5), focus on updating your team’s map to include the new discoveries. Maybe you need new experts? Maybe a new set of solutions is visible? Goto 1.
5. Do not create new problems. Specifically, don’t create new unknown problems. Intentionally turning off the system for a few days (or even weeks!) to clear a backlog is fine if you understand the consequences.
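
If it helps to make the map concrete, one way to hold it is as a small directed graph. The sketch below is only an illustration, not how we actually record maps: the step names, the owners, and the helpers `trace_flow`, `undocumented_edges`, and `roster` are all made up for the example. It traces the critical flow from one end to the other, lists the unofficial connections, and pulls out the people you want in the room.

```python
from collections import deque

# Hypothetical map of a claims-processing flow; every name here is invented.
# Each edge is (next_step, documented). The undocumented edges are the
# unofficial connections that only come out when you talk to people.
SYSTEM_MAP = {
    "intake form": [("mainframe queue", True)],
    "mainframe queue": [("nightly batch job", True), ("ops spreadsheet", False)],
    "ops spreadsheet": [("payment system", False)],
    "nightly batch job": [("payment system", True)],
    "payment system": [],
}

# Hypothetical owners: who can actually change each part of the map.
OWNERS = {
    "intake form": "Dana",
    "mainframe queue": "Lee",
    "ops spreadsheet": "Sam",
    "nightly batch job": "Lee",
    "payment system": "Priya",
}


def trace_flow(system_map, start, end):
    """Breadth-first search: return a shortest path of steps, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for next_step, _documented in system_map.get(node, []):
            if next_step not in seen:
                seen.add(next_step)
                queue.append(path + [next_step])
    return None


def undocumented_edges(system_map):
    """List every (source, destination) connection nobody wrote down."""
    return [
        (source, destination)
        for source, edges in system_map.items()
        for destination, documented in edges
        if not documented
    ]


def roster(path, owners):
    """Names of the people who own a step on the path, without repeats."""
    names = []
    for step in path:
        if owners[step] not in names:
            names.append(owners[step])
    return names


if __name__ == "__main__":
    flow = trace_flow(SYSTEM_MAP, "intake form", "payment system")
    print("critical flow:", " -> ".join(flow))
    print("unofficial connections:", undocumented_edges(SYSTEM_MAP))
    print("people to get in one place:", roster(flow, OWNERS))
```

If `trace_flow` returns `None`, the map has a gap and there are more people to talk to. When you try something and learn something new, edit the map and trace again.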

That’s it. Speed is important: we try to complete the first pass of these steps in a day or two. We believe this process can be taught. If you’d like to learn more, sign up for our workshop.