Most of our jobs at Layer Aleph have been to fix a specific, urgent problem. We founded the company hoping to be a kind of Titan Salvage for computers—we are available on short notice, we go where most people won’t, and we take contracts where the deliverables are do-or-die outcomes with no points for effort. By various paths, our careers have rendered us specialists in incident response and self-sufficiency, and we noticed there aren’t many outfits like that on the consulting market. There’s plenty that will send you snappily dressed 25-year-olds to work on PowerPoint slides, and we don’t compete with that. We wouldn’t win on any dimension.
Last summer, though, we took a job (#18) closer to PowerPoint, because our history in Google SRE was useful to the customer. This was a multinational corporation with hundreds of thousands of employees, in the finance industry. At the point we arrived, the tech department was several years into a strategic plan to reorganize around “SRE,” and one could say there had been a loss of momentum.
Our first week was much the same as it always is. We went on location, debriefed 48 employees and team leads of various kinds, and tried not to block the aisles on the operations floor during a ride-along. Information-gathering benefits from our motley appearance: Carla looks like the principal software engineer who works from a fancy coffee shop, Weaver looks more or less like an ex-spook who has a freezer full of deer freshly hunted from his back acres, Marina looks like the best-selling business author with a speaking agent, and Mikey looks like an itinerant adjunct professor who sleeps in a tent. (Which is to say, each of us looks like what we are.) Most people that we need to approach will be comfortable talking to at least one of us.
Once we had a reasonable handle on the current state of the organization, it was clear that our first task was to help this company decide what exactly SRE does here. Or maybe what it is. Is it a job description? A book? A set of memorized jargon? A bag of tools? A big AWS bill? A bimonthly curated box of snacks? The tech organization had tens of thousands of employees, and there were pockets of people fixated on each of these things, most of them doing their level best to “do SRE.”
This was a new situation for us. Most tech jobs are clearly sortable into two types: those at tech-focused companies, and those at companies where the technology is “only” there to support some kind of real-world operation. At tech-focused companies, SRE is already established or at least recognized as a guild. Nobody worries about what it “is” any more than towns in Wisconsin worry about what Presbyterianism “is.” At non-tech companies, nobody could care less about which jargon word the IT people are talking about this month.
This financial company, however, was caught between worlds. Market operations consist entirely of computers talking to computers now, yet these institutions have too much inertia and capital to be seriously threatened by tech-focused upstarts. With a hundred thousand employees working under orders of “do SRE,” definitions and labels matter.
To figure out some answers, we tried to reconcile the varied perspectives in the client company with what we know from industry, government, and mid-2000s Google, the original Eden where (we think) the “SRE” jargon word was invented. First of all, if this brand is going to be ready for export, it needs an answer to why a non-tech business cares.
The SRE books and presentations at conferences are largely case studies of new companies that were able to build their organizations and technology around best practices from the early 2000s. They don’t help CEOs that don’t care about technical debt or the hipness of their serving stack. Arguments based in dollars are a better bet. We decided that at its core, the value proposition of “SRE” is the idea that you can handle an exponentially growing business with a logarithmically growing payroll. Any business at any size gets excited at the idea of increasing revenue by a big amount while increasing costs by a small amount. At least in a profit-oriented context, we think this is what “SRE” is for. If someone wanders down from the mountains with new stone tablets, and the new tablets lead to better businesses that run more efficiently, then that new thing is what “SRE” is now. The words are generic after all–“site reliability engineering,” with no (TM).
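As a rough illustration of that value proposition, here is a toy model. Every number in it is an invented assumption (revenue doubling each year, a starting staff of 10, two extra hires per doubling of revenue), not a figure from this client or anyone else. It only shows what the curve looks like if headcount tracks the logarithm of revenue.

```python
import math


def revenue_after(years, base_revenue=1.0, annual_growth=2.0):
    """Hypothetical exponential growth: revenue multiplies by annual_growth each year."""
    return base_revenue * annual_growth ** years


def payroll_headcount(revenue, base_revenue=1.0, base_staff=10, staff_per_doubling=2):
    """Hypothetical logarithmic payroll: a fixed number of hires per doubling of revenue."""
    doublings = math.log2(revenue / base_revenue)
    return base_staff + staff_per_doubling * doublings


if __name__ == "__main__":
    for year in range(6):
        revenue = revenue_after(year)
        staff = payroll_headcount(revenue)
        print(f"year {year}: revenue x{revenue:.0f}, staff {staff:.0f}")
```

Under these assumptions, revenue grows 32-fold over five years while headcount only doubles, from 10 to 20. Put differently, payroll grows in a straight line over time while the business compounds.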
Knowing what SRE is for makes it easier to decide who is and is not Doing It Right. Looking at this company and others, we saw a few patterns that were not generating any progress in the direction of higher scale at lower cost.
One is the impulse to “standardize” tech stacks across diverse lines of business. We don’t claim that this is never the right answer, but it’s not effective at this company’s size and stage of evolution. Like many others, it grew for decades in a decentralized way, with plenty of mergers and acquisitions. Each business unit is mature, which means that it spends almost all of its energy sustaining itself, with no more than a few percent left over for self-improvement efforts. Any change at all means you are trying to mobilize those few percent to overcome the inertia of the rest, and if you push “standardization” you are aiming at an outcome that will be, for them, pointless churn. Yet people step into this glue trap again and again, because if you only think about the computers, it’s obvious that all these different, fussy, bespoke things are just databases (or whatever), and it would be cheaper overall to consolidate them.
Efforts that set out for standardization tend to morph into an adjacent pattern that is even less effective, which is gatekeeping. Inevitably, the standardization push stalls (we’d put the over/under at about year 3 and 30% adoption), and someone decides it is time to get tough. No more SaaS subscriptions can be purchased outside of the CIO office, no more capacity can be provisioned in the on-premises datacenter, and so on. At this point you are taking that few-percent share of change agents, which has just shown you that it can’t complete the standardization push, and turning them into an active threat that must be resisted. The better plan is to just stop.
We have also noticed that it tends to be useless, and probably harmful, to focus on SLOs or incident response practices as the first step. These are ubiquitous topics at tech conferences, and will likely turn up later in a complete SRE transformation. But like technology standardization, they will create frustration and gatekeeping behavior if introduced before the right motivations and culture are in place.
However, we couldn’t make our entire recommendation out to be “stop hitting yourself.” Customers don’t like that. So we had to keep the thinking caps out and answer: if the above doesn’t work, how do you establish SRE in a heritage organization?
Stripped of the implementation details of a particular host company, we think that SRE Done Right is a culture with two core values: Smoothly running, highly reliable computer systems are good, and rote repetition in human jobs is bad. Some people are natural-born true believers, and others become true believers with exposure. (We’re not claiming either type is better, and we’re not sure which type we are ourselves.) But it appears that when a handful of people with these values end up with responsibility for a high-stakes serving system, and enough autonomy to do things, a culture recognizable as “SRE” emerges. It is like Unix–an evolved pattern of solutions to a repeating pattern of problems, that will re-evolve in as many contexts as it needs to.
Thus, if your goal is to have your organization Do SRE, it’s not a matter of training a bunch of people to follow a new three-ring binder. It is about creating the conditions for some accelerated evolution. Consolidate the people who are always complaining about systems stability and repetition in their job (you already know who they are), and do the following:
- Give them end-to-end responsibility for some business process that matters.
- Make the standards to which they will be accountable clear, with numbers if possible: for example, this batch process will complete on time 9 out of every 10 days.
- Make the resources and budget clear. Within reason, it doesn’t matter what the budget is, as much as it matters that everybody knows it. We’ve seen big successes on a $100,000 infrastructure budget, and big failures on a budget of $5 billion. Computers can do a lot for not much money these days, but they can also do a lot for a lot of money, and you don’t want factions fixating on different targets.
- Give them commensurate authority to make changes anywhere in the stack. This is necessary to remove all the places to hide. Otherwise, expect to be hearing in a year that the availability target was missed because of [problem owned by team X].
- Use social and organizational markers to set them apart in some low-key way. You’re trying to develop a counterculture, and you can’t be punk rock and also on the Billboard Top 100. It doesn’t work that way. Sometimes these are big things like a separate office or a new job classification. Sometimes they are small things like a sticker meme or using a Yubikey. Implementation heavily depends on the host culture, and it’s dangerous to get wrong, so be careful.
Then, don’t worry about the details. It isn’t important if you buy the right holy book, and it isn’t important if you pick the right tools with the right SRE branding. (Especially because we aren’t worrying about standardization at this phase, right?) Even doing everything right, success is not guaranteed, because it’s always possible that you don’t have the budget, or the talent, or–most common by far–the executive fortitude to see it through. But we think the above conditions are the key decisions that give you the best chance.
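For the numeric standard in the list above, the check itself can be very small. Here is a minimal sketch, assuming each batch run is recorded with a deadline and a finish time. The names `BatchRun`, `on_time_rate`, and `meets_target` are hypothetical, made up for this example, and the 0.9 default simply mirrors the “9 out of every 10 days” wording.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BatchRun:
    """One day's run of a batch process (hypothetical record shape)."""
    deadline: datetime
    finished: Optional[datetime]  # None means the run never completed


def on_time_rate(runs):
    """Fraction of runs that finished at or before their deadline."""
    if not runs:
        raise ValueError("need at least one run to compute a rate")
    on_time = sum(
        1 for run in runs
        if run.finished is not None and run.finished <= run.deadline
    )
    return on_time / len(runs)


def meets_target(runs, target=0.9):
    """True if the on-time rate is at or above the agreed target."""
    return on_time_rate(runs) >= target


if __name__ == "__main__":
    # Invented sample data: nine runs finish early, one never completes.
    deadline = datetime(2024, 1, 1, 6, 0)
    runs = [BatchRun(deadline, deadline - timedelta(minutes=5)) for _ in range(9)]
    runs.append(BatchRun(deadline, None))
    print(on_time_rate(runs), meets_target(runs))
```

The point of writing it down is less the code than the agreement: everyone can see what counts as “on time” and what the threshold is.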
If you see it through, the culture change will take root in your formative team, and you can grow it outward from there. A second tranche of your existing tech staff will want to join, once some successes become visible. It will also become easier to hire people that identify as “SRE,” because they will recognize a home.
Your team might call itself SRE or DevOps; it might decide it likes stickers with sharks or pirates or neither. It might invent a new buzzword and a new holy book. You won’t care, and neither would we, if it is accomplishing exponential scaling with logarithmic costs. Just let us know where to get the new holy book.