The Engineering Leadership Podcast · Episode 42

Reflections on Incidents & Resilience

with Nick Rockwell

May 25, 2021

Nick Rockwell, SVP of Engineering & Infrastructure @ Fastly, shares his recent reflections on incidents, resiliency, blamelessness, and accountability. You’ll hear why the heroic model of incident response is unsustainable, how to improve reliability by closing the long-feedback loop, plus opportunities to maximize post-mortems for process improvement AND emotional processing.

SPEAKER

Nick Rockwell - SVP Of Engineering & Infrastructure @ Fastly

Nick Rockwell is SVP of Engineering & Infrastructure @ Fastly, helping build the next-generation edge infrastructure for a faster, safer, more resilient Internet. Nick was formerly Chief Technology Officer at The New York Times, overseeing product engineering, infrastructure and R&D. Previously he was Chief Technology Officer of Conde Nast, and Digital CTO at MTV Networks. Throughout his career, Nick has worked at the intersection of media and the Internet, building digital products at scale. Nick graduated from Yale in 1990 with a B.A. in Literary Theory.

"We started doing a biweekly meeting. We talk about resilience. We revisit everything that has not been closed, whether it's a year old, or it's a day old. We're forced to keep coming back to it. So (I'm thinking about) how to move away from that incident based post-mortem to something that's more like a continual revisiting of every thread or pathway that's been opened until they're not even open anymore...”
- Nick Rockwell

Special thanks to our exclusive accessibility partner Mesmer!

Mesmer's AI-bots automate mobile app accessibility testing to ensure your app is always accessible to everybody.

To jump start your accessibility and inclusion initiative, visit mesmerhq.com/ELC

Join us for "Hiring In the New Normal!"

The landscape of hiring has completely changed this past year. To better equip engineering leaders to navigate these new challenges, we’re hosting a FREE one-day “mini-summit” on June 11th to bring together leaders and experts across the industry to share insights


Show Notes

Nick’s story of why incidents, resiliency, accountability & blamelessness are top of mind (2:20)
The “heroic model” of incident mitigation and its emotional impact (6:41)
Building a resilient system & transitioning away from heroics to a more mechanistic incident management model (12:12)
“The long feedback loop” of incidents (15:57)
Grappling with the risks of a more process-driven, mechanistic model of incident management (21:27)
Dedicated vs. distributed incident response teams & how incident management evolves over time (24:43)
Balancing individual accountability and a culture of blamelessness (28:37)
Why you need to talk about incidents and process their residual emotions (33:12)
On maximizing post-mortems for process improvement & emotional processing (37:01)
Takeaways (40:15)

Transcript

Nick’s story of why incidents, resiliency, accountability & blamelessness are top of mind

So I know some of the topics that we wanted to cover are resiliency and balancing both a blameless culture and accountability. To get into the topic a little bit more, I would love to learn: how come this topic is front of mind for you? How come this is something that has surfaced as important to think about?

Nick Rockwell: Well, part of the answer is that it always ends up being front of mind. If it isn't front of mind at a given moment, just wait a little while and it will be. 'Cause something's going to happen, you know?

And it's been a theme for me, as it probably has for most people with fairly diverse or long technical careers. It's been a theme for me throughout my career. And I actually could reference, you know, my time crawling around under traders' desks as a good start, where I learned, at a certain level, how to try to fix things once and not have to come back and get yelled at by that particular trader again.

Even before that, I spent my teenage years working on my father's farm. You know, I was just thinking about this earlier today, and we learned some bitter lessons in resilience in that environment as well. The rain's coming, the baler breaks. What do you do?

And then, you know, in my early years at SonicNet, I was obviously on call all the time, often as the sole developer or the only person who understood the systems at all. There were some really challenging, dark moments, honestly.

But most recently I've been back thinking heavily about resilience... I'm at Fastly now. As you know, I run Engineering and Infrastructure at Fastly, which is a cloud platform. We're a CDN, but now we're a security company as well, and we have a number of other applications that run on our edge cloud.

And, you know, we're in the resiliency business. And yet, it's a challenge for us like it is for anyone. And in January we had a pretty significant outage.

The very worst kind, in that it was global and affected all of our systems for a short period of time. We had a swift resolution, but it was a very, very painful experience. And in a company like Fastly, that's really the right word. It caused everyone a good deal of pain, emotional pain and shame. And I want to talk about that.

But that's what's focused me right now and, I now say resilience is my theme. My sole theme, my sole focus for 2021. It's probably going to be my focus for 22 and it'll probably be my focus for 23 as well. I think that'll be my focus while I'm at Fastly, because it's so central to what we do.

You know, so that was a refocusing moment, and I've been trying to really think about it from the bottom up again, to re-examine the way I think about resilience... and in this context, I'm coming at it as a generalist, not an expert.

We have many smart people who know a good deal about how to make distributed systems resilient. And I'm thinking about it from the ground up in the context of like where all the tendrils of resilience lead. And it turns out they lead all over the place.

You know, they lead to process. They lead to culture. They lead to your business model. They lead to the emotional contract you have with each other in the company. They lead to the kinds of relationships you build with your customers. They lead all over. So, that's what I've been occupying myself with specifically.

Patrick Gallagher: I didn't know that you grew up on a farm.

Nick Rockwell: Partially.

Patrick Gallagher: The story you teased about the hay baler... I'm in North Idaho with my fiance's family, and they live on about five acres. We had a wind storm and 17 trees fell down, and that was a big lesson in having to manage your property and then clean it all up, because six months later there was a second wind storm. Another series of trees fell, and all of them were close to damaging the different buildings and stuff on the property. So when you're sharing that story, I definitely relate to that.

And I think for a lot of the engineering leaders, like the shame and the emotional challenges around outages, I think is something that a lot of people can relate to because it is a painful process...

With all the different experiences you've had, I was wondering if you could share some of the cascading effects that you've observed, or the implications of when those types of outages happen. Or, if you don't repair the hay baler on the farm, the cascading effects that happen if people don't have appropriate resilience or processes in place to support them when these big issues happen.

The “heroic model” of incident mitigation and its emotional impact

Nick Rockwell: I'll talk about that in two ways. First, in terms of the simple core insight about how to use continuous improvement to build resilience, and the way that tends to break down, or one of the ways it breaks down. And then I'll talk about it in emotional terms. And I'll tell a little story from my early days.

Again, when we talk about failure, I like to talk about my own failures rather than those of others, because those are the ones that I certainly have the most insight into.

The simplest core insight about how do you create resilience and how do you combat operational failure is in creating you know, a continuous improvement environment based on, creating the right kinds of feedback loops.

And then, you know, typically it's like: something goes wrong, you investigate why, and you need to do a number of things to make sure that investigation will be fruitful. And that has to do with creating psychological safety, as we all know. And then, hopefully, you mitigate the problem, but you also continue, and you hunt, and you find the root cause. And then you feed mitigation of the root cause back into your development process.

One of the problems that I see is that, let's say, early in a company's life cycle... there's just no way you're going to be anywhere close to the kind of process and structure that you ultimately want to get to, to create real resilience. There's so much that goes into that. There's no way that you can be very far along that journey.

So what you're going to rely on is basically the heroic model. You're going to rely on key people on your team who just care a lot. And like, feel it, you know, when there's a problem and are going to dive right in and they're going to work like hell until it's fixed. Nobody would make it past that stage without that kind of heroism. So I'm not gonna say that's a bad thing. I'm going to say it's a necessary thing.

One of the side effects of which is that it creates an emotional cycle of crisis and relaxation, basically. Relaxation, I mean, not in the sense of, "Oh, I'm so relaxed now," but more in the sense of tension and relaxation, you know.

So inevitably everybody rallies on the crisis in the initial mitigation, and that actually drains energy from the long-term root cause analysis and the long-term mitigation. And if you couple that with the fact that also, "Hey, you're trying to build features and ship value to your customers..." it makes that closing of the long loop really, really difficult. And that, I think, is actually the core resistance that organizations have to break through in order to get to the real way that you build a resilient product, architecture, network, culture, one that really can reach a maximum state of effectiveness.

So that's an interesting thing to talk about a little bit. But then let me talk a little bit about some of my personal experience.

When I was at SonicNet, you know, I was again really the only person, even as I started to build the team, who really understood the systems. So I was on call, and I was really the one who had to deal with just about anything that happened. And this is super shameful, but there was a problem that I let sit around for at least six months, maybe it was more like seven or eight months. And I knew exactly what was happening. It had to do with the way we served ads: we basically logged a bunch of stuff related to the way we served ads in memory. For various other reasons, our application would periodically run out of memory. So we would end up dumping a bunch of impressions on the floor. Which cost us money. That was money out of our pocket.

And I was struggling with this problem. At first it was a little thing, and it wasn't that bad; it would happen now and then. And then, as these things tend to do, it got worse and worse, and there was more and more of it. And I was really in an absolute panic.

And I was so stuck, and I cared a lot. I cared about this company. I was really trying my hardest. And I often think back as to how I got so stuck in that situation, and why it was so hard for me to share what was happening, really honestly, with my management and the people around me, and really get the help that I needed to sort this problem out.

And part of the answer was that it was actually inseparable from the responsibility that I felt, and the guilt that I felt over this problem, and the pressure that I felt to solve it. Those things actually were contributing to my inability to break through this feeling, basically, of shame.

And when I finally got through the other side of that, I thought about it a lot. And I was like, "This is the thing that can't happen. I could never let this happen to someone else in my organization. And how are we going to do that?"

So, you know, that was a formative experience for sure, and one that I think about to this day.

Patrick Gallagher: Wow. To understand the emotional impact of dealing with big outages or challenges like that. And then your ability to recognize that and say, well, like what, that's not gonna happen to my organization. I think is such a powerful experience.

For the last few weeks, one of the biggest problems that engineering leaders I've been in conversation with have mentioned is: "We have an under-resourced team. We have one senior engineer who's the only one who knows the whole system. And if they leave... then that's it. We have no idea how things are working, and if we experience an outage... we are just going to be so delayed in figuring out how to fix it, it's going to be incredibly painful..."

Do you have any insight or advice on how to set up a structure, a process for those people that are reliant on that one senior engineer who knows everything and they don't have any other people to turn to when a system experiences an outage?

Building a resilient system & transitioning away from heroics to a more mechanistic incident management model

Nick Rockwell: Oh sure. I have tons of advice. And the problem is the solutions are obvious. And yet they're extremely difficult to actually implement when you're in that situation. When you're living in the world of scarcity and you're trying to get your business up and running, and you're also dealing with these kind of growth pain type issues. So, I mean, the answer is straightforward, and it's: you can't have that situation. You can't have that high of a bus factor on someone.

So that person's number one responsibility has to be exporting the knowledge that they have to others on the team.

Now, there may not be others on the team who actually can receive that knowledge. So then you have to go get some. You may need another, equally senior engineer who can come in and really absorb that information.

But again, all these things are pretty easy to say and pretty hard to do. So it's advice, but I don't know how useful it is.

But I think inevitably, it's one of the survival gates of your company in that situation. Like if you are unable to find a way, you're probably not gonna make it.

But the other thing that that sort of indicates is, I think, that the companies that make it are the ones that pass that threshold. And the reason that they do is because there's enough of a sense of mission, right at the outset, that that key person doesn't leave. They don't leave. I think a lot about that sense of mission when I think about these things, because operational life can't be compensated for sufficiently. Basically, the level of stress that person you're describing is under... and the amount of time, whether it's actual time or virtual time in the sense that they could be paged at any moment and therefore always have to be living in that kind of mindset and that head space... There's no amount of money that makes that a good way to live. You only do it because (a) you have a certain personality type and (b) you're really invested in the mission that you're involved in.

And again, what is so unfortunate about that is you need that to survive. And yet at a certain point in your journey, that's exactly the thing that you have to reject. Again, the heroic model breaks down and becomes your limiting factor at some point, and you have to transition to an absolutely emotion-free, mechanistic model. You know, where it's all process. It's all of these processes that you relentlessly follow that turn that flywheel of continuous improvement.

And I'm still thinking about this and wondering, "Am I right about this or not?" I'd love to hear what you guys think, but that's my working theory at the moment. And we can talk a little bit about why. And I think it's at quite a large level of scale that that becomes an appropriate thing to be trying to effectuate. And not that many companies ever really get there at all.

But I've been thinking a lot about that transition, how that works, and how you jump from one mode, which is, you know, the heroic mode, to this other mode where it almost takes choice, or human agency, out of the picture. But it's that kind of relentless and sort of unavoidable turning of the crank, or unavoidable feedback loop, that gets you actually to the next level.

Patrick Gallagher: I'd absolutely love to hear how you think about making that transition, going from the heroic model to that next, more mechanistic model. With the transition piece, I imagine people struggle to figure out: what's my next step? What's my first step to begin making that conscious choice?

Jerry Li: Yeah, especially from your company, Fastly, because it's providing a service so critical, and the requirement on resilience and availability is so high. I think people will love to hear what processes you have around that.

“The long feedback loop” of incidents

Nick Rockwell: Well, that's exactly right about Fastly. And this is something I've thought about, too. And when I say that, not that many companies necessarily even ever make that transition. It's because resilience doesn't mean the same thing to every company. Obviously.

And this is kind of what I meant when I alluded to the tendrils of resilience: when you start pulling on them and following them, they lead even to business model. Sometimes that's an industry question: the thing we do. Hey, it's important, but it's not out of work where it's like, somebody else is super important.

But often it's business model. And I spent a lot of my time in the, media business and in an ad supported model. That's a place where a fair amount of failure is actually pretty tolerable. And almost at least for the first 20 years of the internet, almost expected. You know, like it was kinda like pretty typical, like the way the internet worked. When you look at the history of even some of the bigger companies and I think of Twitter, for example, where, they went through long stretches of being terrible. It didn't kill them.

But obviously for a company like Fastly, it's completely different. And beyond the fact that what we do is critical, and the level to which our customers rely on us, resilience is actually our product. We sell resilience, and we're part of the resilience strategy of our customers. We're not even just one of the things that makes them resilient. We're part of that strategy, like a critical leg in that structure. So we have no choice but to battle through to that stage.

So what makes it hard and what does it look like?

To me, it comes back to that question of closing the long feedback loop. There's a short one, where your monitoring metrics go crazy, something's broken, you fix it, metrics go back to normal. Think of that as the short feedback loop.

The long one is you follow through, you discover, you find the root cause, which is rarely like, "Oh, I found the root cause!"

It's usually like, "Here are the 22 factors that together we can call the root cause, that are out of balance in some way." It tends to be much more complicated.

But anyway, then feeding that back in, not even just to development. I'm not even talking about, "Hey, we found this root cause and there's a bug and we need to get fixing that bug onto the roadmap."

It's more like feeding back into the architecture. So it's a really long loop. It's like feeding back into the architectural thinking, having that then flow through your whole development process all the long time cycles that rearchitecting requires. So these might be resolutions that take a year or more than a year to actually effectuate. So that's what I mean about the long feedback loop.

And it's really, really hard to sustain the energy on that long feedback loop. And the way that it happens, it doesn't happen through the heroic energies of individuals. It happens through a process with a long memory and many defenses to keep itself from being derailed.

And that, again, requires a level of process investment that transcends the influence of a head of engineering, you know... transcends the influence of a principal engineer on the team... transcends even the local imperatives of the business at a moment in time. It has to become something like gravity or the movement of tectonic plates.

It actually has to become this process that, honestly, you probably couldn't stop if you wanted to. Which is (a) dangerous to set up and (b) difficult to set up. But that's, I think, the transition that needs to take place.

So it comes through, again, painfully replacing that culture of heroism with a process-heavy approach that really relies on rigor, on balances, to make sure that the process stays intact.
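As a rough illustration of the "process with a long memory" described above, the sketch below models a register of incident follow-ups in which every unresolved item comes back onto the agenda at every review, however old it is. This is a minimal sketch under stated assumptions, not Fastly's tooling: the class names, fields, and the example follow-up titles are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class FollowUp:
    """One open thread from an incident (hypothetical structure)."""
    title: str
    opened: date
    closed: Optional[date] = None


@dataclass
class ResilienceRegister:
    """Keeps every follow-up until it is explicitly closed."""
    items: List[FollowUp] = field(default_factory=list)

    def open_thread(self, title: str, opened: date) -> FollowUp:
        item = FollowUp(title, opened)
        self.items.append(item)
        return item

    def close_thread(self, item: FollowUp, closed: date) -> None:
        item.closed = closed

    def agenda(self, today: date) -> List[str]:
        """List every unresolved item, oldest first. Nothing ages out."""
        still_open = sorted(
            (i for i in self.items if i.closed is None),
            key=lambda i: i.opened,
        )
        return [
            f"{i.title} (open {(today - i.opened).days} days)"
            for i in still_open
        ]


# Made-up follow-ups: one a year old, one a day old.
register = ResilienceRegister()
register.open_thread("Rearchitect the config rollout path", date(2020, 5, 1))
register.open_thread("Add an alert for the new failure mode", date(2021, 5, 10))
for line in register.agenda(date(2021, 5, 11)):
    print(line)
```

The design choice worth noticing is that `agenda` has no cutoff or priority filter: the only way an item leaves the list is an explicit `close_thread`, which is what keeps a year-long rearchitecture item in view alongside yesterday's fix.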

Grappling with the risks of a more process-driven, mechanistic model of incident management

What are some of the cultural norms or process norms that you've seen work to help sustain that long feedback loop, and contribute to the transition from heroism to the process-oriented culture?

Nick Rockwell: I'm at a certain point in my journey here, and I'm going to say that I've observed things that I think work. But I'm still in sort of a theory mode to some degree. And we're testing some of this stuff out. It's dangerous stuff. And it's dangerous stuff because, again, when I talk about process that is difficult to change, we're talking about the building blocks of bureaucracy. And we're talking about, again, processes that cannot easily be overridden, so it's dangerous.

And if you get it wrong, you can harm the organization in many ways. But those are the kinds of ingredients that we need to see.

We need to see bureaucracy in the sense of that adversarial structure, where it's like, "This team requires this team to do this, and that team is requiring this team to do this, and that's depending on this other team doing that..." so that they're all kind of leaning up against each other. And a particular leader in one part of the system, thinking, "I want to change things," actually can't change the whole system and potentially derail that process. But that's dangerous, because if you got it wrong, you're pretty stuck.

And you know, this is one of those places where we also know what the negative side effects look like. You think about large organizations that have been obsessed with safety. You might think about NASA. Or you might think about aviation, or even medicine. And you start to see the kind of price that gets paid for this level of reliability.

And it's a price in terms of... well, say in aviation, it takes the form of external regulation. We're talking about internal regulation. But that is sort of equally difficult to unwind. Similarly in medicine, we talk about conservatism that stems from the liability model. That makes it extremely difficult to change things in medicine.

So one of the things I'm grappling with is: what are the strategies you can use to counter those risks? The risks of creating a machine that absolutely keeps turning and, you know, drives continuous improvement. It does make you more and more reliable, but at a price you're unwilling to pay in the long run. It actually becomes too much of a drag on innovation and change.

We have a toolkit that we use to balance changing a system and keeping it stable to a degree. Those are the toolkits of software development. But one question I have is, I don't know how they work, or how they might break down, in this world where the investment and the inertia are in this basically bureaucratic structure that's keeping that continuous improvement playbook going. I don't know how the processes that ensure safe change in software development break down there. Because most of those processes have been generated in companies at an earlier stage in their development.

So that's one of the things I'm thinking about right now.

Jerry Li: What are some of the tactics you can share with our audience in terms of how you set up your on-call rotation? How do you divide the responsibility? And what are the processes in place to ensure coverage, so that it's not one person, it's the whole team? And a process to make sure things will be covered in extreme situations?

Dedicated vs. distributed incident response teams & how incident management evolves over time

Nick Rockwell: The first hard part is that there aren't enough people, right? You don't have enough people to create an on-call rotation that isn't fairly barbaric. Then, the next problem is you've started to invest in resilience because you're having problems, but you're too early in that process to get the benefits. So, you're getting called a lot!

And what might be a manageable rotation on a relatively stable system is unmanageable because it's so constant and relentless, and nobody's getting any sleep, and so on.

And then the other problem we've already talked about a little bit, which is: how do you get the knowledge out of one person's head? A lot of times you have, hey, there's four people on the rotation, but only one of them actually knows how to fix anything. So the other three people just call that one person. You know, how do you break out of that problem?

And I think the solutions, again, are easy to describe and hard to actually implement. The first one is: get more people. But what I would say there is it can be hard. It can be hard to make that investment, and to convince perhaps your management to make that investment.

And yet you just have to. One of the things we always wrestled with is: what's the balance between a team that's specialized in dealing with problems and the teams that built the software in the first place? And how do you balance those things? And I myself vacillate on that question a lot. And I think a company growing through a number of phases actually probably should tack back and forth on that one a little bit.

And I think if stage one is, we just have a few people who are doing their best, and actually we're not in a great place, the next step is to set up a team that does first-level support. And that team is going to get the first page, and then hopefully over time, as you're able to give them more knowledge, they can deal with more situations and become a buffer.

So it's just more efficient to staff one team centrally that supports many teams and handles that first level of support. What I don't like about that model is how much it relieves actual development teams of the burden of dealing with the problems that they create. So that always kind of rubs me a little bit the wrong way. And I want to get back to a situation where teams do their own support. At a certain level of scale, you can go back to that, actually. Because then each team is sort of big enough that they can support a proper on-call rotation. So I think you tack back in that direction.

In the long run, I'm becoming convinced that they have to be separated again. And this is more like the SRE model, where you have a highly capable team that is responsible for operations and actually has all of the skills of the development team. Maybe it is even more experienced, and is able to not just react, but to actually drive reliability throughout the systems.

But that is a separate team. There's a lot of precedent for that, but that's the model that you get back to. So I think there's a little bit of a motion back and forth there, between this like dedication and consolidation around an operational team and this distribution of responsibility.

I think the real challenge... one of the reasons it's so hard to create a great SRE team is that it's the generalization of that problem. The team that built the software has an advantage... you know! They know the things that they know. But to have a team that works with software in general, software that they did not have a hand in building, that they may or may not be experienced with, and that can still generalize the approach and the process and be effective... That's a difficult team. That's a very experienced team that, you know, takes a while to mature. But it's very powerful when it can be done.
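The knowledge-concentration problem raised earlier in this answer (a rotation where only one person can actually fix anything) can be made visible with a very small check. The sketch below is an assumption-laden illustration, not a process described in the episode: the function names, the people, and the systems are all hypothetical, and the `who_can_fix` mapping is something a team would have to maintain by hand.

```python
from typing import Dict, List, Set


def sole_experts(who_can_fix: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each system that exactly one person can fix to that person."""
    return {
        system: next(iter(people))
        for system, people in who_can_fix.items()
        if len(people) == 1
    }


def gaps_by_person(
    rotation: List[str], who_can_fix: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """For each person on the rotation, list systems they could not fix alone."""
    return {
        person: sorted(
            system
            for system, people in who_can_fix.items()
            if person not in people
        )
        for person in rotation
    }


# Hypothetical team: four people on call, one of whom knows almost everything.
who_can_fix = {
    "ad-serving": {"ana"},
    "edge-config": {"ana"},
    "billing": {"ana", "ben"},
}
rotation = ["ana", "ben", "chi", "dev"]

print(sole_experts(who_can_fix))
print(gaps_by_person(rotation, who_can_fix))
```

Systems returned by `sole_experts` are where knowledge transfer would need to start; the per-person gaps show which shifts would still end with a call to the one person who knows the system.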

Patrick Gallagher: As you were sharing, Nick, there was a dilemma that has sort of come up about balancing both the individual responsibility and accountability of the teams and also cultivating the culture of blamelessness. And I think that different engineering teams approach the cultural norms differently there.

And so I was wondering, what are your thoughts about how you balance both people taking individual responsibility and accountability for some of these resiliency issues and also cultivating a culture of blamelessness, where you're able to optimize and maximize the learnings that come out of it in a way that still preserves that culture, psychological safety, and orientation towards learning?

Balancing individual accountability and a culture of blamelessness

Nick Rockwell: It's a really interesting question. And it's one that I've been thinking about a lot. I want to be careful, because I don't want to be misunderstood on this. I guess I would say the most important thing, the thing that everyone should remain focused on, is that search for the truth. We're here to find the truth. We're here to get to the root of what's actually happened here. And that needs to be our guide. All of our guides. Whether I'm on the front lines, or whether I'm an executive. The thing that perhaps I'm tempted to say is that in engineering circles, we've swung a little too far towards protecting psychological safety. And that's what I want to not be misunderstood on...

I think if your concern is psychological safety, if you're concerned with protecting the people on the front lines, which is really important, you still have to remain focused on that search for the truth. That still has to be your North Star.

I've seen situations where those things have been in conflict, where the desire to protect people and create a safe space and bring that critical ingredient to the search for the truth has actually obscured the truth. It's actually prevented us from finding the root cause. And that can't be acceptable. We can't accept that.

So how do we reconcile those things? I think there's a couple of ways. I think, one, we have to understand that trust is always a function of time, and that periodically you have to extend credit in the form of trust, and then you have to observe and see what happens. And if your trust is violated, then that's bad and you've learned something. But hopefully it's not; hopefully it's preserved.

And I guess what I mean by that is: sometimes the pursuit of truth takes us into dangerous territory, and we have to trust that we're going to stay true to our principles. We're not going to revert to punishment, for example. But we're going to have to feel uncomfortable while we do that. And then, related to that, we need to preserve the safety of the people on the front lines. But we're going to have bad feelings. We will feel bad because there was a failure here. There was a failure of some kind. Whether or not it was caused by an individual or some unexpected, unpredictable systemic effect, there was a failure. And we should feel bad about the failure.

And again, I would say that's true while we're in this heroic phase where we're basically relying on emotion to get us through this work. And that context is where 99% of organizations are. We're going to feel bad. There are going to be bad feelings. I'm going to feel bad as an executive.

You know, the people on the front lines are gonna feel bad. The business is going to feel bad. We're going to feel bad that we let our customers down. And we can't protect ourselves from those bad feelings. We actually need them in that mode, again, where we're relying on emotion to, like, drive us. We can't avoid it. So that shouldn't be our expectation.

We should have an expectation that, if you made a mistake and it was not negligent or deliberate or, you know, malicious, you will be protected. You're not going to get fired. You're not going to get punished. But you probably will still feel bad. Nobody wants to feel bad, so that actually is still pretty challenging.

But those are a couple of things that I think about. And then, as we try to push through to that other state, we actually don't feel bad anymore. Because in a lot of ways, at the timescale that we're working at, we can't sustain those kinds of emotions over that timescale anyway. Right?

And that's one of the problems with feeling bad and wanting to feel better. That drives us to fix something in the moment, but it doesn't sustain us over an 18-month re-architecture cycle.

So, in other words, there's more evidence that as you get to that higher state, we actually have to leave the emotional motivators behind. We have to find other motivators. What I see as the best candidate for a successor is this almost automatic response, where it's almost like, this is just what we do. It's almost deep cultural, or almost, I want to say, like, religious: this is just what we do!

So when we get there, that's nice. We get a break from this constant rollercoaster ride of bad feeling. But unfortunately, in the meantime, it's part of how it works.

Patrick Gallagher: For the teams that are still sort of in transition from a heroic mode to a more methodical process... how do you process the residual emotions around those types of incidents?

Why you need to talk about incidents and process their residual emotions

Nick Rockwell: That's a great question... and the answer is: many different ways, in different contexts, for different people, in different organizations. But the important thing is to process them. I think one of the things we've been doing is we've been talking a lot more about the incidents that we have.

And there was some resistance. There continues to be some resistance to that. Maybe we don't want to talk about them too much because we don't want some people to hear about it, whether they're people inside the company or our customers, people outside the company. Maybe we don't want to talk about it because we don't want to make someone feel bad who was involved. And we don't want to undermine our psychological safety.

I think it's really important to talk about the incidents. And I think it's important to do it in a group setting. And I think it's important to do it carefully, to maintain focus on the right things and to not have it ever be about blame or public shame. That's never what we're after. What we are after is processing those emotions. And we do that for a couple of reasons. One is that if we don't process them, frankly, they're traumatic to a degree.

And that's a big word. And there's levels of trauma, but technically they're traumatic. If we don't deal with them, if we don't talk about them, they're still there. They're just not being confronted or named or shared in any way. So everybody's carrying that burden in different ways, without any support from the community, or much less support.

And everyone has a different, sort of necessarily distorted view. People might blame themselves too much. They might blame someone else. But everyone has an imperfect understanding of what actually happened...

So I think it's really important as a community to talk through the incidents. Again, in a carefully constructed way. This is a structured conversation. But I think it's necessary so that it can be processed. We can move it out of this subconscious trauma to something that we've explored and worked through a little bit.

And that also frankly, supports the learning process. And supports it in potentially unexpected ways. Sometimes the learnings are not just "Why did the system fail?"

They're like, "Why did this person think this thing, which caused them to take this action when there was a more appropriate action to take? What in our communication, or even in our relationships, contributed to this incident?"

So, difficult to do. I think the exact way and the modality to go about it probably vary in different organizations at different stages, but finding a way is definitely part of what I feel is my job as a leader in a resilience-obsessed organization.

Patrick Gallagher: The image you gave, of people carrying the emotions with them regardless of whether you address them or not, I think that's so true. And I think it extends beyond just the emotions around an incident. It's also the emotions people feel around life that's happening to them outside of work.

And that if you are failing to address those or create conversations or space around those, then the weight is going to be there and it's going to impact people's work regardless. And so you might as well do your best to acknowledge and support people through those conversations.

Nick Rockwell: Absolutely. And that starts to point towards an even bigger form of resilience, and a question around resilience that's obviously been on all of our minds for the last 12 or 13 months. Bigger topic, maybe a topic for another day, but there's a lot of interesting parallels and relationships to be explored there.

Patrick Gallagher: One other topic I was hoping we could cover, Nick, before we wrap up. You know, since we're kind of at the close of our conversation, it felt most natural to talk about the postmortems of outages and incidents towards the end of our conversation.

And so I was wondering if you could share some of the things that you've learned about postmortems, especially when you're talking about, how do you maximize the learning for both closing the short loop and closing the long loop?

I'm wondering if you could share just a little bit about what you've been thinking about postmortems and how to maximize those opportunities, both from a process standpoint and from an emotional standpoint?

Maximizing post-mortems for process improvement and processing emotions

Nick Rockwell: Yeah, separate questions. The process answer and the emotional answer are separate.

The emotional component has to have closure. And that closure is always temporary. It's, like, until the next incident, but it's necessary. Our point here is to say what we have to say, feel what we have to feel, and move on. I don't think we structure that as a post-mortem conversation so much, 'cause in that conversation our primary goal is not to unpack what happened. It's just really to relate how we felt while whatever happened happened. You know, and to compare and share our experiences. So I think that's necessary.

What I worry about with the problem-solving and root-cause-seeking part of this is that there's too much closure in the typical post-mortem that we go through. It's always going to be focused on closing the short loop. And that's one of the reasons it's actually harder for us to maintain momentum on the longer loop.

So I don't have an answer here yet, and I'd love to hear, perhaps from listeners or others out there, what they've tried. I'm thinking in terms of, you know, how do we replace the incident-driven post-mortem with something that is more durable and cyclic?

And one of the things I'm thinking about is that we started doing a biweekly meeting, you know, in which we talk about resilience. We talk about it through a number of different lenses. But one of the ways is we revisit everything that has not been closed. Whether it's a year old or it's a day old, whether we're in the short loop or the long loop, we're forced to keep coming back to it. I don't trust the closure in the diagnostic process, 'cause I think we tend to close things off way too early. And I'm trying to think how to keep them open longer, or allow things to be partially closed but partially open: this part we understand, that part we still don't know what happened. Or that part we've resolved, this part we haven't resolved. And we can't even start resolving it for another six months, until this other thing is done.

So how to move away from that incident based post-mortem to something that's more like a continual revisiting of every thread or pathway that's been opened until they're not even open anymore.

So those are the lines I'm thinking along.

Patrick Gallagher: That's, like, a really powerful industry philosophical question to pose and to float out to people to think about and reflect on as we wrap up our conversation.

Nick, I just want to say thank you. I mean, I'm reflecting on all of the different things that we've covered in this conversation. And I feel like it's been parts of many things. It's been part philosophy: how to think about and approach incidents, outages, and resiliency from a thought-process perspective. And then also diving into some of the actual operational components and things to consider there. And I think it's been a really powerful balance between the two.

Nick Rockwell: Thank you. I've really enjoyed it. Weirdly, I could talk about this all day and I have 10 other kind of things I didn't get to though that I'd love to talk about another time!

It's a very deep and rich topic. And partly because we live it so viscerally. You know that if you're in the business long enough, you start to really come back to it again and again.

So I really enjoyed the conversation.
