We cover resilience engineering & learning from incidents with John Allspaw, former CTO @ Etsy and current Founder & Principal @ Adaptive Capacity Labs! Co-hosted by Kenji Kiuchi (Head of Quality and Performance @ Postman), this episode also addresses common unintuitive perspectives within resilience engineering, strategies for effective incident response and problem-solving, how to identify current sources of resilience, and practical tips for implementing these resilience tactics in your organization today.
John Allspaw (@allspaw) has worked in software systems engineering and operations for over twenty years in many different environments. John’s publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation,” helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.
"The competitive advantage is not for a leader to say, ‘Why did it take so long to restore this issue or resolve this outage?’ A competitive advantage is, ‘Oh my God, that is amazing. Tell me what made this hard and what are any of the things that made it difficult to resolve? Is there anything I can do to help get out of the way for people to do the work?’"
- John Allspaw
Kenji Kiuchi (@dr_kiuchi) is Head of Quality and Performance at Postman, an API platform whose mission is to maximize everyone's creativity through the power of connected software. There he leads a global team with a focus on maximizing user delight and innovating the practice of testing. Before coming to Postman, he spent several years “helping people get jobs” at Indeed. There, he worked on scaling teams and practice to optimize engineering delivery, as well as leading Diversity, Inclusion and Belonging initiatives as an Associate Site Director. Prior to Indeed, Kenji spent several years as an Engineering Manager at Twitter, where he led Quality efforts across monetization, growth, infra and the delivery of live video. When Kenji isn’t driving engineering excellence, he’s driving his motorcycle, spending quality time with his 3 daughters, and mentoring leaders across the globe.
Check out our friends and sponsor, Jellyfish
To learn more about Jellyfish and how they can help you increase engineering satisfaction and create happier, higher-performing engineering teams...
- John’s perspective on production (4:27)
- What drove John toward resilience engineering (6:22)
- How complex systems relate to resilience engineering (9:23)
- Differences between robustness and resilience (13:13)
- The role of productive adaptation in resilience engineering (17:26)
- Identify sources of resilience already present in your organization (22:52)
- Examples of unintuitive perspectives involving incident analysis (27:15)
- How to make room for unintuitive perspectives (31:41)
- Practical tips for implementing resiliency tactics & understanding incidents (36:12)
- Rapid fire questions (39:51)
LINKS AND RESOURCES
- Learning From Incidents Conference 2023 - This is a forum for sharing stories of incidents, incident handling, and the learnings from software engineers who handle large-scale distributed software systems.
- Hindsight and Sacrifice Decisions - A blog post from Adaptive Capacity Labs reacting to the NYSE halting trading to resolve an issue.
- Using Language by Herbert H. Clark - Herbert Clark argues that language use is more than the sum of a speaker speaking and a listener listening. It is the joint action that emerges when speakers and listeners, writers and readers perform their individual actions in coordination, as ensembles. In contrast to work within the cognitive sciences, which has seen language use as an individual process, and to work within the social sciences, which has seen it as a social process, the author argues strongly that language use embodies both individual and social processes.
- Papers We Love Talk
- Visual Momentum