Slight Reliability

Stephen Townshend • www.buzzsprout.com

Technology

Learning SRE, one day at a time.

Last refreshed: August 21st, 2025 at 8:05 AM ⓘ


Episodes

Burnout with Colette Alexander (Episode 103)

Have you burned out at work? What was your experience? How did you work through it? This week I'm joined by the incredible Colette Alexander to discuss what burnout is, what it means, and we both share our personal experiences burning out at work. We cover... 🔥 What is burnout? ❓ Why does it happen? 🫀 What are the symptoms? 🥊 Fight, flight, or freeze 🧑‍🚒 Advice on how to recover ...and much more. Resources from the show... Why you're so angry at work (and what to do about it)...

Aug 12, 2025 • 39 min • Season 2, Ep. 103

Mobile Observability with Hanson Ho (Episode 102)

This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover... 📱 The mobile/backend observability divide ✍️ The challenge of distributed tracing on mobile apps 🌏 The entire device runtime environment matters for your app 👤 The quest for user-centric mobile observability ✅ Advice on how to get started with mobile observability ...and much more. You can find Hanson on: LinkedIn: https://www.link...

Jul 29, 2025 • 32 min • Season 2, Ep. 102

Intro to Resilience Engineering with Michelle Casey (Episode 101)

This week I'm joined once more by SRE leader Michelle Casey who gives a broad and shallow introduction to resilience engineering. We cover... 🏋️‍♀️ Reliability VS Robustness VS Resilience 🧩 What is a complex system? 🔢 Safety one/safety two 🧠 Mental models 😩 Human error ...and so much more. Resources from this episode: Four concepts for resilience (paper) by Dr. David Woods https://www.researchgate.net/publication/276139783_Four_concepts_for_resilience_and_the_implicati...

Jul 15, 2025 • 40 min • Season 2, Ep. 101

Learning with John Allspaw (Episode 100)

This week on the 100th episode I'm joined by DevOps and Resilience Engineering legend John Allspaw to talk about learning (especially from incidents). We discuss... 📒 Classroom VS situated learning 🤝 The myth of the perfect handover ITIL as a coping strategy to try and make sense of the organic, wild, and messy 🥕 How you cannot incentivise to avoid incidents (it doesn't work that way) ❤️‍🩹 You can't understand how something is broken unless you know how it's supposed to work i...

Jun 24, 2025 • 48 min • Season 2, Ep. 100

Focusing on What Matters with Trent Hornibrook (Episode 99)

This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, and beyond. We discuss... 🔌 Empowering engineers to implement change in your org 🧑‍🍼 Focusing on what matters (customer & business > technology) 👀 Not just adding more monitoring as the output of each PIR 😎 How autonomy can lead ...

Jun 03, 2025 • 29 min • Season 2, Ep. 99

The Root Cause Fallacy with Andrew Hatch (Episode 98)

This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore... 🌌 Is the root cause of every incident the big bang? 🦖 How the value of root cause degrades as complexity increases 🫣 That if the culture is not blameless, people will hide things 🌳 Alternative approaches to root cause analysis such as branching timelines 🙋 Getting someone without skin in the ...

May 20, 2025 • 32 min • Season 2, Ep. 98

Synthetic Monitoring with David Dick (Episode 97)

This week I'm joined by David Dick from 2 Steps to (finally!) discuss synthetic monitoring. We cover... 🤖 What is synthetic monitoring? 🦾 What are the benefits and drawbacks to using it? ☢️ Non-web based synthetics (the tough stuff) 🍹 Combining RUM and synthetics 🫢 Does synthetics need an OTEL-like framework? ...and much more. You can find David on: LinkedIn: https://www.linkedin.com/in/david-dick/ You can find more about 2 Steps at https://2steps.io/# You can find Stephen on:...

May 06, 2025 • 33 min • Season 2, Ep. 97

Tech Leadership with Milan Brown (Episode 96)

This week I'm joined by Cin7 Engineering Director Milan Brown to unpack the challenges of technology management and leadership. We discuss... ✖️ Theory X vs Theory Y management 🗣️ Intention based leadership and communication 🏢 Conditions in an org for people to thrive 😵‍💫 How do you learn to manage and lead? 🫤 Managing people when you're not an expert in what they do ...and much more. Resources mentioned during the episode: Turn The Ship Around! (book): https://davidmarquet.c...

Apr 23, 2025 • 31 min • Season 2, Ep. 96

Finding Tech Work with Leon Adato (Episode 95)

This week Leon Adato and I break down the state of applying for roles in tech. We cover... 📝 What a resume or CV is and is not 🤝 Leveraging your connections rather than relying on applying cold 🪄 How most job descriptions are works of fiction 🦾 White-fonting to game AI resume assessment 🧪 Experimental ways we could recruit ...and our pitch for Kubernetes the Rock Opera (and much more) You can find Leon's job postings weekly on his website: https://www.adatosystems.com/categor...

Mar 29, 2025 • 36 min • Season 2, Ep. 95

Getting a Start in SRE with Priyam Kumar (Episode 94)

This week Priyam Kumar shares his story of moving from a massive organisation to a startup and the challenges and growth that came from that. We discuss... 🪖 War stories and examples of production incidents 🩹 The "hacks" we build to keep things running (and how maybe that's just normal) 😎 Keeping it simple... YAGNI (You Ain't Gonna Need It!) 🧯 The perils of getting stuck in reactive mode 📖 Areas of learning if you want to get into SRE ...and much much more. You can find Pr...

Mar 22, 2025 • 31 min • Season 2, Ep. 94

SRE Leadership with Michelle Casey (Episode 93)

This week Michelle Casey shares her insights as a 'head of' engineering manager in the SRE context. This was one of my favourite conversations on the podcast so far. We cover topics such as... 🤷🏽 Why move into leadership? 👁️ Learning from other leaders 💎 What is unique about SRE leadership? 👑 Women in engineering leadership ...and we go through some feedback I got as a leader recently. Resources that Michelle mentions during the episode: The Five Dysfunctions of a Team (book)...

Mar 11, 2025 • 39 min • Season 2, Ep. 93

Observability Maturity with Ádám Tóth (Episode 92)

This week Adam and I get philosophical about what constitutes maturity in the field of observability. We tackle questions such as... 💸 Does your org treat observability as a cost centre or a value add? 🔥 Are you using observability reactively to solve problems? Or proactively to build better products and services? 👤 Is your observability connected to your users and business in a meaningful way? 🌐 Is monitoring the social media sentiment of your product part of observability? ....

Feb 25, 2025 • 30 min • Season 2, Ep. 92

Head in the Clouds (Episode 91)

In this episode I explore the challenges of achieving unified observability when integrating with SaaS products and services. I cover: 🌊 The new wave of mega-complex SaaS ⚗️ Challenges integrating SaaS with our observability pipelines 👩‍🦯 How the lack of SaaS autonomy limits the effectiveness of OpenTelemetry 💰 Paying twice to ingest, store, and search telemetry 📈 Monitoring and predicting SaaS observability costs ...and much more. Shout out to Mark Chiavaroli (and apologies ...

Jan 21, 2025 • 16 min • Season 2, Ep. 91

Non-Prod Reliability Engineering + 2024 Wrap (Episode 90)

This week I check in and give an update on work, life, and my attempts at bringing to life SRE practices in the world of non-production environment management. You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre YouTube: https://www.youtube.com/c/SlightReliability Instagram: https://www.instagram.com/slight_reliabili...

Dec 10, 2024 • 18 min • Season 2, Ep. 90

Slight Reliability Episode 89 - Blameless Post-mortems with Karanveer Anand

This week I'm joined by Karanveer Anand, SRE Technical Program Manager at Google, to discuss blameless post-mortems. We cover: 🦅 The recent CrowdStrike outage and their public post-mortem 🚑 When do we do a blameless post-mortem? 😕 How do we do a blameless post-mortem? ✅ How do we make sure action items are followed through? 📰 The power of learning from post-mortems created by other teams and orgs ...and much more. You can find Karanveer on LinkedIn: https://www.linkedin.com/in/...

Sep 03, 2024 • 26 min • Season 2, Ep. 89

Slight Reliability Episode 88 - OpenTelemetry Revisited with Zach Michel

This week Zach Michel from https://middleware.io/ and I discuss the state of OpenTelemetry and what it means to adopt it. We cover: 🌩️ Achieving observability in a SaaS world 🥫 Context propagation - the magic sauce of OTEL 🚪 The telemetry gateway concept and leveraging the OTEL collector 🪵 The state of OpenTelemetry logging 🫂 Making use of the OpenTelemetry community ...and much more. You can find Zach on LinkedIn: https://www.linkedin.com/in/zamichel/ You can find the offici...

Aug 27, 2024 • 27 min • Season 2, Ep. 88

Slight Reliability Episode 87 - Measuring the value of SRE with Artem Yakimenko

In Episode 80 Niall Murphy talked about the need for SREs to be better at articulating the value of our work. In this episode I'm joined by ex-Googler and Engineering Director (SRE) at Culture Amp Artem Yakimenko to discuss how we might achieve this. We discuss both quantifiable and qualitative approaches including leveraging the untapped data in support tickets, customer sentiment and rankings, the relationship between finance and performance, the link between user design and performa...

Jul 24, 2024 • 36 min • Season 2, Ep. 87

Slight Reliability Episode 86 - Evolving SLOs with Dom Finn

In the world of SRE we constantly talk about defining SLOs, but what about evolving them over time? This week I chat with SRE Tech Lead Dom Finn about just that. We cover the relationship between reliability and user analytics, latency classes as a way to speak SLOs with business stakeholders, the role of NFRs and how the thresholds differ from SLOs, and much more. Books mentioned in the episode: The Beginning of Infinity: Explanations That Transform the World by David Deutsch http...

Jun 08, 2024 • 26 min • Season 2, Ep. 86

Slight Reliability Episode 85 - Feeling SaaSsy

This week I talk about the impact of SaaS-first technology strategies on the work of an SRE. I pose questions about observability, ownership, on-call, and how much control we have over reliability. You can find the Bleeding Tech blog on Medium: https://medium.com/@stownshend You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre YouTube: https://www.youtube.com/c/SlightReliability Instagram: https://www.instagram....

May 02, 2024 • 11 min • Season 2, Ep. 85

Slight Reliability Episode 84 - Clinical Troubleshooting with Dan Slimmon

This week I chat with Dan Slimmon about applying the approach doctors use to treat patient symptoms during incident response. You can find Dan's blog at https://blog.danslimmon.com/ or connect with him on LinkedIn here: https://www.linkedin.com/in/danslimmon/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre YouTub...

Mar 30, 2024 • 28 min • Season 2, Ep. 84

Slight Reliability Episode 83 - An Unfulfilled Promise with Itiel Shwartz

This week I hear about all things Kubernetes from Komodor CTO and co-founder Itiel Shwartz. We chat about the promise that was made when Kubernetes first entered the industry, the challenge of getting developers engaged and capable of working in Kubernetes, my hate/hate relationship with Helm but its important contribution to the Kubernetes project, Kubernetes observability, and so much more. You can find the Kubernetes for Humans podcast here: https://komodor.com/blog/the-kuberne...

Mar 05, 2024 • 31 min • Season 2, Ep. 83

Slight Reliability Episode 82 - CI/CD with Amin Astaneh

This week I sit down and have a discussion with Amin Astaneh (from Certo Modo) about CI/CD. We cover the power of the standard change as a way to navigate ITIL while still implementing DevOps practices, what to monitor to make your CI/CD observable, single piece flow, testing in production, and so much more. You can find Amin on his company website https://certomodo.io , LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitter: https://twitter.com/aastaneh You can find the o...

Feb 13, 2024 • 26 min • Season 3, Ep. 2

Slight Reliability Episode 81 - Incident Management in Non-Prod Environments

"Environment issues are just incidents that happened to occur in a non-production environment"... so why do we treat them so differently? In this first episode of the 2024 season I reflect on how we handle incidents in non-prod environments. (Note: Had a few issues with noise suppression in OBS Studio cutting off the start of some words, will sort it for the next episode) You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/...

Feb 06, 2024 • 10 min • Season 3, Ep. 1

Slight Reliability Episode 80 - What's Been Bugging Niall Murphy

This week I speak with co-author of the original SRE book + the SRE workbook, and renowned speaker Niall Murphy. We chat about the state of SRE in the current macro-economic climate and how we're not yet doing a very good job at articulating the value of SRE to leaders, the relationship that velocity and reliability have, the value of new features versus reliability improvements, and *much* more. You can find Niall at: LinkedIn: https://www.linkedin.com/in/niallm/ X: https://twitt...

Nov 22, 2023 • 37 min • Season 2, Ep. 80

Slight Reliability Episode 76 - Sampling Distributed Traces with Paige Cruz

Paige Cruz (from Chronosphere) is back. This week we discuss sampling. What is sampling? Why do it? What kinds of sampling are there? You can check out Chronosphere's cloud native observability platform here: https://chronosphere.io/ You can find Paige on: LinkedIn: https://www.linkedin.com/in/paigerduty/ X: https://twitter.com/paigerduty You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.l...

Nov 21, 2023 • 45 min • Season 2, Ep. 76

Slight Reliability Episode 79 - Incident Story Time with Valeska Victoria

This week Valeska Victoria returns to share some of her experiences working as an SRE at eBay. We look at the cascading effect of production issues in complex integrated environments (how there's often no single root cause), developer literacy of how infrastructure works, the importance of ownership and accountability of reliability, and much more. You can find Valeska on: LinkedIn: https://www.linkedin.com/in/valeska-victoria/ You can find the official Slight Reliability podcast ...

Nov 20, 2023 • 38 min • Season 2, Ep. 79

Slight Reliability Episode 78 - Developer Experience with Ankit Jain

This week I chat with Ankit Jain from aviator.co about developer experience. We define developer experience and developer productivity, and how this applies to SRE. We discuss the growing expectation on developers and how this leads to frustration and burnout. We also explore how to measure developer experience and how to start working to make improvements. You can check out Aviator's developer experience platform here: https://www.aviator.co/ You can find Ankit on: LinkedIn: http...

Nov 16, 2023 • 32 min • Season 2, Ep. 78

December 2023 Update

A brief mid-week update on my changing circumstances and the future of the podcast.

Nov 16, 2023 • 5 min

Slight Reliability Episode 77 - SRE to DevRel with Liz Fong-Jones

This week I had the privilege of interviewing Liz Fong-Jones from honeycomb.io about DevRel, Developer Advocacy, and how that applies to SRE. We discuss the difference between Developer Relations (DevRel) and Developer Advocacy, how Liz got into advocacy, how DevRel helps companies and the community, and some tips on how to get traction with SRE practices in your organisation. You can check out Honeycomb's observability platform here: https://www.honeycomb.io/ You can find Liz on:...

Nov 15, 2023 • 32 min • Season 2, Ep. 77

Slight Reliability Episode 75 - Enterprise SRE with Steve McGhee

This week I had the honour of chatting with Steve McGhee (former Google SRE, current Google Reliability Advocate, and co-author of Enterprise Roadmap to SRE). We discuss the evolution of SRE from where it began at Google and how it is being adopted by enterprises around the world now (and why this is happening). We talk about getting leadership support and how we get reliability taken seriously, the lies we tell ourselves to justify incidents and issues, leveraging transformation ...

Nov 14, 2023 • 39 min • Season 2, Ep. 75

Hosted on Buzzsprout
