Reliability Enablers

Ash Patel & Sebastian Vietz•read.srepath.com

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.


Last refreshed: July 15th, 2025 at 6:55 AM

Episodes

#66 - Unpacking 2025 SRE Report’s Damning Findings

I know it’s already six months into 2025, but we recorded this almost three months ago. I’ve been busy with my foray into the world of tech consulting and training. And, well, editing these podcast episodes takes time and care. This episode was prompted by the 2025 Catchpoint SRE Report, which dropped some damning but all-too-familiar findings: * 53% of orgs still define reliability as uptime only, ignoring degraded experience and hidden toil * Manual effort is creeping back in, reversing fiv...

Jul 01, 2025•30 min

#65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability

Most teams talk about reliability with a margin for error. “What’s our SLO? What’s our budget for failure?” But in the energy sector? There is no acceptable downtime. Not even a little. In this episode, I talk with Wade Harris, Director of FAST Engineering in Australia, who’s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it. What makes this episode different is that Wade isn’...

Jun 17, 2025•28 min
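
For a sense of scale on the "budget for failure" question in episode #65, here is a minimal sketch of how much downtime per year an availability target still permits. The helper name `allowed_downtime_hours` and the list of targets are illustrative assumptions and do not come from the episode.

```python
# Hypothetical helper (not from the episode): yearly downtime that an
# availability target still permits, assuming a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed by the given target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; pick the ones that matter for your system.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} h/year")
```

Under these assumptions, a 99.9% target still leaves room for roughly 8.76 hours of downtime a year, which is the kind of margin the episode title calls a liability for critical systems.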

#64 - Using AI to Reduce Observability Costs

Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions. It's been a hot minute since the last episode of the Reliability Enablers podcast. Sebastian and I have been working on a few things in our realms. On a personal and work front, I’ve been to over 25 cities in the last 3 months and need a breather. Meanwhile, listen to this interesting vendor, Ruchir Jha from Cardinal, working on the cutting edge of o11y to help reduce costs fr...

Jan 28, 2025•21 min

#63 - Does "Big Observability" Neglect Mobile?

Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he’s vocal about the need for a more user-focused observability, especially in mobile, where traditional practices fall short. * Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to...

Nov 12, 2024•29 min

#62 - Early YouTube SRE shares Modern Reliability Strategy

Andrew Fong’s take on engineering cuts through the usual role labels, urging teams to start with the problem they’re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It’s a values-first, practical approach to tackling tough challenges that ...

Nov 05, 2024•36 min

#61 Scott Moore on SRE, Performance Engineering, and More

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Oct 22, 2024•38 min

#60 How to NOT fail in Platform Engineering

Here’s what we covered: Defining Platform Engineering * Platform engineering: Building compelling internal products to help teams reuse capabilities with less coordination. * Cloud computing connection: Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas. Ankit’s career journey * Didn't choose platform engineering; it found him. * Early start in programming (since age 11). * Transitioned from a product engineer mindset to b...

Oct 01, 2024•31 min

#59 Who handles monitoring in your team and how?

Why many copy Google’s monitoring team setup * Google’s Influence. Google played a key role in defining the concept of software reliability. * Success in Reliability. Few can dispute Google’s ability to ensure high levels of reliability and its ability to share useful ways to improve it in other settings. BUT there’s a problem: * It’s not always replicable. While Google's practices are admired, they may not be a perfect fit for every team. What is Google’s monitoring approach within teams? Here’s...

Sep 24, 2024•8 min

#58 Fixing Monitoring's Bad Signal-to-Noise Ratio

Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It’s a challenge that’s been around since the beginning of software development and will persist for years to come. The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. This interrupts workflows, affects personal time, and even disrupts sleep. Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pag...

Sep 17, 2024•8 min

#57 How Technical Leads Support Software Reliability

The question then condenses down to: Can technical leads support reliability work? Yes, they can! Anemari has been a technical lead for years — even spending a few years doing that at the coveted consultancy, Thoughtworks — and now coaches others. She and I discussed the link between this role and software reliability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com...

Sep 10, 2024•32 min

#56 Resolving DORA Metrics Mistakes

We're already well into 2024 and it’s sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seems to be turning into one of those problem areas. Not at every organization, but some places are turning it into a case of “hitting metrics” without caring for the underlying capabilities and conversations. Nathen Harvey is no stranger to this problem. He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate ...

Sep 04, 2024•27 min

#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards

We’ll explore 3 use cases for monitoring data. They are: * Analyzing long-term trends * Comparing over time or experiment groups * Conducting ad hoc retrospective analysis Analyzing long-term trends You can ask yourself a couple of simple questions as a starting point: * How big is my database? * How fast is the database growing? * How quickly is my user count growing? As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward q...

Aug 27, 2024•11 min
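
The starter questions in episode #55 ("How big is my database? How fast is the database growing?") can be answered with very little code once size samples are being collected. What follows is a rough sketch under stated assumptions: the sample data, the 200 GB capacity, and the function names are made up for illustration and are not taken from the episode.

```python
# Hypothetical example (not from the episode): estimate database growth
# from periodic size samples and project when a capacity limit is reached.

def average_daily_growth(samples):
    """samples: list of (day_number, size_gb) tuples, sorted by day."""
    (first_day, first_size), (last_day, last_size) = samples[0], samples[-1]
    return (last_size - first_size) / (last_day - first_day)


def days_until_full(samples, capacity_gb):
    """Return days until capacity is hit, or None if there is no growth."""
    growth = average_daily_growth(samples)
    if growth <= 0:
        return None
    return (capacity_gb - samples[-1][1]) / growth


# Made-up monitoring samples: (day, database size in GB).
size_samples = [(0, 100.0), (30, 130.0), (60, 160.0)]

print(average_daily_growth(size_samples))    # average GB added per day
print(days_until_full(size_samples, 200.0))  # days left before 200 GB
```

This uses a simple first-to-last average, which is enough for a first look at a trend. A real analysis would likely use more samples and account for seasonality.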

#54 Becoming a Valuable Engineer Without Sacrificing Your Sanity

Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. He’s dedicated much of his talk time at DevOps events to talk about a topic less covered at such technical events. A lot of what he said alluded to ways to become a more valuable engineer. I’ve broken them down into the following areas: * Avoid the heroic efforts * Mind + heart > Mind alone * Curiosity > Credentials * Experience > Certifications * Thinking for ...

Aug 20, 2024•37 min

#53 What's Missing in Incident Response Processes?

Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because processes supporting the incident response are not robust. Incident response software alone isn't going to fix bad incident processes. It's gonna help for sure. You need these incident management tools to manage the data and communications within the incident. But you also need to have effective processes and human-techno...

Aug 15, 2024•10 min

Can ITIL Benefit from Site Reliability Engineering?

According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something completely different. SRE is not for setting up the IT function. It is for enabling the product organization to operate online services reliably at scale. However, the problem is that many in the industry are NOT using SRE principles but instead handing over complex services to a more traditional IT function. Dr. Vladislav Ukis is well qualified to talk ab...

Aug 13, 2024•29 min

#52 Navigating Complexity within Incidents

Sonja Blignaut is a complexity expert. That might not sound relevant to incident response in reliability engineering. But it is! Our systems are becoming more complex and so are the resulting incidents. Learning about complexity can help reliability folk go into an incident with less anxiety, which we’ll explore in this episode. We'll explore the causes of complexity in incidents and how the Cynefin framework classifies incidents. We'll also deep dive into the concept of complexity itself and d...

Aug 06, 2024•37 min

#51 Whitebox vs Blackbox Monitoring

Have you got complete monitoring of your software in effect? Are you sure? Google's SREs break monitoring down to white box versus black box monitoring. It's not the same as internal versus external monitoring, which we'll explore further. We'll cover topics like: - (quickly) What is monitoring? - What is whitebox monitoring? - What is black box monitoring? - The rising importance of blackbox monitoring This is a concept from Chapter 6 (Monitoring Distributed Systems) of the Google SRE (2016) bo...

Jul 30, 2024•10 min

#50 Making Better Sense of Observability Data

Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data. We crammed into just under 25 minutes ideas like these 7 takeaways: * Reasserting the Need to Monitor Four Golden Signals: Focus on latency, traffic, errors, and saturation for effective system monitoring and management. * Prioritize Customer Health: in Jack’s words, the 5th golden signal. Go beyond traditional metrics to monitor the health of your customers f...

Jul 09, 2024•25 min

#49 Alert Fatigue is Still an Issue - Here's How We Fix it

Alert noise is no joke and neither is the fatigue that results from it. I spoke with Dan Ravenstone who gave a talk at Monitorama about this very topic. He also happens to be an avid skateboarder! Here are 9 takeaways from our conversation: * Regularly Review and Update Monitoring Systems: Don’t set up monitoring once and forget about it. Continuously assess and update your monitoring systems to ensure they remain relevant and effective. * Focus on Relevant Alerts: Ensure your alerting system ...

Jul 02, 2024•30 min

#48 Cutting Down "Toil" aka Manual Work in Software

Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil. We hit the jackpot with concepts like: * what is toil according to a 5-point criteria * why even care about toil? * where you can find toil in your software system * Google’s goal for how much work (%) should be toil * the fact that toil isn’t always all that bad Don’t have time to listen to what we learned or added to the concepts? Check out the takeaways toward the...

Jun 25, 2024•44 min

#47 How to Grow Team Impact Through Learning Culture

The common refrain after an incident is “We could and should learn from this”. To me, that points to the need for a robust learning culture. We might think we already have a good learning culture because we talk about problems and deep-dive into them in retrospectives. But how often do we explore the nuances of how we are learning? Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (softwar...

Jun 18, 2024•29 min

#46 Platform Team Design According to Team Topologies

I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book about team topologies suitable for reliability teams. In this second part, we will talk about platform teams. A quick refresher on what platform teams do In the team topologies context: Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity. They achieve this directive by abstracti...

Jun 11, 2024•24 min

#45 How Team Topologies Can Guide Enabling Teams

I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, to explain in a 2-part series about 2 of the most relevant team topologies for reliability work. In this first part, we will talk about enabling teams. A quick refresher on what enabling teams do In the team topologies context: Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas. This kind of team is available to provide exp...

Jun 04, 2024•25 min

#44 - Making SLOs Matter to Stakeholders

Bonus episode on SLOs because Sebastian and I felt that we did not cover the why of SLOs and make them relevant to stakeholders. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

May 30, 2024•20 min

#43 - SLOs: a Deeper Dive into its Mechanics

This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs. Here are 5 takeaways from the show: * Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once. * Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations ...

May 28, 2024•32 min

#42 - Hitting Software SLA Targets through SLOs and SLIs

In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016). Here are 7 takeaways from the show: * Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the se...

May 21, 2024•29 min

#41 Curbing High Observability Costs

No one wants to get Coinbase’s $65 million observability bill in the future. Sure, observability comes with a necessary cost. But that cost cannot exceed the concrete and perceived value on balance sheets and the minds of leaders. Sofia Fosdick shares practical insights on curbing high observability costs. She’s a senior account executive at Honeycomb.io and has held similar titles at Turbonomic, Dynatrace, and Grafana. Like always, this is not a sponsored episode! We tackled the cost issue by c...

May 14, 2024•25 min

#40 How to Enable Observability for Success

Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, enabling many developer teams to take on better practices in observability. He’s a senior systems engineer at IKEA and is part of its observability enabling team. Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenges of switching them out of this. You can conn...

May 07, 2024•28 min

#39 How Chaos Engineering Helps Reduce Incident Risk

Chaos Engineering is no longer a nice to have, as Ananth Movva explains in this episode of the SREpath podcast. His experiences with it drove a reduced number and severity of serious incidents and outages. He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen the evolution of banking technology from archaic to user-centric, where incidents are considered seriously. Ananth highlighted ...

Apr 30, 2024•25 min

#38 The Real Cost of Software Reliability & Downtime

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs behind reliability and choosing not to do it well or at all. Here are key takeaways from our conversation: * Prioritize Risk Mitigation: Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively. * Consider Cost-Effectiveness: ...

Apr 23, 2024•24 min
