I know it’s already six months into 2025, but we recorded this almost three months ago. I’ve been busy with my foray into the world of tech consulting and training and, well, editing these podcast episodes takes time and care. This episode was prompted by the 2025 Catchpoint SRE Report, which dropped some damning but all-too-familiar findings: * 53% of orgs still define reliability as uptime only, ignoring degraded experience and hidden toil * Manual effort is creeping back in, reversing fiv...
Jul 01, 2025•30 min
Most teams talk about reliability with a margin for error. “What’s our SLO? What’s our budget for failure?” But in the energy sector? There is no acceptable downtime. Not even a little. In this episode, I talk with Wade Harris, Director of FAST Engineering in Australia, who’s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it. What makes this episode different is that Wade isn’...
Jun 17, 2025•28 min
Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions. It's been a hot minute since the last episode of the Reliability Enablers podcast. Sebastian and I have been working on a few things in our realms. On a personal and work front, I’ve been to over 25 cities in the last 3 months and need a breather. Meanwhile, listen to Ruchir Jha from Cardinal, an interesting vendor working on the cutting edge of o11y to help reduce costs fr...
Jan 28, 2025•21 min
Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability, currently concentrating on mobile observability. Drawing on his experience at AWS and New Relic, he’s vocal about the need for more user-focused observability, especially in mobile, where traditional practices fall short. * Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to...
Nov 12, 2024•29 min
Andrew Fong’s take on engineering cuts through the usual role labels, urging teams to start with the problem they’re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It’s a values-first, practical approach to tackling tough challenges that ...
Nov 05, 2024•36 min
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
Oct 22, 2024•38 min
Here’s what we covered: Defining Platform Engineering * Platform engineering: Building compelling internal products to help teams reuse capabilities with less coordination. * Cloud computing connection: Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas. Ankit’s career journey * Didn't choose platform engineering; it found him. * Early start in programming (since age 11). * Transitioned from a product engineer mindset to b...
Oct 01, 2024•31 min
Why many copy Google’s monitoring team setup * Google’s Influence. Google played a key role in defining the concept of software reliability. * Success in Reliability. Few can dispute Google’s ability to ensure high levels of reliability and its ability to share useful ways to improve it in other settings. But there’s a problem: * It’s not always replicable. While Google's practices are admired, they may not be a perfect fit for every team. What is Google’s monitoring approach within teams? Here’s...
Sep 24, 2024•8 min
Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It’s a challenge that’s been around since the beginning of software development and will persist for years to come. The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. This interrupts workflows, affects personal time, and even disrupts sleep. Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pag...
Sep 17, 2024•8 min
The question then condenses down to: Can technical leads support reliability work? Yes, they can! Anemari has been a technical lead for years (including a few at the coveted consultancy Thoughtworks) and now coaches others. She and I discussed the link between this role and software reliability.
Sep 10, 2024•32 min
We're already well into 2024 and it’s sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seems to be turning into one of those problem areas. Not at every organization, but some places are turning it into a case of “hitting metrics” without caring for the underlying capabilities and conversations. Nathen Harvey is no stranger to this problem. He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate ...
Sep 04, 2024•27 min
We’ll explore 3 use cases for monitoring data. They are: * Analyzing long-term trends * Comparing over time or experiment groups * Conducting ad hoc retrospective analysis Analyzing long-term trends You can ask yourself a couple of simple questions as a starting point: * How big is my database? * How fast is the database growing? * How quickly is my user count growing? As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward q...
Aug 27, 2024•11 min
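The starter questions in the episode above (how big is my database, how fast is it growing) can be answered from periodic size samples. Here is a minimal sketch, assuming you already collect such samples; the function name, the data shape, and the numbers are hypothetical and not taken from the episode:

```python
from datetime import date


def growth_per_day(samples):
    """Return average growth per day between the earliest and latest sample.

    samples: list of (date, size) tuples; this shape is an assumption
    made for illustration only.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples)  # sort by date so input order does not matter
    first_day, first_size = ordered[0]
    last_day, last_size = ordered[-1]
    elapsed_days = (last_day - first_day).days
    if elapsed_days == 0:
        raise ValueError("samples must span at least one day")
    return (last_size - first_size) / elapsed_days


# Hypothetical monthly database sizes in GB.
samples = [
    (date(2024, 6, 1), 100.0),
    (date(2024, 7, 1), 130.0),
    (date(2024, 8, 1), 161.0),
]
rate = growth_per_day(samples)
print(f"average growth: {rate:.2f} GB/day")
```

The same calculation works for user counts or any other slowly changing quantity. It only compares the endpoints, so it is a rough trend estimate rather than a fitted model.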
Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. He’s dedicated much of his speaking time at DevOps events to a topic less often covered at such technical events. A lot of what he said alluded to ways to become a more valuable engineer. I’ve broken them down into the following areas: * Avoid the heroic efforts * Mind + heart > Mind alone * Curiosity > Credentials * Experience > Certifications * Thinking for ...
Aug 20, 2024•37 min
Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because processes supporting the incident response are not robust. Incident response software alone isn't going to fix bad incident processes. It's gonna help for sure. You need these incident management tools to manage the data and communications within the incident. But you also need to have effective processes and human-techno...
Aug 15, 2024•10 min
According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something completely different. SRE is not for setting up the IT function. It is for enabling the product organization to operate online services reliably at scale. However, the problem is that many in the industry are NOT using SRE principles but instead handing over complex services to a more traditional IT function. Dr. Vladislav Ukis is well qualified to talk ab...
Aug 13, 2024•29 min
Sonja Blignaut is a complexity expert. That might not sound relevant to incident response in reliability engineering. But it is! Our systems are becoming more complex and so are the resulting incidents. Learning about complexity can help reliability folk go into an incident with less anxiety, which we’ll explore in this episode. We'll explore the causes of complexity in incidents and how the Cynefin framework classifies incidents. We'll also deep dive into the concept of complexity itself and d...
Aug 06, 2024•37 min
Have you got complete monitoring of your software in effect? Are you sure? Google's SREs break monitoring down into white-box versus black-box monitoring. It's not the same as internal versus external monitoring, which we'll explore further. We'll cover topics like: - (quickly) What is monitoring? - What is white-box monitoring? - What is black-box monitoring? - The rising importance of black-box monitoring This is a concept from Chapter 6 (Monitoring Distributed Systems) of the Google SRE (2016) bo...
Jul 30, 2024•10 min
Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data. We crammed into just under 25 minutes ideas like these 7 takeaways: * Reasserting the Need to Monitor Four Golden Signals: Focus on latency, traffic, errors, and saturation for effective system monitoring and management. * Prioritize Customer Health: in Jack’s words, the 5th golden signal. Go beyond traditional metrics to monitor the health of your customers f...
Jul 09, 2024•25 min
Alert noise is no joke and neither is the fatigue that results from it. I spoke with Dan Ravenstone who gave a talk at Monitorama about this very topic. He also happens to be an avid skateboarder! Here are 9 takeaways from our conversation: * Regularly Review and Update Monitoring Systems: Don’t set up monitoring once and forget about it. Continuously assess and update your monitoring systems to ensure they remain relevant and effective. * Focus on Relevant Alerts: Ensure your alerting system ...
Jul 02, 2024•30 min
Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil. We hit the jackpot with concepts like: * what is toil according to a 5-point set of criteria * why even care about toil? * where you can find toil in your software system * Google’s goal for how much work (%) should be toil * the fact that toil isn’t always all that bad Don’t have time to listen to what we learned or added to the concepts? Check out the takeaways toward the...
Jun 25, 2024•44 min
The common refrain after an incident is “We could and should learn from this”. To me, that alludes to the need for a robust learning culture. We might think we already have a good learning culture because we talk about problems and deep-dive into them in retrospectives. But how often do we explore the nuances of how we are learning? Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (softwar...
Jun 18, 2024•29 min
I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book, about team topologies suitable for reliability teams. In this second part, we will talk about platform teams. A quick refresher on what platform teams do In the team topologies context: Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity. They achieve this directive by abstracti...
Jun 11, 2024•24 min
I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, in a 2-part series about 2 of the most relevant team topologies for reliability work. In this first part, we will talk about enabling teams. A quick refresher on what enabling teams do In the team topologies context: Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas. This kind of team is available to provide exp...
Jun 04, 2024•25 min
Bonus episode on SLOs because Sebastian and I felt that we did not cover the why of SLOs or how to make them relevant to stakeholders.
May 30, 2024•20 min
This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs. Here are 5 takeaways from the show: * Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once. * Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations ...
May 28, 2024•32 min
In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016). Here are 7 takeaways from the show: * Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the se...
May 21, 2024•29 min
No one wants to get Coinbase’s $65 million observability bill in the future. Sure, observability comes with a necessary cost. But that cost cannot exceed the concrete and perceived value on balance sheets and in the minds of leaders. Sofia Fosdick shares practical insights on curbing high observability costs. She’s a senior account executive at Honeycomb.io and has held similar titles at Turbonomic, Dynatrace, and Grafana. Like always, this is not a sponsored episode! We tackled the cost issue by c...
May 14, 2024•25 min
Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, enabling many developer teams to take on better practices in observability. He’s a senior systems engineer at IKEA and is part of its observability enabling team. Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenges of switching them out of this. You can conn...
May 07, 2024•28 min
Chaos Engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. His experiences with it drove a reduced number and severity of serious incidents and outages. He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen the evolution of banking technology from archaic to user-centric, where incidents are taken seriously. Ananth highlighted ...
Apr 30, 2024•25 min
This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs behind reliability and choosing not to do it well or at all. Here are key takeaways from our conversation: * Prioritize Risk Mitigation: Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively. * Consider Cost-Effectiveness: ...
Apr 23, 2024•24 min