
Reliability Enablers

Ash Patel & Sebastian Vietz
Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.

read.srepath.com

Episodes

#37 An SRE Approach to Managing Technology Risk

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspective. We'll cover how it's very different to the typical IT risk management mindset. Here are key takeaways from our conversation: Embrace Risk with Velocity: Rather than being hindered by traditional governance models and change approval boards, consider embracing risk while maintaining development velocity. Strive to find a balance between risk ma...

Apr 16, 2024 · 30 min

#36 Avoiding Critical Platform Engineering Mistakes

Platform engineering is replacing SRE and DevOps. Jokes aside, knowing the path to better platforms is key. Abby Bangser is the right person to tell us how to achieve greater maturity in this aspect of software operations. She's previously held SRE roles and currently works as Principal Engineer at Syntasso, the company behind the popular Kratix platform framework. Abby highlighted the need for concrete definitions and maturity models in platform engineering trends, cautioning against equating d...

Apr 09, 2024 · 27 min

#35 Boosting Your Observability Data's Usability

The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected? Richard Benwell thinks we have room for improvement, especially at the usage stage where we query and visualize the o11y data. He is the founder and CEO of SquaredUp, a dashboard software company based out of Maidenhead, UK with over 10 years of experience in the monitoring space. Richard highlighted the importance of converging human intuition with technical o11y impleme...

Apr 02, 2024 · 35 min

#34 From Cloud to Concrete: Should You Return to On-Prem?

This episode continues our coverage of Chapter 2 of the Site Reliability Engineering book (2016). We talk about the age-old debate of cloud vs on-prem, which is analogous to the other familiar debate in technology: build vs buy. Here are key takeaways from our conversation: Adapt your storage solutions to business needs: Understand the diverse storage options available and tailor them to specific business needs, considering factors like data type, access patterns, and scalability requireme...

Mar 26, 2024 · 23 min

#33 Inside Google's Data Center Design

This episode covers Chapter 2 of the Site Reliability Engineering book (2016). In this first part, we talk about the intricacies of data center design outlined in the book. One thing is for sure. Building a data center for your own needs is HARD work with many considerations you must make. Here are key takeaways from our conversation: Importance of understanding data center fundamentals: Even if you're not operating at the scale of companies like Google, understanding the fundamentals behind da...

Mar 19, 2024 · 23 min

#32 Clarifying Platform Engineering's Role (with Ajay Chankramath) BONUS EP

Will Platform Engineering replace DevOps or SRE or both? I don’t think this is the case at all. Neither does Ajay Chankramath. He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. I’d take his word for it since he’s held senior leadership roles in release engineering and more since 2002. In this bonus episode of the SREpath podcast, Ajay shared his perspective on the debate about SRE vs DevOps vs Platform Engineering. This is a public episode. If y...

Mar 14, 2024 · 17 min

#31 Introduction to FinOps (with Ajay Chankramath)

FinOps is on the tip of many tongues in the software space right now, as we try to curb our cloud costs. Ajay Chankramath has given talks on FinOps at conferences like the DevOps Enterprise Summit (DOES) among others. He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. His peers like Martin Fowler and Neal Ford have originated ideas like refactoring, microservices, and more. He shared practical advice for avoiding a harsh, restrictive cost control...

Mar 12, 2024 · 27 min

#30 Clearing Delusions in Observability (with David Caudill)

Observability is going through interesting times. David Caudill believes that delusions are getting in the way of our success in this area. He's a senior engineering manager at Capital One, a US-based bank. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Mar 07, 2024 · 37 min

#29 - Reacting to Google's SRE book 2016 (Chapter 1 Part 2)

Sebastian and I continue our breakdown of notable passages from Chapter 1 of Google's Site Reliability Engineering (2016) book by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al. We covered passages like: Monitoring is one of the primary means by which service owners keep track of a system's health and availability. Efficient use of resources is important anytime a service cares about money. Humans add latency, even if a given system experiences more actual failures. A system that can avoid em...

Feb 27, 2024 · 31 min

#28 - Reacting to Google's SRE Book 2016 (Chapter 1 Part 1)

Sebastian and I got together to react to and discuss 5 passages from Chapter 1 of Google's Site Reliability Engineering book (2016) by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al. We covered passages like: The sysadmin approach and the accompanying development/ops split have a number of disadvantages and pitfalls. Google has chosen to run our systems with a different approach. Our Site Reliability Engineering teams focus on hiring software engineers to run our products. The term DevOps emerg...

Feb 20, 2024 · 26 min

#27 - Growing as a Site Reliability Engineer (Part 3)

Third and final instalment of the Growing as an SRE series, covering practical ideas for planning your career progression. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Feb 13, 2024 · 16 min

#26 - Growing as a Site Reliability Engineer (Part 2)

In part 1, we covered the first truth: that you don't grow in your career merely through tenure. That was a simple one. Let's explore 2 more truths that are somewhat trickier... Background music credit: Luna by KaizanBlue. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Feb 08, 2024 · 19 min

#25 - DORA and the Pursuit of Engineering Excellence (with Tim Wheeler)

DORA metrics are a hot topic among technology executives in all kinds of enterprises. But there's more to engineering culture than relying solely on the numbers they give you. We have a rare treat for you because Ash got Tim Wheeler on the pod. He doesn't do much social media or many podcast episodes. Tim is Director of Engineering Excellence at SquaredUp, where he follows the DORA metrics but emphasizes starting conversations around them rather than setting directives. This is a public episode. If yo...

Jan 30, 2024 · 38 min

#24 - Growing as a Site Reliability Engineer (Part 1)

How can you grow as an SRE? You've probably thought about your career progression at some point. Ash put together his initial thoughts on this topic. Listen on to learn how he unpacks the first idea of "You don't get promotions with tenure". This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Jan 23, 2024 · 9 min

#23 - The Danger of Unreliable Platforms (with Jade Rubick)

Jade Rubick needs no introduction in the reliability and observability space. He was VP of Engineering at New Relic from 2010 to 2019. It was my pleasure to take on his non-obvious ideas on managing expectations with teams, especially platform-based teams. We had a few spicy ideas to dive into. We also touched on topics like enhancing engineering practices, DORA metrics, and so much more. Be sure to listen all the way through to learn Jade's amazing insights. This is a public episode. If you wou...

Jan 16, 2024 · 29 min

#22 - How Google does SRE Consulting (with Yury Niño Roa)

I did not know that Google itself does consulting around its SRE practices. This is not a sponsored episode LOL! I wanted to talk with my SRE friend, Yury Niño Roa, about her drawings and SRE ideas, but we dove into a whole lot more than that. We spoke about her work at Google's PSO office, the antipatterns she's seen, and a whole lot more. Listen in for an engaging conversation. You can follow Yury and her amazing drawings via: https://www.linkedin.com/in/yurynino/ This is a public episode. If ...

Jan 09, 2024 · 36 min

#21 - Better SRE in 2024 is all we can hope for

Sebastian is back for this episode to help set out direction for 2024. We reflected during the holidays on the problems SREs faced in 2023 in terms of job insecurity, burnout, and "that really shouldn't be my sole job". Sebastian and I talked about what we hope to bring to the community in 2024 to make SREs and SRE teams stronger, happier, and healthier at their work. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.sr...

Jan 02, 2024 · 32 min

#20 Holiday Special with Stephen Townshend

Join Ash Patel and Stephen Townshend for a friendly chat about what they've learned in SRE as 2023 comes toward a wrap! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Dec 19, 2023 · 30 min

#19 How to Develop Early Career Engineers (with John Hyland)

Ash Patel talks with John Hyland who ran the Ignite Program at New Relic, which is dedicated to developing early career engineers. John shares insights about driving better outcomes for the organization and the early career professionals who join them. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com...

Dec 12, 2023 · 41 min

#18 Winning at SRE in Banking and Telecom (with Troy Koss)

Ash Patel talks with Troy Koss, Director of SRE at Capital One, an early adopter of DevOps and SRE in the banking sector. He shares insights on working in regulated industries like banking and telecom, drawing on his early work experience at Verizon, a US telecom. Troy shares his thoughts on building stronger SRE individual contributors and emphasizes the importance of education as pivotal to ongoing reliability success. This is a public episode. If you would like to discuss this with other ...

Dec 05, 2023 · 35 min

#17 Lessons from SRE's Wild West Days (with Rick Boone)

Ash Patel talks with Rick Boone who is a pioneer in SRE, having been an early AppOps engineer at Facebook and Uber's first SRE hire. He shares amazing stories from those pioneering days. Rick also draws from his experience to share his insights on how to build stronger SRE teams, as well as support effective career progression for individual contributor SREs. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com...

Nov 27, 2023 · 46 min

#16 Acing Cloud Infra in Digital Media Giant (with Sreejith Chelanchery)

Ash Patel interviews Sreejith Chelanchery, who is SVP of Delivery and Infrastructure Engineering at Dotdash Meredith. Sreejith shares his journey from programming analyst in Bangalore, India, to now being an executive responsible for platform engineering, DevOps, and SRE at a media giant in New York City. He gives a glimpse into how his team saved his organization over $9 million in cloud computing costs, how they started an internal developer platform well before Backstage was around, and more. Sr...

Nov 21, 2023 · 39 min

#15 Growing Reliability Engineering Across 5+ Companies (with Nash Seshan)

Ash Patel talks with Nash Seshan, who has supported reliability work in over 5 organizations, including Cisco, eBay, Dropbox, Lyft, Netflix, and Wayfair. He shares his learnings from reliability work at these big brands. Nash also draws from his experience as co-founder of a Y Combinator-funded startup on effective engineering leadership. He also gives his take on issues with ill-conceived automation. This is a public episode. If you would like to discuss this with other subscribers or get acces...

Nov 14, 2023 · 43 min

#14 Faster Incident Resolution through Data-Driven Notebooks (with Ivan Merrill)

Ash Patel talks with Ivan Merrill of Fiberplane about wrangling the big data that incidents and systems generate through collaborative notebooks. Ivan also touches on how open-source tools like Autometrics enable deeper observability of code by increasing the granularity of data used for incident response and retrospectives. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com...

Nov 07, 2023 · 42 min

#13 Making Sense of OpenTelemetry and Observability (with Adriana Villela)

Ash Patel talks with Adriana Villela (CNCF Ambassador, OpenTelemetry contributor, and senior developer advocate at Lightstep) about the promise of OpenTelemetry for observability teams, as well as the challenges of doing it right. She also touches on engineering leadership topics, recalling her experience as a leader of platform engineering and observability teams. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepa...

Oct 31, 2023 · 33 min

#12 From Incident Firefighting to Reliability First (with Robert Ross)

Ash Patel talks with Robert Ross of FireHydrant about his experience in offering incident management software to SREs and other software incident responders. Highlights include defining the broader concept of reliability, making smarter choices for handling incidents, and more. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com...

Oct 24, 2023 · 29 min

#11 Rising to Staff Engineer in DevOps and SRE (with Rajesh Reddy N)

Ash Patel interviews Rajesh Reddy N about his experiences as a senior DevOps and SRE individual contributor. Rajesh shares his insights on having systems to minimize alert fatigue, the importance of security in DevOps, and more. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

Oct 17, 2023 · 27 min

#10 Using AI for Kubernetes troubleshooting self-service (with Kyle Forster)

Ash Patel interviews Kyle Forster of RunWhen about his experiences as an ex-Google director helping SREs and running an AI-based company that supports Kubernetes troubleshooting. Their conversation will cover themes like enabling junior SREs, the role of SRE in shift-left, and handling misaligned incentive models in organizations. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com...

Oct 10, 2023 · 24 min

#9 Inside Booking.com's Site Reliability Engineering practice (with Samuele Tonon and Yoann Fouquet)

In this episode of the SREpath Podcast, Ash Patel interviews two SRE managers from Booking.com, Samuele and Yoann, to gain insights into their experiences and strategies for developing a successful SRE practice within a large organization. Yoann is a senior manager responsible for managing SRE teams and serves as the SRE Craft lead. Samuele is an SRE engineering manager working in the Big Data department and manages a team of eight to nine people. Yoann officially began his journey in SRE in 20...

Oct 02, 2023 · 29 min

#8 Software Reliability Ninja Who is NOT an SRE (with Pablo Bouzada)

Ash Patel interviews Pablo Bouzada about his beliefs on software reliability as a non-SRE leader. They discuss the importance of effective leadership to drive effective reliability changes in the software system, as well as the challenges of providing reliable service within video streaming giant, ViaPlay. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com...

Sep 11, 2023 · 23 min