In this episode we discuss uplifting telemetry knowledge within engineering teams to enrich their work (and their lives) with Paige Cruz from Chronosphere. We cover why not to take a chainsaw to your observability in order to cut costs, the dark side of auto-instrumentation, storytelling with live data, and much more. The book that Paige recommends at the end is "Effective Monitoring and Alerting for Web Operations": https://www.oreilly.com/library/view/effective-monitoring-and/9...
Feb 28, 2023 • 49 min • Season 2 Ep. 45
In this episode we discuss cognitive overload in SRE with Paige Cruz from Chronosphere. We cover what cognitive load is, what causes it, and some potential antidotes and preventative measures. You can check out Chronosphere here: https://chronosphere.io/ You can find Paige on LinkedIn: https://www.linkedin.com/in/paigerduty/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.lin...
Feb 21, 2023 • 39 min • Season 2 Ep. 44
In this episode I discuss my "bigger picture" perspective of what observability needs to be, and why it's important that we include the business and the customer in what we monitor in the Digital Era. The books I highlight in this episode are... Observability Engineering https://www.oreilly.com/library/view/observability-engineering/9781492076438/ Sooner, Safer, Happier: https://soonersaferhappier.com/book/ The Phoenix Project https://www.oreilly.com/library/view/the-phoenix-project/97814571...
Feb 14, 2023 • 10 min • Season 2 Ep. 43
In this episode we speak to José Velez from Rely about reliability at scale, a top down approach to SLOs, the potential and limitations of AI and ML in operations, the question of service ownership, utilising the business criticality of services in how we monitor the underlying infrastructure, and much more. You can check out Rely at https://www.rely.io/ You can find José on LinkedIn: https://www.linkedin.com/in/josevelez-relyio/ You can find the official Slight Reliability podcas...
Feb 07, 2023 • 37 min • Season 2 Ep. 42
In this episode we speak to Ken Hamric about distributed tracing, leveraging tracing for better testing, and observability driven development. The tool that Henrik Rexed integrated with Tracetest was Kuberhealthy (https://www.cncf.io/projects/kuberhealthy/) and you can watch a video of him discussing it in combination with Tracetest here: https://youtu.be/PKQQEeeMYxg?t=2492 Ken also mentioned Charity Majors' writing about observability driven development: https://thenewstack.io/a-...
Jan 31, 2023 • 32 min • Season 2 Ep. 41
In this episode Stephen explores the pros and cons of centralising observability data. Is it practical to stand up a complex and costly data storage and retrieval solution? Is there another way? You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre...
Jan 24, 2023 • 11 min • Season 2 Ep. 40
This week I am joined by Ana Margarita Medina and Adriana Villela, the hosts of the On-Call Me Maybe podcast, to discuss what we'd like to see for SRE in 2023. We talk about observability, SRE recruitment, what organisations need in place to set SRE up for success, and much more. You can find the On-Call Me Maybe podcast on most podcast platforms or go directly to the website here: https://oncallmemaybe.com/ Twitter: https://twitter.com/oncallmemaybe Mastodon: https://mastodon.soc...
Jan 17, 2023 • 42 min • Season 2 Ep. 39
To begin 2023 I share the books I read last year in my quest to be a better SRE. Here is a list of all the books mentioned during the episode: The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford https://www.amazon.com/Phoenix-Project-DevOps-Helping-Business/dp/0988262592 Site Reliability Engineering (by Google) https://sre.google/sre-book/table-of-contents/ Sooner, Safer, Happier by Jonathan Smart https://soonersaferhappier.com/book/ The Toyota Way by Jeffrey Liker ht...
Jan 09, 2023 • 10 min • Season 2 Ep. 38
This week Henrik Rexed and Stephen Townshend discuss their New Year's resolutions for observability. They cover OpenTelemetry and a unified query language, continuous profiling, raw data analysis, instrumenting code, using distributed tracing as part of testing, and much more. Some of the tools or resources mentioned during the episode include: https://tracetest.io/ (distributed tracing for testing) https://github.com/open-telemetry/opamp-go (OTEL orchestration) https://ebpf.io/ (...
Dec 19, 2022 • 46 min • Season 2 Ep. 37
This week we talk to Steve Gill and Gwen Berry from IAG to discuss their experiences forming an SRE incubator team (starting SRE from scratch in a large enterprise). We discuss on-call, SLOs, single pane of glass, pivoting, chaos engineering, and much more. You can find Steve on LinkedIn: https://www.linkedin.com/in/stevegill239/ You can find Gwen on LinkedIn: https://www.linkedin.com/in/gwen-berry-56324418b/ You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshe...
Dec 12, 2022 • 28 min • Season 2 Ep. 36
This week I share the observations I made at AWS re:Invent relating to SRE work including the lack of SREs at the event, data warehouses for observability data, the use of topologies to understand complexity, FinOps, serverless, making sense of enormous amounts of data... and more. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre...
Dec 05, 2022 • 16 min • Season 2 Ep. 35
This week I was at the AWS re:Invent conference in Las Vegas, so I took the opportunity to walk around the expo asking observability vendors what their perspective or definition of "observability" was (and reflected on that). You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre...
Nov 30, 2022 • 8 min • Season 2 Ep. 34
In this episode I explore the different kinds of SRE out there and the different needs they fill in the industry, and discuss some ethically dubious practices around hiring SREs. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC Outro: https://uppbeat.io/t/mountaineer/voyager License code: 5C0VMTUO...
Nov 21, 2022 • 14 min • Season 2 Ep. 33
In this episode I chat to Kyle Forster and Shea Stewart from RunWhen about the concept of "social reliability engineering" and how it could help SREs from organisations all over the world create an ecosystem of sharing and collaboration. You can find Kyle on LinkedIn: https://www.linkedin.com/in/kyforster/ You can find Shea on LinkedIn: https://www.linkedin.com/in/sheastewart/ To find out more about RunWhen: https://www.runwhen.com/ And an example of the "street map view" of a tec...
Nov 14, 2022 • 45 min • Season 2 Ep. 32
In this episode I reflect on the very first episode of Slight Reliability, "What the heck is SRE anyway?", and see if my perspective has changed since then. I also tackle the confusion about what SRE is and is not. Shout out to Sebastian Vietz (https://www.linkedin.com/in/sebastianvietz/) for his "Service Reliability Engineering" terminology and Richard Benwell (https://www.linkedin.com/in/richard-benwell-ab887b11/) for highlighting the way SRE offers a different value proposit...
Nov 07, 2022 • 10 min • Season 2 Ep. 31
In this episode I announce my new role as Developer Advocate (SRE) at SquaredUp, and what this means for the Slight Reliability podcast. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC Outro: https://uppbeat.io/t/mountaineer/voyager License code: 5C0VMTUOULFSRSTM...
Oct 31, 2022 • 7 min • Season 2 Ep. 30
In this episode I give a summary of the book Team Topologies by Matthew Skelton and Manuel Pais (https://teamtopologies.com/book) and how this relates to implementing SRE practices. (POINT OF CORRECTION: One of the authors is "Matthew" Skelton, not "Michael") You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSE...
Oct 24, 2022 • 18 min • Season 2 Ep. 29
In this episode I give my take on the Accelerate State of DevOps Report 2022 from the SRE perspective. You can find the Accelerate State of DevOps Report 2022 here: https://cloud.google.com/devops/state-of-devops/ You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC Outro: https://uppbeat.io/t/mountaineer/vo...
Oct 17, 2022 • 13 min • Season 2 Ep. 28
In this episode I share my experience relapsing into anxiety and insomnia, ruminate on an SRE's sphere of influence, and tease an upcoming change of role. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC Outro: https://uppbeat.io/t/mountaineer/voyager License code: 5C0VMTUOULFSRSTM...
Oct 10, 2022 • 14 min • Season 2 Ep. 27
In this episode I reflect on the book "The Toyota Way" by Jeffrey Liker, and explore four principles which resonate with my work. The book in question is The Toyota Way: https://www.amazon.com/Toyota-Way-Second-Management-Manufacturer/dp/1260468518 You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC ...
Sep 26, 2022 • 19 min • Season 2 Ep. 26
In this episode I discuss the concept behind continuous delivery and share the ideas we've been exploring at IAG. The book I mentioned is The Toyota Way: https://www.amazon.com/Toyota-Way-Second-Management-Manufacturer/dp/1260468518 You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC Outro: https://u...
Sep 19, 2022 • 9 min • Season 2 Ep. 25
In this episode I have a chat with Abby Bangser about the transition from testing to SRE, the barriers thrown in front of testers (which SREs don't tend to face), being humble to be let in the door, and *much* more. You can find Abby on LinkedIn: https://www.linkedin.com/in/abbybangser/ The book she mentioned was Infrastructure as Code by Kief Morris https://www.thoughtworks.com/insights/books/infrastructure-as-code-2nd-edition You can find Charity Majors (cofounder of Honeycomb) on T...
Sep 12, 2022 • 29 min • Season 2 Ep. 24
In this episode I share the story of Grafana Central, an observability platform that we've been standing up at IAG. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC Outro: https://uppbeat.io/t/mountaineer/voyager License code: 5C0VMTUOULFSRSTM...
Sep 05, 2022 • 19 min • Season 2 Ep. 23
In this episode I share a talk I gave earlier in the year as part of the Grafana User Group APAC. I share our experiences attempting to implement SLOs at IAG, and our reliability benchmarking work, which is a great way to get started if SRE is brand new to your organisation. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times Licens...
Aug 29, 2022 • 19 min • Season 2 Ep. 22
In this episode I share experiences and ideas about Kubernetes, and what I learned from speaking to Ruben Hakopiean from Kubevious. I'd like to give a huge shout out to Ruben. Many of the topics and ideas discussed come straight from the interview we recorded (but were unable to publish due to audio issues). You can find Ruben on LinkedIn: https://www.linkedin.com/in/rubenhak/ And find out more about Kubevious here: https://kubevious.io/ You can find me on: L...
Aug 22, 2022 • 12 min • Season 2 Ep. 21
In this episode I have a chat with Joey Hendricks about running performance tests in production. You can find Joey on LinkedIn: https://www.linkedin.com/in/joey-hendricks/ And GitHub: https://github.com/JoeyHendricks You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC Outro: https://uppbeat.io/t/moun...
Aug 15, 2022 • 32 min • Season 2 Ep. 20
In this episode I share my takeaways from the NZ DevOps Summit held in Auckland. This was the first in-person event I had attended in three years. You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/t/sensho/good-times License code: QBXDSEGNJZY9DDIC Outro: https://uppbeat.io/t/mountaineer/voyager License code: 5C0VMTUOULFSRSTM...
Aug 08, 2022 • 12 min • Season 2 Ep. 19
In this episode I have a chat with Chris Evans from incident.io about using incidents to lift the lid on an organisation, how aiming for zero incidents can stall an organisation, how tracking MTTR is unhelpful, and much more. You can find Chris on LinkedIn: https://www.linkedin.com/in/evnsio/ Here are the resources Chris mentioned... The practical guide to incident management: http://incident.io/guide The Field Guide to Understanding Human Error (by Sidney Dekker) https://www.orei...
Aug 01, 2022 • 32 min • Season 2 Ep. 18
In this episode I have a chat with Ganesh Datta, CTO and co-founder of Cortex.io. We discuss the human challenges of microservices, gamifying reliability, connecting business outcomes with SRE work, and much more. You can find Ganesh on LinkedIn: https://www.linkedin.com/in/gsdatta/ You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!). Intro: https://uppbeat.io/...
Jul 18, 2022 • 27 min • Season 2 Ep. 17
In this episode I have a chat with Sebastian Vietz, an SRE lead based in Canada who has been leading the implementation of SRE across different teams and organisations for eight years. We discuss SLO adoption, SRE going mainstream, virtual teams, and many other topics. You can find Sebastian on LinkedIn: https://www.linkedin.com/in/sebastianvietz/ You can find me on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre M...
Jul 11, 2022 • 41 min • Season 2 Ep. 16