This week on Slight Reliability Stephen discusses observability vendor lock-in. What is it? What does OpenTelemetry do to help? What areas are yet to be solved? You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre YouTube: https://www.youtube.com/c/SlightReliability Instagram: https://www.instagram.com/slight_reliabil...
Oct 31, 2023•9 min•Season 2 Ep. 74
This week we sit down and talk about SLOs with CPO and co-founder of Nobl9 Brian Singer. We talk about the importance of reviewing operational effectiveness, getting buy-in from leadership, using SLOs to reduce noise, how to implement SLOs within different cultures and structures, the parallels between security and reliability... and much more. You can check out Nobl9's reliability and SLO platform here: https://www.nobl9.com/ You can find Brian on LinkedIn: https://www.linkedin.c...
Oct 24, 2023•32 min•Season 2 Ep. 73
This week Stephen chats with Valeska Victoria about her time working as an SRE at eBay. Valeska shares her data-driven approach to SRE, having a voice as a less experienced engineer, handling incidents under high pressure, leveraging large language models to rapidly find the information you need during an incident, and much more. You can check out PromptOps here: https://www.promptops.com/ You can find Valeska on LinkedIn: https://www.linkedin.com/in/valeska-victoria/ You can find...
Oct 17, 2023•42 min•Season 2 Ep. 72
This week Stephen chats with Dr. Vlad Ukis about his journey discovering, and then implementing, SRE practices at Siemens Healthineers (which led to him writing a book). They discuss how the evolution of infrastructure necessitates a shift in how we operate, the power of selling SRE practices, the SRE infrastructure used to build SLOs and reliability capabilities, how he implemented SLOs, and much more. You can find Vlad's book "Establishing SRE Foundations" here: https://www.amazo...
Oct 10, 2023•29 min•Season 2 Ep. 71
Amin Astaneh (from Certo Modo) is back to discuss his experience working as a production engineer (SRE equivalent) at Meta. Stephen and Amin discuss what it's like interviewing for big tech, "you build it, you own it", different SRE engagement models, SRE at different sizes of organisation, socialising your SRE success as a way to get traction, and so much more. You can find Amin on his company website https://certomodo.io , LinkedIn: https://www.linkedin.com/in/aminastaneh/ and T...
Oct 03, 2023•42 min•Season 2 Ep. 70
This week Stephen talks to Praveen Kasam from Diconium Digital Solutions about how he led SRE transformations. Praveen shares his experience transitioning from development to SRE and how leveraging automation and bringing application knowledge to the ops team provided quick wins. He also covers how he later applied SRE concepts to uplift the wider organisation. If you are out there looking for advice on how to implement SRE in your organisation, this is the episode for you. You ca...
Sep 26, 2023•30 min•Season 2 Ep. 69
This week Stephen asks Eric Schabell (Director of Technical Marketing & Evangelism @ Chronosphere) about how dashboards fit into modern observability. They discuss how untamed observability can lead to unexpectedly high cloud bills, the similarities between dashboards and documentation, the "know > triage > understand" workflow, and much more. You can find Eric at: LinkedIn: https://www.linkedin.com/in/ericschabell/ X: https://twitter.com/ericschabell And you can find Ch...
Sep 19, 2023•33 min•Season 2 Ep. 68
This week Stephen chats with Jamie Allen (Chief Technologist AWS & SRE @ EPAM Systems) and Adam Kinniburgh (VP Innovation @ SquaredUp) about the concept of a single pane of glass (SPOG) for SRE. Is it performance art or something actionable? Can alerting replace the need for dashboards? And are metrics drowning in the wake of distributed tracing? You can find Jamie at: LinkedIn: https://www.linkedin.com/in/jlallen/ And the Single Pain of Glass article he wrote here: https://me...
Sep 12, 2023•35 min•Season 2 Ep. 67
This week Stephen brings back Kyle Forster from RunWhen to talk about the purple elephant in the room… “AI”. What makes it GenAI, LLM, Advanced Statistics, or ML? Kyle shares his experience building AI-powered search engines for SRE troubleshooting commands and how to incorporate a (paid) open source community of experts rather than trust AI by itself. They discuss what search looks like under the hood, why GenAI-powered chatbots will or won't take over the SaaS indust...
Sep 05, 2023•30 min•Season 2 Ep. 66
This week Stephen chats with the internet incident librarian herself, Courtney Nash. They explore what Courtney has learned through meta-analysis of the more than ten thousand incidents in the Verica Open Incident Database (VOID). They cover why MTTR needs to go in the garbage, joint cognitive systems, the value of looking at near misses and *much* more. You can check out the VOID here: https://www.thevoid.community/ The two papers mentioned are: Ironies of Automation by Lisanne Bainb...
Aug 29, 2023•41 min•Season 2 Ep. 65
This week Stephen chats with Martin Thwaites from Honeycomb about how developers can leverage observability to better understand what they're building, solve bugs quicker, and have more time for coding. They also discuss OpenTelemetry (the protocol and semantic conventions), manual versus automatic instrumentation, and how keeping every span of trace data is irresponsible. You can find Martin at: LinkedIn: https://www.linkedin.com/in/martin-thwaites-ab445120/ X: https://twitter.co...
Aug 22, 2023•36 min•Season 2 Ep. 64
Observability is a necessary adaptation to make sense of software systems in the Digital Age, but how can we unlock its power for non-engineer stakeholders (such as executives, product owners, etc)? Perhaps we need a layer of abstraction sitting on top of our detailed observability to get the most out of it. You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend...
Aug 15, 2023•9 min•Season 2 Ep. 63
This week Stephen chats with former Google SRE Matt Brown about being on-call. They cover how to uplift junior engineers so they can be on-call, what a fair on-call schedule looks like, runbooks, and much more. As you heard, Matt believes flexibility is key to a healthy on-call rotation. Matt is exploring ideas for improvements to existing tooling and products in this space and would love to hear from as many listeners as possible with feedback on what they find useful or frustr...
Aug 01, 2023•37 min•Season 2 Ep. 62
The internet is full of people who want to tell you about SRE, DevOps, and Platform Engineering and how different and similar they are... and will give you the impression that these things compete with each other. But do they? And is it a helpful question to ask in the first place? You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter....
Jul 25, 2023•6 min•Season 2 Ep. 61
In this episode Amin Astaneh from Certo Modo discusses his experience undertaking an SRE transformation over several years. Stephen and Amin cover a lot of ground including making ops work visible, measuring toil, the power of calculating the $ value of work, getting developers on-call, the embedded model for SRE, SLOs, culture change, and a whole lot more. You can find Amin on his company website https://certomodo.io , LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitte...
Jul 11, 2023•43 min•Season 2 Ep. 60
In this episode Stephen Townshend and Sonja Chevre from Tyk discuss making APIs observable, and some anti-patterns to avoid. They cover GraphQL, OpenTelemetry and semantic conventions, correlation IDs, observability pipelines, and much more. You can find Sonja on LinkedIn: https://www.linkedin.com/in/sonjachevre/ and Twitter: https://twitter.com/SonjaChevre You can listen to Sonja's KubeCon talk here: https://youtu.be/IkEUJjRBCbo You can find Tyk's open source gateway here: https:...
Jul 04, 2023•40 min•Season 2 Ep. 59
In this episode Stephen Townshend and Harinder Seera explore how to monitor and manage the cost of cloud. They discuss FinOps as a cultural practice, anti-patterns for implementing in the cloud, keeping cost down through resources, pricing, and architecture... and much more. You can find Harinder on LinkedIn: https://www.linkedin.com/in/harinderseera/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: ...
Jun 27, 2023•37 min•Season 2 Ep. 58
In this episode Stephen shares his experiences traveling overseas to the UK and Singapore AWS Summit, SREcon APAC, and the internal SquaredUp conference "SqUpCon". You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Slight Reliability artwork on Instagram: https://www.instagram.com/slight_reliability/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre In...
Jun 20, 2023•16 min•Season 2 Ep. 57
A quick update on Stephen's whereabouts and when the next episode will be released.
Jun 09, 2023•2 min
In this episode Stephen discusses the role of dashboards within the context of the Digital Era. What are they *not* appropriate for? What can they help with? What kinds of things are suitable to present? If you want to get involved in the SquaredUp dashboard competition head along to: https://squaredup.com/blog/dashboard-competition/ (everyone who submits an entry gets a t-shirt, you can also win Star Wars Lego, get video interviewed by me, and have the story of your dashboard pre...
May 23, 2023•14 min•Season 2 Ep. 56
This week Bruce Cullen is back to share his experiences from KubeCon + CloudNativeCon 2023 Europe. We chat about OpenTelemetry, green engineering, securing your CI/CD pipeline and much more. Bruce is the Director of Engineering at SquaredUp. You can find him on LinkedIn: https://www.linkedin.com/in/bruce-cullen/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/ If you like Slight Reliability's mspaint style artwork you can find more of...
May 16, 2023•40 min•Season 2 Ep. 55
In this episode Stephen Townshend chats to Andy Thurai (VP and Principal Analyst at Constellation Research) about Andy's latest report titled "Trends in Incident Management 2023". They chat about "mean time to innocence", status pages, they debate whether AI or ML has real value for incident management, and ponder why anyone would willingly decide to become an incident commander? You can find Andy's report here: https://www.constellationr.com/research/2023-trends-incident-manageme...
May 09, 2023•32 min•Season 2 Ep. 54
In this episode Stephen Townshend chats to Tim Wheeler (Director of Engineering Services at SquaredUp) about his work implementing and continually monitoring DORA metrics. They chat about customising each metric to your own unique context, avoiding the weaponisation of metrics, the "tools will solve this for me" trap, and much more. The books mentioned during this episode were: Accelerate, The DevOps Handbook, The Phoenix Project, The Unicorn Project, Lean Enterprise, and Sooner, Saf...
May 02, 2023•28 min•Season 2 Ep. 53
In this episode Stephen explores the SRE concept of "toil". What is it? How can we measure it? How do we reduce it? Also in this episode: Can we make non-technology systems observable? (like we do technology ones), and the ineffectiveness of change advisory boards (CAB). Also, Stephen's upcoming attendance at SREcon, AWS Summit, and SLOconf. Shout outs to Steve McGhee, Dom Finn, and Shea Stewart. You can find the official Slight Reliability podcast website at: https://slightreliab...
Apr 25, 2023•9 min•Season 2 Ep. 52
In this episode Stephen Townshend and Anurag Gupta discuss the new reliability.org community for SREs or reliability engineers to share experiences, ask questions, and find community. They discuss the value of community and sharing your thoughts, collaboration between organisations, vicious versus virtuous cycles for reliability, and much more. You can join us in the community by visiting https://www.reliability.org/ You can find Anurag: On LinkedIn: https://www.linkedin.com/in/aw...
Apr 18, 2023•30 min•Season 2 Ep. 51
In this episode Bruce Cullen interviews Stephen Townshend about the past, present, and future of the Slight Reliability podcast. They discuss their shared backgrounds in software testing, the different career paths that testing has opened up, and much more! Bruce is the Director of Engineering at SquaredUp. You can find him on LinkedIn: https://www.linkedin.com/in/bruce-cullen/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can ...
Apr 11, 2023•39 min•Season 2 Ep. 50
In this episode Ivan Merrill from Fiberplane shares his experiences implementing observability within some of the large, complex organisations he's worked for in the past. You can find Ivan on LinkedIn: https://www.linkedin.com/in/ivan-merrill-1a05223/ You can find out more about Fiberplane here: https://fiberplane.com/ You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/steph...
Apr 04, 2023•39 min•Season 2 Ep. 49
In this episode I discuss the word "insight" within the context of observability. Is insight something tools can provide? Is it something you can reproduce? You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Twitter: https://twitter.com/the_kiwi_sre...
Mar 21, 2023•8 min•Season 2 Ep. 48
In this episode Stephen Townshend discusses our increased dependency on third party cloud services and what this means for reliability with Jeff Martens and Ryan Duffield from https://metrist.io/. You can find Jeff... On LinkedIn: https://www.linkedin.com/in/jmartens/ On Twitter: https://twitter.com/Jmartens You can find Ryan... On StackOverflow: https://stackoverflow.com/users/2696/ryan-duffield On GitHub: https://github.com/rduffield You can find the official Slight Reliability ...
Mar 14, 2023•33 min•Season 2 Ep. 47
In this episode I propose the use of scatterplots of raw data to better understand how our systems are behaving and what our customers are experiencing. The ideas from this episode come from my time as a performance engineer and working with legends in that space, Richard Leeke (https://www.linkedin.com/in/richard-leeke-450448/) and Neil Davies (https://www.linkedin.com/in/neildaviesnz/). For some basic examples of scatterplots and what they show you versus line charts check out a...
Mar 07, 2023•10 min•Season 2 Ep. 46