As 2023 comes to a close, in the spirit of Dickens’ holiday classic “A Christmas Carol,” let’s reflect on the valuable insights left by the ghosts of network operations teams past, present, and yet to come. Tune in to hear host Mike Hicks (Principal Solutions Analyst at ThousandEyes) discuss lessons from the NetOps teams of the past, the current state of NetOps, and what the future might hold—all with the goal of helping teams take steps to optimize performance and deliver delightful digital exp...
Dec 21, 2023•22 min•Ep 79•Transcript available on Metacast Recent changes appeared to trigger a series of events for two peering points internationally—with very different impacts. Tune in to learn more about these incidents, why they differed, and the lessons they leave. Mike Hicks, Principal Solutions Analyst at ThousandEyes, will also cover the latest outage numbers and explore other recent incidents, including an Oracle Cloud outage and a duo of disruptions at Alibaba Cloud. Interested in more outage analysis? Check out our Internet Outages Timeline...
Dec 12, 2023•14 min•Ep 78•Transcript available on Metacast As companies gear up for Black Friday, The Internet Report team shares some best practices for delivering great customer experiences and minimizing downtime during one of the retail industry’s biggest days of the year. Mike Hicks, Principal Solutions Analyst at ThousandEyes, will cover some helpful case studies of Black Fridays that experienced some hiccups and what you can do to guard against similar disruptions. To learn more, check out the link below: - https://www.thousandeyes.com/blog/inter...
Nov 22, 2023•15 min•Ep 77•Transcript available on Metacast Backend-related incidents have been a recurring theme in outages across 2023, caused by everything from data center issues and hardware mishaps to failures at common (shared) services. Recently, we saw two examples of these backend issues when data center power problems led to outages at both Cloudflare and Workday. Tune in to hear more about what happened at Cloudflare and Workday, as well as our analysis of disruptions at OneLogin and GitLab. ——— CHAPTERS 00:00 Intro 01:00 OneLogin Disruption ...
Nov 14, 2023•32 min•Ep 76•Transcript available on Metacast This Halloween, The Internet Report team is sharing some of their most thrilling (and chilling) networking tales. Pull up a chair (and a big bowl of your favorite Halloween candy) to hear what happened—and important lessons learned. ——— CHAPTERS 00:00 Intro 01:40 Haunting obstacles with a dynamic routing protocol that thwarted crew changes on an oil platform 10:00 A spooky code base rollout that unleashed memory leak mischief 18:58 A chilling application rollout that failed to deliver on user ex...
Nov 01, 2023•44 min•Ep 75•Transcript available on Metacast In recent weeks, back-end infrastructure work and other backend-related issues impacted various online and consumer banking services, including DBS and Citibank in Singapore. Simple front-facing customer experiences that we’ve become accustomed to today can often mask considerable complexity on the backend. The service delivery chain of technologies powering the front end often comprises a mix of on-premises assets, cloud services, containers, and APIs. A degradation or outage to just one of tho...
Oct 30, 2023•24 min•Ep 74•Transcript available on Metacast Outages and degradations can happen when underlying data isn’t fresh enough. In recent weeks, stale data may have contributed to incidents at both Slack and Cloudflare. Slack began experiencing issues when, by our best guess, its app stopped trusting the freshness of the data in the cache; and, separately, Cloudflare’s 1.1.1.1 DNS resolver ran into some issues related to stale root zone data. Watch this Pulse Update episode to hear more about the Cloudflare and Slack outages, and also explore re...
Oct 17, 2023•31 min•Ep 73•Transcript available on Metacast Providing great digital experiences relies on a complex service delivery chain. The past few weeks brought multiple reminders that the root cause of cloud and app disruptions often comes down to one single link in this chain. While the component at issue may appear small, if it’s not functioning normally, the consequences can be significant. Additionally, the impact of a malfunctioning “link” is often intensified by a lack of understanding or visibility into the entire end-to-end service deliver...
Oct 02, 2023•22 min•Ep 72•Transcript available on Metacast In a world that operates at “hyperscale,” the potential for hyperscale-sized problems is also very real. The measure of a good provider—and a well-engineered system—is how well they handle these anomalous conditions and minimize disruption. During recent weeks, some of these hyperscale-sized outages hit, including data center-focused disruptions that impacted companies like Square, Oracle OCI, NetSuite, and Microsoft Azure. Tune into this Pulse Update episode to go under the hood of these outage...
Sep 15, 2023•33 min•Ep 71•Transcript available on Metacast An outage occurs, a change is rolled back, and everything stabilizes. But what happens when the change is attempted a second time? These second tries often go much more smoothly. While another outage might still occur during this “take two,” the impact is usually far less severe. The engineering team has learned from what went wrong the first time and is ready to stop at the first hint of trouble. Slack recently experienced a pair of disruptions that appear to illustrate this “take two” scenario...
Sep 02, 2023•21 min•Ep 70•Transcript available on Metacast Context matters when working on a distributed web-based application or service where everything is linked and dependent on each part functioning correctly. It’s all too easy for one team to make a change that unexpectedly affects something another team is working on. Or the combined impact of both changes may also accidentally break something. To avoid such mishaps, teams should cut back on silos as much as possible. However, it’s hard to completely eliminate siloed operations or decision-making...
Aug 21, 2023•34 min•Ep 69•Transcript available on Metacast In an end-to-end service delivery chain, isolated changes can have broad consequences. This played out recently when an erroneous SSL certificate change at Microsoft appeared to cause a SharePoint Online and OneDrive for Business outage. While this incident definitely underscores the importance of valid security certificates, it’s also a reminder of what can happen when even one component in an end-to-end service delivery chain experiences issues. Every component needs to work in sync to maintai...
Aug 05, 2023•27 min•Ep 68•Transcript available on Metacast Let’s face it. Not every contingency can be planned for. Sometimes an outlier scenario pops up and causes an unexpected outage or disruption. Over the past few weeks, multiple companies appeared to be impacted by such edge cases: Azure; GitLab; and Meta’s WhatsApp, Facebook, Instagram, and Threads—its newest addition. Tune into the latest Pulse Update episode to learn more about what happened during these disruptions and why robust visibility is so important for navigating unexpected outlier sce...
Jul 21, 2023•19 min•Ep 67•Transcript available on Metacast The application opens, but users encounter errors when they try to do anything—what gives? It’s the curious case of the disappearing backend. Discover why application issues often show up like this, with the service reachable but unresponsive beyond rendering a basic landing page, and sometimes an accompanying error message. In this episode, hosts Mike Hicks and Brian Tobia discuss this common problem and explore related incidents at CBA, GitHub, and Microsoft Teams. They also unpack other recen...
Jul 10, 2023•18 min•Ep 66•Transcript available on Metacast Though network outages are still far more common, application outages seem to be increasing in 2023—and having bigger impacts. Tune in to learn more about this trend and dive into incidents at Okta and Instagram. Host Mike Hicks will also explore other outage trends from the first half of the year in this special episode reflecting on the state of the Internet in 2023 thus far. To learn more, check out these links: - Internet Report: Pulse Update Blog: https://www.thousandeyes.com/blog/internet-...
Jun 28, 2023•22 min•Ep 65•Transcript available on Metacast For three consecutive years, there appears to have been a spike in outages and degradations in May. A potential “spring cleaning effect” may explain why. Tune in to learn more about this possible trend and explore what happened during recent incidents at Twitter; Microsoft 365; Slack; Instagram; Apple’s iMessage; and subscription-based streaming service, Max (formerly known as HBO Max). After watching, check out these links to dive deeper: Internet Report: Pulse Update Blog: https://www.thousand...
Jun 10, 2023•27 min•Ep 64•Transcript available on Metacast Tune in to explore ways that outages can impact distributed software development teams and what companies can learn from recent incidents at GitHub, Google Cloud, and Apple. To learn more, check out these links: Internet Report: Pulse Update Blog: https://www.thousandeyes.com/blog/internet-report-pulse-update-outages-and-distributed-dev-teams?utm_source=transistor&utm_medium=referral&utm_campaign=InternetReportPulseEp11 Explore the GitHub service degradation in the ThousandEyes platform ...
May 26, 2023•27 min•Ep 63•Transcript available on Metacast When it comes to your technology strategy, it's a good idea to have more than one way to access every resource—just in case. As IT environments have changed, so has the thinking around the right approaches to achieve this desired redundancy. Two recent incidents at Google Cloud and Microsoft 365 reinforce the importance of redundancy—and the need for evolving strategies to meet this goal. To learn more, check out these links: Internet Report: Pulse Update Blog: https://www.thousandeyes.com/blog/...
May 15, 2023•27 min•Ep 62•Transcript available on Metacast Understanding the unique characteristics of different kinds of Internet outages can help you quickly recognize the type of incident you’re dealing with and take the right steps to mitigate its impact. This week’s episode discusses the anatomy of common outage categories and explores recent case studies: - Security-related incidents: Western Digital and SD Worx outages - A single-point-of-aggregation issue: SpaceX’s Starlink outage - Last-mile challenges: Vodafone UK outage To learn more, check o...
Apr 28, 2023•18 min•Ep 61•Transcript available on Metacast This week’s Pulse Update unpacks OpenAI’s ChatGPT outage and discusses why the outage actually represented a pragmatic move on the part of OpenAI. We’ll also discuss global outage trends; explore other recent incidents at Dish Network, Microsoft, and Virgin Media UK; and look at why responses to performance problems vary, based on application characteristics and usage patterns. To learn more, check out the links below: - Internet Report: Pulse Update Blog: https://www.thousandeyes.com/blog/inter...
Apr 17, 2023•28 min•Ep 60•Transcript available on Metacast On April 4, 2023, Virgin Media UK (AS 5089) experienced two outages that impacted the reachability of its network and services to the global Internet. The two outages shared similar characteristics, including the withdrawal of routes to its network, traffic loss, and intermittent periods of service recovery. In this episode, we discuss how the outages unfolded and what IT teams can learn from this to help navigate similar incidents in the future. To learn more, check out the links below: - Blog:...
Apr 08, 2023•27 min•Ep 59•Transcript available on Metacast HTTP 403, 503, and 504 status codes dominated the last few weeks as multiple companies experienced application degradations and outages. These incidents at companies like Okta, Twitch, Reddit, and GitHub leave important lessons on navigating similar issues and minimizing downtime for your own users. To learn more, check out the links below: - Internet Report: Pulse Update Blog: https://www.thousandeyes.com/blog/internet-report-pulse-update-application-errors - Explore the Okta and Reddit outages...
Mar 31, 2023•25 min•Ep 58•Transcript available on Metacast It was an eventful fortnight on the Internet as Twitter, Dish Network, Akamai, and Ticketek Australia all experienced outages. Tune into our latest episode for insights from our analysis of these events and practical tips for IT teams. To learn more, check out the links below: - Internet Report: Pulse Update Blog: https://www.thousandeyes.com/blog/internet-report-pulse-update-twitter-outages-and-more - Explore the Twitter and Dish Network outages in the ThousandEyes platform (NO LOGIN REQUIRED): ...
Mar 20, 2023•26 min•Ep 57•Transcript available on Metacast In the space of a week, we saw two data center-related incidents lead to long Microsoft and Oracle outages. Join us as we analyze these outages and ways IT teams can minimize downtime in similar situations. We’ll also discuss a series of application issues that impacted companies including Twitter and Tesla. To learn more, check out the links below: Internet Report: Pulse Update Blog Explore the Atlassian outage in the ThousandEyes platform (NO LOGIN REQUIRED) Chapters 00:00 Intro 00:34 The Down...
Mar 04, 2023•22 min•Ep 56•Transcript available on Metacast We discuss insights from a recent trio of similar incidents at Microsoft, Cloudflare, and Slack, along with other outage news, including a Comcast outage that impacted some Philadelphia neighborhoods on Super Bowl Sunday. 00:00 Intro 00:58 Outage Trends: By the Numbers 4:33 Microsoft Outage (Jan. 25) 4:58 Cloudflare Outage (Jan. 24) 9:27 Slack Outage (Jan. 25) 13:16 Microsoft Outlook Outage (Feb. 7) 18:06 Square Outage (Feb. 7) 20:39 Comcast Outage (Feb. 12) 23:23 Get in Touch To learn more, che...
Feb 18, 2023•24 min•Ep 55•Transcript available on Metacast Live from #CiscoLiveEMEA, we discuss the Feb. 7 Microsoft Outlook outage to understand how the event unfolded, why it may have played out the way it did, and what you can learn from this outage event. To dive deeper, check out the links below: Explore the outage in the ThousandEyes platform (NO LOGIN REQUIRED) Microsoft Outlook Outage Analysis Blog (Feb. 7) Microsoft Outage Analysis Blog (Jan. 25) Want to get in touch? If you have questions, feedback, or guests you'd like to see featured on the ...
Feb 08, 2023•16 min•Ep 54•Transcript available on Metacast In this episode, we cover the latest internet trends and unpack important takeaways from the recent FAA, Fastly, and Microsoft outages. We also discuss how several early 2023 outages and disruptions reinforced the need for application monitoring and testing to counter, or at least anticipate the effect of, anomalous conditions on certain routes. 00:00 Intro 1:32 Outage Trends: Week of Jan. 30 7:07 FAA Outage (Jan. 11) 11:04 Fastly Outage (Jan. 19) 15:31 Microsoft 365 Outage (Jan. 17) 19:52 Micro...
Feb 03, 2023•30 min•Ep 53•Transcript available on Metacast At around 7:05 a.m. UTC on January 25, 2023, Microsoft started experiencing service-related issues. At the same time, ThousandEyes observed BGP withdrawals and a significant number of route changes that resulted in a high level of packet loss, ultimately affecting various services like Outlook, Teams, SharePoint, and others. 00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. Join our co-hosts Angelique Medina, Head of Interne...
Jan 31, 2023•28 min•Ep 52•Transcript available on Metacast This episode covers the latest global network outage numbers and interesting end-of-year trends; how resilient application architectures, clouds, and networks are challenging old ways of thinking; and a deep dive into an outage that disrupted Spotify’s music streaming on December 14, 2022. To learn more, check out the links below: Internet Report Pulse Update Blog Explore the Spotify outage in the ThousandEyes platform (NO LOGIN REQUIRED) Part 1 Part 2 Part 3 Chapters 00:00 Intro 1:12 Outage Tre...
Jan 19, 2023•20 min•Ep 51•Transcript available on Metacast This is the Internet Report: Pulse Update, where we review and provide analysis of significant outages and trends across the Internet from the previous two weeks. Every other week, we'll publish a new episode covering the latest tally of outage events and highlighting a few interesting outages. This week, in addition to our usual look at global and U.S. outage trends, we’ll take a brief look at how Twitter is holding up since its sale to Elon Musk, plus a couple of interesting outages at Mic...
Dec 17, 2022•23 min•Ep 50•Transcript available on Metacast