Cisco ThousandEyes: Digital Experience Monitoring and Troubleshooting (Networking Technology)

Speaker 1

00:00

Okay, let's unpack this. In today's hyperconnected world, our businesses, our work, and pretty much our entire social fabric are utterly reliant on one thing: the Internet. It's the invisible superhighway connecting everything. But what happens when that backbone gets a little unpredictable, when traffic jams appear out of nowhere or a critical exit closes down? How do we even begin to figure out what's going on out there?

Speaker 2

00:25

It's a massive challenge, isn't it? Companies today operate across this incredibly vast, unregulated tapestry of independent networks, and that lack of clear visibility creates immense risk for applications that are absolutely critical to their operation. Today, we're taking a deep dive into Cisco ThousandEyes, often affectionately called the Google Maps for the Internet, to uncover how it brings much-needed clarity to that chaos and helps pinpoint issues with, well, astonishing speed.

Speaker 1

00:51

Exactly. So our mission today is to give you a real shortcut to being well informed about how ThousandEyes helps monitor, troubleshoot, and truly optimize digital experiences. Get ready for those aha moments that cut through the noise and show you the hidden paths your data actually travels. Let's rewind a bit, though, to the genesis of all this. Picture the late 2000s: inside the UCLA Internet Research Lab, two PhD students, Mohit Lad and Ricardo Oliveira, made a

01:21

profound prediction. What was their big bet about the future of the Internet and what problem did they see emerging?

Speaker 2

01:27

Yeah, they had incredible foresight. Lad and Oliveira saw a future where the Internet wouldn't just be a way to connect; it would become the de facto enterprise backbone. They envisioned a world where every single transaction, every application, every employee communication, every critical business function would rely entirely on this external network.

Speaker 1

01:44

Wow.

Speaker 2

01:45

The profound problem they identified was that if this prediction came true, businesses would be relying on a vast external system, a collection of independent networks they had no control over and, crucially, no visibility into. It begged the question: how do you manage and fix something so vital yet so completely opaque?

Speaker 1

02:04

And that's where the famous Google Maps for the Internet comparison really shines, I guess. They realized businesses needed a platform that could map every single handoff, every segment of the Internet from one service provider to the next. Not just your internal network, but the entire digital journey, allowing you to understand exactly where performance issues or breaks in the chain were occurring.

Speaker 2

02:26

That's spot on. And you know, when we talk about network performance, it often conjures up images of bandwidth. But here's a critical insight. Most engineers are quite surprised when we say there are only two network metrics that truly impact TCP application performance: packet loss and latency.

Speaker 1

02:41

Only two? Really?

Speaker 2

02:42

Yeah. If your application is slow and neither of these has changed, then the network probably isn't the culprit. For real-time UDP applications like voice and video, jitter is also crucial, of course. But bandwidth, surprisingly, only matters when a lack of it causes loss or latency, not as a primary measure of performance itself. It's not about how wide the pipe is, but how smoothly data flows through it, you know.

Speaker 1

03:07

And that latency, it's not just one big number, is it?

Speaker 2

03:10

Not at all. It's a sum of many tiny delays along the path: the time a packet takes to get put on the wire (serialization delay), waiting in line at a router (that's queuing delay, think airport security), the pure distance it travels at light speed across fiber, which isn't instant and is rarely a straight line (distance delay), plus forwarding delay, even protocol delays. ThousandEyes helps you pinpoint which of these specific slowdowns is actually causing the problem.
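
As a rough illustration of how those delays add up, here is a minimal Python sketch. The packet size, link speed, path length, and the queuing and forwarding figures are hypothetical values chosen for the example, and the function names are ours rather than anything from ThousandEyes.

```python
# Rough one-way latency estimate as a sum of per-component delays.
# All input values below are illustrative assumptions, not measurements.

SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum
FIBER_VELOCITY_FACTOR = 0.67        # light in fiber travels at roughly 2/3 of c


def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Time needed to clock the packet's bits onto the wire."""
    return packet_bytes * 8 / link_bps * 1000


def distance_delay_ms(path_km: float) -> float:
    """Propagation time over the fiber route (the real route, not a straight line)."""
    return path_km / (SPEED_OF_LIGHT_KM_S * FIBER_VELOCITY_FACTOR) * 1000


def one_way_latency_ms(packet_bytes: int, link_bps: float, path_km: float,
                       queuing_ms: float, forwarding_ms: float) -> float:
    """Sum the individual delay components into a single latency figure."""
    return (serialization_delay_ms(packet_bytes, link_bps)
            + distance_delay_ms(path_km)
            + queuing_ms
            + forwarding_ms)


if __name__ == "__main__":
    total = one_way_latency_ms(
        packet_bytes=1500, link_bps=100e6, path_km=4000,
        queuing_ms=2.0, forwarding_ms=0.5,
    )
    print(f"estimated one-way latency: {total:.2f} ms")
```

With these example numbers, distance dominates the total, which is why the physical route matters so much more than link speed on long paths.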

Speaker 1

03:37

Okay, so if the Internet is this vast, opaque system, how do we actually get eyes on what's happening? What's the core mechanism ThousandEyes uses to bring this digital world into view?

Speaker 2

03:48

Well, we deploy compact pieces of code, our agents, as your eyes and ears, your vantage points into this digital world. We have three main types, each strategically placed for a different perspective. Enterprise agents are deployed inside your network, behind your firewall.

Speaker 1

04:02

Okay, the internal view.

Speaker 2

04:05

Exactly. These give you crucial visibility into your internal infrastructure: your data centers, your branch offices, even running on Cisco devices themselves. Then, to monitor everything external, your SaaS applications, public cloud services, and the Internet itself, seeing how your services look to the rest of the world, we use cloud agents.

Speaker 1

04:22

Which ThousandEyes manages?

Speaker 2

04:25

Right, ThousandEyes manages those globally. And for the ultimate user-centric view, endpoint agents sit directly on your users' devices, like their laptops, Windows or macOS.

Speaker 1

04:34

Ah, so right on the machine itself.

Speaker 2

04:36

Precisely, capturing their actual experience from their machine, over the local Wi-Fi, all the way to your applications. It's that end-to-end user perspective.

Speaker 1

04:45

That's a really comprehensive view. But what if you have a massive organization or, I don't know, a single agent gets swamped with too many tests? You risk incomplete or missed data, right?

Speaker 2

04:57

Absolutely, that's a really important point about scalability, and it's precisely why ThousandEyes uses agent clusters.

Speaker 1

05:04

Clusters.

Speaker 2

05:04

Yeah, think of it like a team of agents working together, sharing the workload. You aggregate multiple agents into one logical entity and they distribute the testing amongst themselves. This ensures your monitoring is always consistent and reliable, even during traffic spikes, delivering a complete picture without any dropped insights or skipped test rounds.

Speaker 1

05:22

Okay, that makes sense. So with all this visibility, what does it mean for actually solving problems and getting to those critical aha moments you mentioned? ThousandEyes breaks down monitoring into five key layers: routing, network, DNS, web, and voice. Let's dig into a few examples that really bring these layers to life.

Speaker 2

05:40

Sure. A crucial detail here is the ability to monitor the Internet's literal directions at the routing layer. ThousandEyes tracks BGP data, which is like the Internet's postal service, you know. This lets you detect serious issues like BGP route leaks, akin to a mistake in sharing directions that unintentionally sends traffic where it shouldn't go, or even worse, BGP route hijacking.

Speaker 1

06:03

Hijacking? That sounds bad.

Speaker 2

06:06

It is. It's the unauthorized takeover of IP address blocks to divert traffic, maybe maliciously. So it's about ensuring your traffic reaches its intended destinations securely and along the intended path, and immediately alerting you when those fundamental directions go awry.

Speaker 1

06:21

Got it. Critical for security and just basic connectivity.

Speaker 2

06:25

Absolutely. Then, moving up to the web layer, you can certainly do basic HTTP server tests for simple availability and response time, standard stuff. But here's where it gets truly powerful: page load tests. These simulate a real user visiting a website and give you an incredibly detailed, user-centric perspective. And detailed it is: you get this amazing waterfall chart that visually shows you the sequence and timing of every single element loading on a web page. Images, scripts, CSS, everything.

Speaker 1

06:51

Ah, so you see exactly what's slow.

Speaker 2

06:53

Exactly. It's like seeing why a page feels slow, element by element, not just getting a single load time number.

Speaker 1

06:58

That sounds incredibly useful. But many user journeys aren't just a single page load, are they? What if a user's experience involves multiple steps, like a checkout process?

Speaker 2

07:08

You're right, that's a common scenario, and for that we have transaction tests. These are scripted, multi-step user interactions that simulate an entire user journey. Imagine loading Amazon, searching for a product, adding it to a cart, and then completing the checkout.

Speaker 1

07:24

So you script the whole flow?

Speaker 2

07:27

Correct. This allows you to pinpoint performance issues across an entire workflow, not just a single page load. You can find bottlenecks in complex, multi-stage applications that you'd otherwise miss.

Speaker 1

07:37

And for those critical collaboration tools and VoIP calls that have become so essential to our daily work, I assume there's something for that too.

Speaker 2

07:45

Absolutely. The voice layer offers SIP server tests for call initiation and authentication. That's the setup phase of any VoIP call: can the call even connect?

Speaker 1

07:53

Right, the handshake.

Speaker 2

07:56

Exactly. And then the RTP stream tests focus on the actual audio and video quality of the call itself. These go far beyond general network performance, looking at specific metrics like mean opinion score, MOS, basically how good the call sounds.

Speaker 1

08:09

Ah, the MOS score.

Speaker 2

08:11

Yep. Plus packet loss and jitter, all directly relevant to voice and video codecs. It's how you know if your video conference call sounds like you're talking through a tin can, and precisely why, instead of just guessing it's the network.
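
The episode does not say how ThousandEyes derives its MOS figure. For orientation only, the sketch below shows the standard conversion from an E-model transmission rating (the R-factor of ITU-T G.107) to an estimated MOS. The function name is ours, and the R values in the loop are arbitrary examples.

```python
# Standard E-model mapping from R-factor to estimated MOS (ITU-T G.107).
# This is a general illustration, not a description of ThousandEyes internals.


def r_factor_to_mos(r: float) -> float:
    """Convert a transmission rating R into an estimated mean opinion score."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


if __name__ == "__main__":
    # Example R values; loss, delay, and codec impairments all pull R down.
    for r in (93.2, 80, 70, 50):
        print(f"R = {r:5.1f}  ->  MOS ~ {r_factor_to_mos(r):.2f}")
```

Loss and jitter matter for voice because they lower R, and the mapping to MOS is not linear, so moderate impairments can produce a noticeable drop in perceived quality.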

Speaker 1

08:24

It's incredible how often we hear "it's the network," even when traditional monitoring tools seem to disagree. Let's talk about some real-world examples. Remember the customer who spent two million dollars overhauling their hosting architecture, but still had problems on NFL game days? Their existing monitoring wasn't telling the whole story.

Speaker 2

08:42

Oh yeah, that's a classic. And what I find particularly compelling about that story is how easily critical issues can hide in plain sight. The team was monitoring bandwidth utilization with five-minute averages, which looked fine, often peaking around, say, 20 to 30 percent. Perfectly normal, right? It seems okay. But what they missed were the outbound discards: tiny, bursty moments where over half a million users simultaneously updated for an NFL play. Microbursts.

Speaker 1

09:08

Ah, the averages smoothed it out.

Speaker 2

09:11

Exactly. These microbursts overloaded the network interfaces, dropping packets silently. It wasn't about the pipe's overall size, but its inability to handle these sudden spikes in demand. It really makes you wonder how many organizations are still flying blind to these kinds of subtle, bursty issues that cause severe operational impact.
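
To see how a five-minute average can hide a microburst, here is a small Python sketch. The link capacity, baseline load, and burst size are invented for the illustration, and it deliberately ignores buffering, so treat it as a simplified model rather than a reconstruction of the customer's network.

```python
# Illustrative only: a short burst that saturates a link barely moves a
# five-minute average. Link capacity and traffic figures are made up.

LINK_CAPACITY_MBPS = 1000
WINDOW_SECONDS = 300  # one five-minute polling interval

# Per-second offered load: a steady 200 Mbps baseline...
offered = [200.0] * WINDOW_SECONDS
# ...plus a 3-second burst well above line rate.
for second in (120, 121, 122):
    offered[second] = 2500.0

# What actually fits on the wire each second; the excess is discarded.
carried = [min(load, LINK_CAPACITY_MBPS) for load in offered]
discarded = [load - sent for load, sent in zip(offered, carried)]

average_utilization = sum(carried) / (WINDOW_SECONDS * LINK_CAPACITY_MBPS)
peak_utilization = max(carried) / LINK_CAPACITY_MBPS
seconds_with_discards = sum(1 for d in discarded if d > 0)

print(f"five-minute average utilization: {average_utilization:.1%}")
print(f"peak one-second utilization:     {peak_utilization:.1%}")
print(f"seconds with outbound discards:  {seconds_with_discards}")
```

In this toy model the average stays near 21 percent while the link is fully saturated and discarding traffic for three seconds, which is the pattern the averaged graphs concealed.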

Speaker 1

09:31

It's a scary thought.

Speaker 2

09:33

The solution, once identified with the right visibility, was a simple engineered delay to stagger user updates, drastically reducing discards. ThousandEyes would have instantly shown the packet loss and the application impact, avoiding a multimillion-dollar headache.

Speaker 1

09:48

That story really highlights how easy it is to miss critical issues with traditional monitoring. What's another common but often misdiagnosed problem that ThousandEyes helps untangle?

Speaker 2

09:57

Here's another fascinating one: a severity one incident where a company's largest customers couldn't log into a critical application. High priority, obviously. The load balancer appeared overloaded, yet it was accepting other sessions from smaller customers, so confusing signals. Weird. The real mystery deepened when they found one specific web server rejecting new connections while all the others were performing fine.

Speaker 1

10:20

Okay, so one bad server?

Speaker 2

10:23

Sort of. The core issue was a combination of network address translation, NAT, where many users appear as one single IP to the outside world, and a specific Layer 3 load balancing algorithm.

Speaker 1

10:36

How did those interact?

Speaker 2

10:38

Well, because the load balancer was using that single source IP from the NAT device to distribute traffic, all the thousands of users from one big customer, who were all NATed behind that same single IP, were always routed to the same back-end server.

Speaker 1

10:52

Oh, so one server got hammered by the biggest customer.

Speaker 2

10:55

So, you see, it was like a single massive customer group always hitting the same small checkout counter, completely overwhelming its TCP session limit.

Speaker 1

11:01

While the other counters were fine.

Speaker 2

11:03

Exactly. ThousandEyes, by monitoring both the load balancer and the individual servers, would have shown connection failures specifically to that particular server long before customer escalation. It guides the team directly to the source.
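
The interaction between NAT and source-IP load balancing can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the hash function, server names, IP addresses, and session counts are all hypothetical, and the real load balancer's algorithm is not described in the episode beyond its use of the source IP.

```python
import hashlib
from collections import Counter

BACKENDS = ["web-1", "web-2", "web-3", "web-4"]  # hypothetical server names


def pick_backend(source_ip: str) -> str:
    """Choose a backend from a hash of the client's source IP only."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:4], "big") % len(BACKENDS)]


# 5,000 users of one large customer, all translated to a single public IP.
big_customer = ["203.0.113.10"] * 5000
# 200 small customers, each arriving from its own address.
small_customers = [f"198.51.100.{i}" for i in range(1, 201)]

sessions = Counter(pick_backend(ip) for ip in big_customer + small_customers)
for backend in BACKENDS:
    print(backend, sessions[backend])
```

Whichever server the shared address hashes to receives all 5,000 of the large customer's sessions, while the small customers spread out across the pool. Balancing on additional fields, such as the source port, is one common way to avoid this kind of pinning.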

Speaker 1

11:16

Wow, that's such a subtle interaction to track down. And then there's the unforgettable Hurricane Katrina scenario. This one sounds dramatic.

Speaker 2

11:23

It was. Users in Gulf Coast states suddenly started receiving half a page of HTML when trying to access a critical fuel purchase system. Just half the page. Bizarre, totally bizarre. And users elsewhere were perfectly fine. Everyone initially assumed it was the storm, or just the entire network collapsing down there.

Speaker 1

11:43

Understandable assumption.

Speaker 2

11:45

Indeed. The initial trigger was a storm-induced routing change, forcing traffic through a tunnel. This tunnel, crucially, reduced the maximum segment size, MSS, essentially making the allowed packet size smaller along that path.

Speaker 1

11:57

Okay, so smaller packets allowed.

Speaker 2

12:00

The root cause: a bug in an unpatched firewall. When pages came through that were larger than this new, smaller MSS, the firewall would prematurely cut them off, sending a FIN packet.

Speaker 1

12:10

Like tearing a page in half mid-print.

Speaker 2

12:13

Exactly like that, so users got only half the HTML. It makes you wonder how many subtle configuration mismatches or bugs like that hide in our networks, only to be exposed by a major event. ThousandEyes path visualization would clearly show the MTU or MSS issues and packet drops occurring at the firewall, turning weeks of painstaking packet capture analysis into mere hours. A much faster resolution.
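
To make the MSS reduction concrete, the arithmetic can be sketched as follows. The episode does not say which tunnel type was involved, so GRE over IPv4 is assumed here purely as an example, along with a standard 1500-byte Ethernet MTU.

```python
# Illustrative arithmetic; the tunnel type is an assumption (GRE over IPv4),
# since the source only says traffic was forced through "a tunnel".

IPV4_HEADER = 20   # bytes, without options
TCP_HEADER = 20    # bytes, without options
GRE_OVERHEAD = 24  # 20-byte outer IPv4 header + 4-byte basic GRE header


def tcp_mss(path_mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return path_mtu - IPV4_HEADER - TCP_HEADER


normal_mss = tcp_mss(1500)                 # standard Ethernet MTU
tunnel_mss = tcp_mss(1500 - GRE_OVERHEAD)  # same link, traffic inside a tunnel

print("MSS on the normal path:", normal_mss)
print("MSS through the tunnel:", tunnel_mss)
```

A couple of dozen bytes of difference is easy to overlook, yet it is enough to trigger any device on the path that mishandles segments larger than the reduced size.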

Speaker 1

12:38

Just incredible granularity. Beyond these powerful tests and troubleshooting capabilities, you mentioned integrations.

Speaker 2

12:45

That's right. ThousandEyes seamlessly integrates with a whole host of platforms you might already be using, like ServiceNow for incident management, Webex for collaboration insights, and AppDynamics for application performance correlation. Connecting the dots, exactly, creating a truly full-stack observable environment. And for monitoring the health of your own physical network devices, it leverages device monitoring with SNMP and protocols like CDP and LLDP to build a comprehensive network topology map.

Speaker 1

13:12

So it sees the devices and the paths between them.

Speaker 2

13:15

Precisely, giving you visibility not just into the paths, but into your underlying infrastructure health as well. Ultimately, ThousandEyes gives you unparalleled visibility and control over a digital landscape that was once, for many, a black box. It shifts troubleshooting from reactive guesswork, you know, finger-pointing...

Speaker 1

13:34

Oh yeah, the war room finger-pointing.

Speaker 2

13:36

Right, to proactive, data-driven insight. And that impacts everything from operational efficiency and reducing mean time to resolution to protecting your business revenue and brand reputation. It empowers teams to speak a common language about performance, backed by shared data.

Speaker 1

13:50

You've now taken a deep dive into how Cisco ThousandEyes acts as your compass and map, helping you navigate the complex digital paths your services travel. It's a powerful tool for understanding not just what is happening, but why it's happening and precisely where to fix it fast. It seems like essential visibility these days. So as you reflect on this deep dive, consider this: in an increasingly interconnected world, where does your critical digital experience journey end and where

14:16

do your blind spots begin? What critical hidden issues could you discover if you truly had a Google Maps for the Internet for your own organization?

Transcript source: Provided by creator in RSS feed