Zabbix 5 IT Infrastructure Monitoring Cookbook: Explore the new features of Zabbix 5 for designing,

Speaker 1

00:00

Welcome back to the Deep Dive. We're here to give you that fast track, that real expertise on the industry's, well, most critical topics. That's the plan, and today we are really diving deep. Our mission: mastering modern IT infrastructure monitoring. Yeah, and our guide for this is some really comprehensive material compiled from the Zabbix 5 IT Infrastructure Monitoring Cookbook. Right, and for

Speaker 2

00:26

You listening in, the goal here is simple: a complete but, you know, fast understanding of Zabbix 5

Speaker 3

00:32

architecture, cutting through the noise. Exactly.

Speaker 2

00:34

We're focusing on the core structure, the advanced ways it grabs data and crucially how to scale it. Everything you really need for a robust setup.

Speaker 1

00:43

And the timing on this couldn't be better, really. Zabbix 5, it's an LTS release, long-term support. Companies bet their infrastructure health on this for years. So getting these fundamentals right, right now, is super critical.

Speaker 2

00:54

Absolutely.

Speaker 1

00:55

Okay, so let's unpack this a bit. What are the absolute foundational bits needed to just make Zabbix run, and maybe more importantly, what's that single essential metric, the one thing that tells you if your serious monitoring setup can actually handle the load you're throwing at it?

Speaker 2

01:11

Right. Architecturally, Zabbix needs three core components. They're constantly talking. Three main parts. Yeah, you've got the Zabbix server. That's the central brain doing the polling, the processing.

Speaker 1

01:21

Got it.

Speaker 3

01:21

The engine. The engine.

Speaker 2

01:23

Then the database, usually MariaDB, maybe Postgres. Well, that's the huge repository for all your time series data. It gets big.

Speaker 3

01:30

The data store. Exactly.

Speaker 2

01:31

And finally the Zabbix frontend. That's the web UI you interact with, served up by Apache or NGINX typically.

Speaker 1

01:41

So engine, data store, dashboard. Makes sense. Now, before we get into performance, where does a listener actually find all these metrics and settings we're about to talk about?

Speaker 2

01:50

Good point. That dashboard, the frontend, it's structured pretty clearly. When you log in, you see these main categories: Monitoring, Inventory, Reports, Configuration, and Administration. Okay, all the key stuff we'll discuss today, it fits neatly into one of those five areas. Makes it easier to navigate. Now, that critical metric you asked about, the one that governs scale, that's

Speaker 1

02:11

NVPS. NVPS, new values per second. Why is that number so incredibly important in a high-volume system?

Speaker 2

02:20

Think of NVPS as the ingestion rate. It's the average number of new data points, new readings from your items, that the Zabbix server is either receiving or requesting every single

Speaker 3

02:31

second. Per second, wow. Per second.

Speaker 2

02:33

It is absolutely the single best measure of the load on your server, both right now and what you can expect.
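As a rough illustration of the arithmetic behind NVPS, here is a minimal Python sketch. It assumes each item contributes one new value per update interval; the item counts and intervals are hypothetical numbers chosen for the example, and the helper name `estimate_nvps` is not part of Zabbix.

```python
def estimate_nvps(item_groups):
    """Estimate new values per second.

    item_groups: iterable of (item_count, update_interval_seconds) pairs.
    Each item is assumed to deliver exactly one value per interval.
    """
    return sum(count / interval for count, interval in item_groups)


# Hypothetical environment: 3000 items polled every 60 s, 500 items every 10 s.
groups = [(3000, 60), (500, 10)]
print(f"Estimated NVPS: {estimate_nvps(groups):.1f}")
```

The estimate is only a planning aid; the figure that matters in practice is the one the running server reports for itself.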

Speaker 1

02:39

So it predicts the future almost.

Speaker 3

02:41

In a way.

Speaker 2

02:41

Yes, if you see that NVPS number climbing steadily, staying high, that's your definitive signal. It tells you exactly when and how much you need to beef up your server's CPU and memory. Right, you ignore NVPS at your own peril.

Speaker 1

02:53

Basically. Okay, message received. If NVPS is the heartbeat, let's talk about what we're actually feeding this engine. Monitoring isn't just about pings anymore. You need custom data, sometimes asynchronous stuff. How does Zabbix 5 give you that flexibility beyond just basic checks?

Speaker 2

03:08

Well, it starts with the agent. Simple checks, like, you know, is SSH port 22 open? Those are still there, sure. But

Speaker 1

03:14

the big shift is to Zabbix agent 2. It's written in

Speaker 3

03:17

Golang. Ah, Go. Okay, and that

Speaker 1

03:20

matters because Go natively handles concurrency and asynchronous tasks much better. The result: a much lighter footprint on resources.

Speaker 2

03:29

Which is huge in containerized setups or just large virtual

Speaker 3

03:32

environments. Exactly. Invaluable.

Speaker 2

03:34

So that's the tech choice explained. But for really custom, bespoke data, the sources talk a lot about this trapper mechanism. If agent 2 is so advanced, why do we need this separate trapper thing? Isn't that just adding complexity?

Speaker 1

03:47

It's a fair question. They serve slightly different roles, though. The agent typically polls on a set schedule, right? Check this every 60 seconds. The trapper is for asynchronous data, data that maybe comes from some external script on the host, and you don't know exactly when it'll finish.

Speaker 2

04:02

Ah, okay, like a long-running job.

Speaker 1

04:04

Precisely. So you set up a Zabbix trapper item on the server side. It just waits.

Speaker 2

04:08

Then on the host being monitored, you use the Zabbix sender utility. That utility actively pushes the result, maybe from your complex Python script or cron job, straight to the server when the result is ready.

Speaker 1

04:21

So the server isn't asking, it's just listening for that specific data to.

Speaker 2

04:25

arrive. Exactly. Passive listening on the server, active sending from the host.
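To make the push side concrete, here is a small Python sketch that builds a `zabbix_sender` command line, the kind of call a long-running job could make when it finishes. The server name, host name, and item key are hypothetical; the key would have to match a trapper item already configured on the server. The sketch only assembles and prints the command rather than running it.

```python
import shlex


def build_sender_command(server, host, key, value):
    """Assemble a zabbix_sender invocation as an argument list.

    -z: Zabbix server address, -s: host name as configured in Zabbix,
    -k: item key of the trapper item, -o: the value to push.
    """
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", str(value)]


# Hypothetical names: a backup job reporting how long it took, in seconds.
cmd = build_sender_command("zabbix.example.com", "app01", "backup.duration", 5421)
print(shlex.join(cmd))
# A real job might pass cmd to subprocess.run(cmd, check=True) once its result is ready.
```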

Speaker 1

04:29

That makes sense. It's the gateway for data that doesn't fit the regular polling schedule. Now here's where it seems to get really powerful, though. Getting raw data is one thing, making it useful is another. Yeah, tell us about this preprocessing layer.

Speaker 2

04:42

Oh, preprocessing is an absolute game changer. It's data manipulation that happens before the value even lands

Speaker 3

04:48

in the database. Before storage. Okay. Yes, and

Speaker 2

04:50

its main goal is efficiency. One way is using things like dependent

Speaker 3

04:55

items. Dependent items.

Speaker 2

04:56

Think of it like this. You make one single request to get, say, a big chunk of status info: a master item check. Right. Then dependent items let you pull five specific metrics out of that single chunk of data, all for the cost of just one network call, one interval check.

Speaker 1

05:13

Ah. So instead of five separate checks for CPU, RAM, disk, uptime, you do one big check and peel off the bits

Speaker 2

05:21

you need. Precisely. It saves a massive amount of overhead on the network and the agent.
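The one-fetch, many-metrics idea can be sketched in a few lines of Python. This is an illustration of the concept only, not how Zabbix implements it: the JSON payload, field names, and item keys below are all made up for the example.

```python
import json

# Hypothetical master item value: one status payload fetched in a single check.
MASTER_VALUE = json.dumps({
    "cpu_load": 0.42,
    "mem_free_mb": 2048,
    "disk_used_pct": 71,
    "uptime_s": 86400,
    "procs": 212,
})

# Hypothetical dependent items: item key -> field to extract from the master value.
DEPENDENT_ITEMS = {
    "cpu.load": "cpu_load",
    "mem.free": "mem_free_mb",
    "disk.used": "disk_used_pct",
    "sys.uptime": "uptime_s",
    "proc.count": "procs",
}


def fan_out(master_value, dependents):
    """Parse the master payload once, then derive every dependent metric from it."""
    parsed = json.loads(master_value)
    return {item_key: parsed[field] for item_key, field in dependents.items()}


for key, value in fan_out(MASTER_VALUE, DEPENDENT_ITEMS).items():
    print(key, value)
```

Five metrics come out, but the monitored host was only asked once.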

Speaker 1

05:25

Okay, that's clever.

Speaker 2

05:26

And often that one big status check, it dumps out just messy, unstructured text. Think of the raw output from, like, ifconfig on Linux, run via system.

Speaker 1

05:36

run. Yeah, tons of text.

Speaker 2

05:37

This is where what you called the regex hack comes in. You use preprocessing with a regular expression, regex, to

Speaker 3

05:43

scan that raw text, find the pattern,

Speaker 2

05:45

and cleanly extract just the single number you care about, like total RX bytes for interface ens192. You turn that text chaos into a clean, usable number right at the door.
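Here is a minimal Python sketch of that extraction step. The sample text is an illustrative imitation of ifconfig output, not a capture from a real host, and the pattern is one plausible choice; a regular expression preprocessing step in Zabbix would use a similar pattern with a capture group for the output.

```python
import re

# Illustrative sample only; real ifconfig output varies by distribution and version.
RAW_OUTPUT = """\
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.0.2.10  netmask 255.255.255.0  broadcast 192.0.2.255
        RX packets 120345  bytes 98765432 (94.2 MiB)
        TX packets 80211  bytes 12345678 (11.8 MiB)
"""

# Capture group 1 is the RX byte counter.
RX_BYTES_PATTERN = re.compile(r"RX packets \d+\s+bytes (\d+)")


def extract_rx_bytes(text):
    """Return the RX byte count as an int, or None if the pattern is absent."""
    match = RX_BYTES_PATTERN.search(text)
    return int(match.group(1)) if match else None


print(extract_rx_bytes(RAW_OUTPUT))
```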

Speaker 1

05:55

That's fantastic. It shifts the cleanup work away from the network into the server's processing, where it's more efficient. I think you remember another database tip related to this, something about duplicates.

Speaker 2

06:05

Yes, exactly. For values that don't change often, maybe a server serial number or a firmware version, that kind of thing that's static, mostly, right, you use the Discard unchanged preprocessing step. If Zabbix gets the same value 100 times in a row, it won't write 100 identical entries into the database.

Speaker 1

06:22

Ah, so it just stores the first one and ignores the rest until it changes.

Speaker 3

06:26

Correct.
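The effect of discarding unchanged values can be sketched as a simple filter in Python. This mirrors the behavior described in the conversation, storing a value only when it differs from the previous one; it is an illustration, not Zabbix's own code, and the version strings are made up.

```python
def discard_unchanged(values):
    """Keep a value only if it differs from the one received just before it."""
    stored = []
    last = object()  # sentinel that never equals a real reading
    for value in values:
        if value != last:
            stored.append(value)
            last = value
    return stored


# Hypothetical readings: the same firmware version 100 times, then an upgrade.
readings = ["1.0.3"] * 100 + ["1.0.4"] * 2
print(discard_unchanged(readings))
```

Out of 102 readings, only two rows would reach the database.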

Speaker 2

06:26

This prevents just pointless bloating of your database over time. Super important for long-term

Speaker 1

06:31

health. Definitely. Okay, so we've collected and cleaned the data. Now we need to act on it. Alerting. We've all been there: the flapping alert goes critical, then OK, then critical, 50 times an hour, and floods your inbox. How does Zabbix help stop that alert fatigue nightmare?

Speaker 2

06:47

The cure for that horror show is the recovery expression. Setting a simple trigger is easy: CPU usage above 50 percent, trigger an alert.

Speaker 1

06:55

Right, and the naive setup recovers as soon as it hits 49 percent.

Speaker 2

06:59

Exactly, which leads to flapping. The recovery expression demands stability. You define a separate condition for the alert to clear: it has to move significantly away from the problem threshold.

Speaker 1

07:09

So, like, trigger at 50 percent, but only recover when it's back down to, say, 40 percent.

Speaker 2

07:13

Precisely that. The system has to prove it's stable and well clear of the danger zone before the alarm goes silent.
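The 50/40 example above is a hysteresis rule, and a short Python sketch shows why it stops flapping. The thresholds come from the conversation; the function name and the sample readings are hypothetical.

```python
def next_state(in_problem, cpu_pct, problem_above=50.0, recover_below=40.0):
    """Return the new problem state for one CPU reading.

    The alert fires above problem_above and clears only below recover_below,
    so readings between the two thresholds leave the state unchanged.
    """
    if not in_problem:
        return cpu_pct > problem_above
    return not (cpu_pct < recover_below)


readings = [45, 51, 49, 52, 48, 39, 45]  # hypothetical CPU percentages
state = False
history = []
for reading in readings:
    state = next_state(state, reading)
    history.append(state)
print(history)
```

With a single 50 percent threshold, the readings 51, 49, 52, 48 would fire and clear twice; with the separate recovery condition the alert stays open until the value drops to 39.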

Speaker 1

07:19

That makes the alerts actually meaningful again.

Speaker 2

07:22

Nice. Absolutely. And organizationally, you need structure too. Use tags from day one. Add service: SSH or application: billing to every related trigger so you

Speaker 1

07:31

can filter and route alerts properly.

Speaker 2

07:33

Yes, and customize your severity levels. Don't stick with the defaults like Disaster, High, Warning. If your company uses P1, P2, P3, change the names in Zabbix so the alerts immediately make sense in your team's context.

Speaker 1

07:46

Good practical tips. Now let's talk scaling, the actual architecture. NVPS tells us the load is high. But for medium, large, geographically dispersed environments, what's the component that handles spreading out that monitoring work?

Speaker 2

08:01

That's where Zabbix proxies come in. They are absolutely essential for offloading work from the central Zabbix server.

Speaker 1

08:06

How do they do that?

Speaker 2

08:07

A proxy sits closer to the devices it monitors, maybe in a remote data center or a specific network segment. It collects data locally, holds on to it, maybe does some preprocessing.

Speaker 1

08:17

Preprocessing can happen on the proxy too?

Speaker 2

08:19

Yes, some of it. Then it compresses that data and forwards it efficiently back to the main Zabbix server. It reduces the load on the central server significantly.

Speaker 1

08:28

Okay. Now, when deploying these proxies, you've got firewalls, network segmentation. Are there different types, performance trade-offs?

Speaker 3

08:35

Yeah?

Speaker 2

08:36

There are two main modes, passive and active proxies. We strongly, strongly recommend active proxies. Why active? Two big reasons. First, they're generally faster because they proactively push their collected data to the server whenever they have new stuff. They don't wait to be asked.

Speaker 1

08:51

Okay, less latency, right.

Speaker 2

08:53

But second, and often more importantly for network teams, an active proxy only needs one outbound connection, initiated from the proxy to the server.

Speaker 3

09:02

Ah, so the proxy calls home. Exactly.

Speaker 2

09:04

Compare that to passive proxies, where the central server has to be able to initiate connections to potentially dozens or hundreds of proxies. Firewall rule nightmare.

Speaker 1

09:14

Definitely. Simplifying firewall management is a massive win in big companies. Okay, sticking with operational wins, yeah, let's hit the biggest maintenance headache for any monitoring system over time: the database. How do we stop it from becoming this unmanageable beast? Yeah.

Speaker 2

09:31

The database is always the challenge long term. The bottleneck comes from the default Zabbix process called the housekeeper.

Speaker 1

09:37

Housekeeper. Sounds helpful. Well,

Speaker 2

09:39

it tries to be. When data gets old, say your history retention is 90 days, the housekeeper is responsible for deleting data older than that.

Speaker 3

09:46

Okay, seems necessary. But it deletes

Speaker 2

09:48

that data row by row. Imagine millions, maybe billions of rows. As your database grows into terabytes, this row-by-row deletion just consumes immense I/O and CPU. Yeah, it can bring your server performance to its knees during cleanup. Ouch.

Speaker 1

10:04

So relying on the built-in cleaner eventually just grinds everything to a halt. What's the better way? The advanced solution?

Speaker 2

10:10

You absolutely need to move to database-native methods. For MySQL, that's MySQL partitioning. For PostgreSQL, it's leveraging the TimescaleDB extension, which brings time series superpowers to

Speaker 1

10:19

Postgres. Partitioning or TimescaleDB. How do they avoid the row-by-row problem?

Speaker 2

10:25

They work with time-based chunks. Instead of deleting individual rows, the database is structured so you can just drop an entire old partition, say, delete all data from March, instantly.

Speaker 1

10:37

Like throwing away a whole filing cabinet drawer instead of shredding each paper inside.

Speaker 2

10:41

Exactly that analogy. It's incredibly efficient. The cleanup becomes almost instantaneous, freeing up massive resources.
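A partition-based cleanup can be sketched in Python as: work out which monthly partitions fall outside the retention window, then emit one `DROP PARTITION` statement per partition. This assumes MySQL range partitioning by month and a hypothetical naming scheme of `pYYYY_MM`; the table name, dates, and retention period are example values, and the sketch only prints the SQL rather than connecting to a database.

```python
from datetime import date


def expired_partitions(partition_months, today, keep_months):
    """Return the (year, month) partitions older than the retention window.

    The current month plus the keep_months months before it are retained.
    """
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    return [(y, m) for (y, m) in partition_months if y * 12 + (m - 1) < cutoff]


def drop_statements(table, months):
    """Build one ALTER TABLE ... DROP PARTITION statement per expired month."""
    return [f"ALTER TABLE {table} DROP PARTITION p{y}_{m:02d};" for y, m in months]


# Hypothetical state: six monthly partitions, three months of retention.
partitions = [(2021, m) for m in range(1, 7)]
old = expired_partitions(partitions, today=date(2021, 6, 15), keep_months=3)
for statement in drop_statements("history", old):
    print(statement)
```

Each statement removes a whole month in one metadata operation, which is the filing-cabinet-drawer effect described above.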

Speaker 1

10:48

That sounds like a no-brainer from a performance perspective. But does switching to partitioning have any functional trade-offs for the user configuring Zabbix? It does.

Speaker 2

10:58

Yeah, that's the key planning point. When you implement native partitioning, your history and trend data retention settings often become global database parameters. You typically lose the ability to set, say, seven days of history for this item but 90 days for that item.

Speaker 1

11:12

Ah. So you gain huge efficiency, but lose some of that fine-grained control over retention per item.

Speaker 2

11:18

That's the main trade off. You need to plan your retention strategy more globally.

Speaker 1

11:22

Got it. Important consideration. Okay, last big area: hybrid cloud. Lots of critical metrics now live in places like AWS CloudWatch or Azure Monitor. How does Zabbix stay relevant? How does it pull data that's behind these proprietary cloud CLIs?

Speaker 2

11:39

Zabbix uses a really powerful combo here: Zabbix agent user parameters plus the cloud provider's own CLI tools.

Speaker 1

11:45

Okay, so you install the AWS CLI or the Azure CLI where?

Speaker 2

11:49

Right onto the same machine where the Zabbix agent 2 is running. Could be an EC2 instance, could be an on-prem machine that needs to query the cloud. Okay, then you configure a user parameter within the Zabbix agent configuration. This user parameter basically just tells the agent: run this specific AWS CLI command.

Speaker 1

12:06

The agent acts like a secure little proxy to run the cloud command locally. Precisely.

Speaker 2

12:10

The agent executes the command, gets the output, maybe your SQS queue depth or CPU utilization from CloudWatch, and then it injects that value straight into Zabbix, like any other metric

Speaker 1

12:20

it collected. Very flexible. Basically, if you can script it on the command line, Zabbix can monitor it.
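As a sketch of the script side of that setup, the Python below parses the JSON that the AWS CLI returns for an SQS queue-depth query and reduces it to a single number. The sample JSON stands in for real CLI output so the example runs without AWS credentials; the user parameter key and script path mentioned in the comments are hypothetical.

```python
import json

# Stand-in for the output of:
#   aws sqs get-queue-attributes --queue-url <url> \
#       --attribute-names ApproximateNumberOfMessages
SAMPLE_CLI_OUTPUT = '{"Attributes": {"ApproximateNumberOfMessages": "17"}}'


def queue_depth(cli_json):
    """Reduce the CLI's JSON response to the single integer Zabbix should store."""
    attributes = json.loads(cli_json)["Attributes"]
    return int(attributes["ApproximateNumberOfMessages"])


# In a real deployment, an agent config line such as
#   UserParameter=aws.sqs.depth[*],/usr/local/bin/sqs_depth.py $1
# (hypothetical key and path) would run a script that calls the CLI
# and prints this value for the agent to collect.
print(queue_depth(SAMPLE_CLI_OUTPUT))
```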

Speaker 2

12:24

That's the power of it. And for containers, agent 2 makes it even easier. It has native plugins for things like Docker monitoring built right in, again thanks to that Go architecture.

Speaker 1

12:33

Okay, pulling it all together, then, what does this mean for you, the listener? We've seen Zabbix 5 is, well, it's clearly robust, super flexible, scalable, designed for way more than just simple pings. It handles complex enterprise-level stuff.

Speaker 3

12:47

Yeah.

Speaker 2

12:48

I think the key takeaway is that real control, real scalability, it requires planning up front, especially, especially around the database. You have to decide on partitioning or TimescaleDB early. Don't wait till it hurts. Don't wait till it hurts. And getting good at that data extraction using preprocessing, that's non-negotiable if you want to keep your database clean and your metrics really trustworthy.

Speaker 1

13:09

Fantastic summary. Now, building on that idea of platform control: we talked about how agent 2 can execute commands to pull data. Yeah, but what about controlling Zabbix itself? Here's a final thought. Even a user who's locked down to the basic Zabbix User

Speaker 2

13:23

role. Right, the most basic view-only type role, usually.

Speaker 1

13:27

Exactly. They can maybe only see metrics for their own couple of hosts. Even that user can potentially execute powerful administrative actions, things like enabling or disabling hosts on maps or scheduling system maintenance periods. They can trigger these right from the Zabbix frontend.

Speaker 2

13:44

Wait, hang on. A basic user doing admin tasks? How? That sounds like a massive security hole if they don't have the actual permissions.

Speaker 1

13:52

It sounds like it, but it's actually the ultimate delegation power of the Zabbix API. The trick is the script they trigger from the frontend: it isn't running with their lowly user permissions. Instead, that custom script you've set up is configured behind the scenes to use the API credentials of a different user, one who does have administrative privileges.

Speaker 2

14:11

So the low-level user clicks a button, but the action is performed via the API using a high-privilege token or user configured in that script.

Speaker 1

14:19

Exactly. It lets you safely delegate very specific, complex administrative workflows like putting a server into maintenance for exactly two hours starting now, to users who absolutely should not have general admin rights to the whole system.
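To make the "maintenance for exactly two hours starting now" example concrete, here is a Python sketch that builds the JSON-RPC request such a script might send. The field names follow my understanding of the Zabbix 5.0 `maintenance.create` method and should be checked against the API documentation for your version; the token and host ID are placeholders, and the sketch builds the payload without sending it.

```python
import json
import time


def maintenance_request(api_token, host_id, start_ts, duration_s, request_id=1):
    """Build a JSON-RPC payload for a one-time maintenance window on one host.

    api_token is assumed to belong to a privileged API user configured in the
    script, not to the person who clicked the button in the frontend.
    """
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": f"Ad hoc maintenance for host {host_id}",
            "active_since": start_ts,
            "active_till": start_ts + duration_s,
            "hostids": [host_id],
            # timeperiod_type 0 is assumed here to mean a one-time period.
            "timeperiods": [
                {"timeperiod_type": 0, "start_date": start_ts, "period": duration_s}
            ],
        },
        "auth": api_token,
        "id": request_id,
    }


# Placeholder token and host ID; two hours starting now.
payload = maintenance_request("PLACEHOLDER_TOKEN", "10084", int(time.time()), 2 * 3600)
print(json.dumps(payload, indent=2))
```

Because the script fixes the method, the duration, and the credentials, the user who triggers it can do exactly this one thing and nothing else.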

Speaker 3

14:31

Wow.

Speaker 2

14:32

Okay, that is powerful. The API as a controlled delegation mechanism.

Speaker 1

14:37

That's the raw, potent, and highly customizable power hidden within the Zabbix API. Food for thought. Definitely. Thanks for diving deep with us today. We really hope you feel better equipped now to tackle your infrastructure monitoring challenges using Zabbix 5.

Transcript source: Provided by creator in RSS feed.