The Netdata Impact: Transforming Real-Time Monitoring and Observability in Modern Tech Operations

00:05

What's going on, everybody? Welcome back to another episode of Adventures in DevOps. I'm your host, Will Button, but I'm flying solo today. Warren is at the DevOps conference in Zurich, so he'll be back with us next week. But meanwhile, joining me in the studio, I have the founder and CEO of Netdata, Costa Tsaousis. Costa, welcome, and I did butcher your name, didn't I? Yes. It's fine, man. Well, welcome to the show. Thank you for joining me here. Thank you very much for inviting me.

00:47

It's very nice to be here. Yeah. So you're the founder and CEO of Netdata, a monitoring solution for simplifying and modernizing infrastructure observability, which is a huge task. You know, as we were talking before the show, we've been doing this for a while, and so we've learned a lot of lessons in the last few decades about doing this. So can you give our listeners a little bit about your

01:19

background and how you found yourself at this point in life? Yes. So that's a funny story, because, you know, Netdata is a monitoring solution that, let's say, happened by accident. I never wanted to build a monitoring solution. Even when I had started building it, my intention was not to build a monitoring solution. So the idea, guys, is the following: I was migrating some infrastructure from on-prem to cloud.

01:53

We had several problems. It was very early in the cloud industry. Let's say, after spending quite a big budget, actually, and building a large team of skilled consultants and advisors and the like, and with the help of the cloud provider, six months passed with no outcome. Problems were still there.

02:20

In the early days, I don't know if you remember this, we were talking about how the cloud is a little bit alive and behaves a little bit differently, and all these kinds of discussions, which, in my understanding, are a little bit of garbage at the end of the day. But anyway, after spending quite some time there, I found myself, you know, it was very painful. It was a

02:46

fintech company. We were doing transactions, payments on the POS, et cetera, you know, cards, and we had many retail chains that we were serving, and the queue, the people that were waiting in line to actually finish the transactions, pay for the goods and go home, was going around the block. It was stressful times. So by that time I started thinking,

03:19

come on, what is wrong? Why can't we find what's happening? Why are monitoring systems like this? I had the impression that everything that they had built, you know, all the dashboards, all the tools, everything there was just something to make me feel happy. It didn't provide any value. So initially this is how I started. I was so pissed off that I started building a tool to consolidate all the consoles. So what I wanted was not to build a monitoring solution. I said, okay, we have metrics and

03:59

data all over the place, let's build a tool to aggregate everything. That was the goal. So instead of having people being on the console of the database and the console of the systems, and the console of this or that, aggregate everything into one environment that replaces the consoles. So the monitoring tool replaces the consoles. Then in order to do that, I set

04:25

a few goals. The first is, come on, I need the same fidelity: if the consoles are per second, I want this thing to be per second; if the consoles have, I don't know, ten thousand metrics, I want ten thousand metrics. So whatever the consoles do, no discounts at all. Once I started building this, you know, the first generation of Netdata was born, and actually I did the same thing

04:49

that the console tools do in many cases. So if you have a freeze, or a point is missing, a sample is missing, and something cannot be collected, then I have a gap. It's not something that gets smoothed out. Today's monitoring systems, most of them, smooth it out. Yeah. But in Netdata, from the first day, this was a gap: I failed to collect that thing at that time. So the idea is that I worked alone, weekends and nights, and, you know, I was a COO in the day and

05:30

an open source maintainer at night. So after building this for a couple of years, I released it, and the people loved it, of course, mainly because it gave them all the fidelity that they were missing from monitoring tools, all the information. It's also fully automated, so dashboards come up by themselves. Everything happens by itself, with the auto-discovery of metrics, everything. So no moving parts: the plugins, the database, everything is in there.

06:03

So once I released it, it skyrocketed immediately, probably one of the fastest growing products on GitHub. It got ten thousand GitHub stars in two weeks or something like that. Oh wow. So once I saw this, I said, okay, you know, initially you feel proud, but at the same time you feel the responsibility. You say, well, now I have built something that is installed on thousands of systems around the globe. I hope I didn't mess it up

06:35

somehow. So you feel the responsibility. So I started getting, you know, ideas, and new people came in and they were contributing, and all this kind of work that happens in communities. That was amazing. So a couple of years later I decided to start a company. I said, okay, we have something here. This is not a toy anymore. This is something important that people use every day to actually monitor their

07:08

systems. So this is how Netdata was born. Of course I needed to find a plan, you know, because with funded companies you need somehow to make money and be in the market, have a go-to-market strategy, et cetera. So what I did then is that I decided that the most important thing to do is to keep this high-fidelity nature of Netdata: real time, high fidelity across the board, everywhere. How do you scale that thing? That's the tricky part. How

07:40

do you scale it? Most of the monitoring solutions centralize everything, even the commercial ones, Datadog and Dynatrace and New Relic, all the providers today, even the open source ones, Prometheus, et cetera. All of them centralize everything to one database and then query the database, or these databases, one for metrics, one for logs, et cetera. How do you scale it? The first thing is that I said, okay, we have something that

08:09

works at the edge and it's high fidelity. It collects everything, has a lot more information, amazing coverage of technologies, and it is real time. Even the latency is amazing: Netdata has one second of latency from data collection to visualization. So you press Enter on the console to make a change, and the next second the dashboard is showing the result. Oh wow. So I said, okay, how do we scale it? And then I said, okay, let's go distributed. So instead of

08:43

centralizing everything to one place that becomes huge and, after a while, overwhelmed, I said, okay, let's try a completely different approach. Let's have the data spread out across the infrastructure in little islands here and there, and let's figure out a way, if we can do it, to actually merge everything at query time. Can we have the data all over the infrastructure, and at the dashboard you have the feeling that this is one thing and you can see everything?
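
The query-time merge described here can be sketched roughly as follows. This is only an illustration under stated assumptions, not Netdata's implementation: the per-node stores, the `merge_at_query_time` name and the choice of summing values are all hypothetical. Each node keeps its own per-second samples, and the "single view" is computed only when a dashboard asks for it.

```python
from typing import Dict, List, Optional

# Hypothetical per-node store: unix second -> value (None means the sample was missed).
Series = Dict[int, Optional[float]]


def merge_at_query_time(nodes: Dict[str, Series], start: int, end: int) -> List[Optional[float]]:
    """Sum one metric across all nodes for every second in [start, end).

    Nothing is centralized ahead of time: each node's data stays where it was
    collected and is only combined when the query runs.
    """
    merged: List[Optional[float]] = []
    for ts in range(start, end):
        values = [series[ts] for series in nodes.values() if series.get(ts) is not None]
        # If no node has a sample for this second, keep the gap instead of smoothing it.
        merged.append(sum(values) if values else None)
    return merged


# Two "islands" of data, queried as if they were one database.
islands = {
    "node-a": {0: 1.0, 1: 2.0},
    "node-b": {0: 3.0, 2: None},
}
result = merge_at_query_time(islands, 0, 3)
```

In this sketch the third second stays empty because neither node collected a value for it, which mirrors the "a missed sample is a gap" behavior mentioned earlier in the conversation.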

09:22

So that was the idea, and this is what we implemented. We spent a few years, we implemented the thing. So today Netdata is a monitoring solution that you install and it auto-discovers. All the same stuff still exists today. So it's a monitoring solution where you don't need to configure anything, mainly because we don't cherry-pick information. On the centralized infrastructure monitoring systems you have to cherry-pick. You have to know beforehand what metrics,

09:50

which metrics you need, how frequently you need them. We eliminated this factor: let's have everything in high resolution. Again, the next goal was, okay, why configure anything? Since we can ingest everything, let's auto-discover everything. Let's just ingest everything by default. The next goal was, okay, since we now ingest everything, why go through the process of configuring dashboards metric by metric, chart by chart? Let's find a

10:22

way to create meaningful dashboards out of the box, by itself. So we attached metadata that allows Netdata to correlate the metrics at run time and present them in beautiful, meaningful dashboards. The same happened with alerts. Since we collect everything, we have everything. Instead of having, you know, the default threshold alerts that you have, if the aggregate goes above this, trigger an alert, instead of doing this, what we did is that, okay, can I

10:52

monitor component by component, bottom up, the entire infrastructure? So can we have alarms for a disk, alarms for a network interface, alarms for a container, alarms for a Postgres database, for an instance of Postgres, for an instance of NGINX? So today we ship about three hundred and fifty alerts that monitor components

11:15

of your infrastructure. Right. So the idea is that you install Netdata and suddenly, out of the box, in minutes, in seconds, you have a fully functional monitoring system that you didn't do anything to get, apart from installing Netdata. That is the beauty. Wow. Okay, we've got to pause there because I'm just trying to wrap my head around this. Hearing you say it, it just makes perfect sense. The

11:43

part that I'm struggling with is, this makes so much sense. Why am I three decades into my career and just now having this revelation? Because, you know, like you said, when you install a monitoring system, it's like getting grilled by an interrogator: What data do you want? What are the thresholds? What frequency? And you're like, man, I don't know. I just got here. When you had these answers. Yes, and you know, we went a lot further. So, for example, let's

12:20

assume that you use that. So you installed it, you have one hundred servers, two hundred, a thousand servers, you installed Netdata. Everything works by itself, you have alerts, you have dashboards. Everything works. Okay, now you go and see the dashboard. Wait a moment, you see the metrics for the first time. These are metrics you are not familiar with, right? Right. Can we make the charts easy to grasp, to digest at first sight? And what information do we need on a chart? So we

12:54

invented the NIDL framework. The NIDL framework is a little toolbar above every chart that allows you to understand where the data is coming from, which nodes, which instances, which dimensions, and what labels they have, and gives you statistics about the sources that contribute to the chart. Mm-hmm.
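
As a rough illustration of the kind of per-chart summary described here (nodes, instances, dimensions, labels, plus statistics about what contributes to the chart), consider the sketch below. The `Sample` type and the `contributions` helper are hypothetical names made up for this example; they are not Netdata's data model or API.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Iterable


@dataclass
class Sample:
    """One value feeding a chart, tagged with where it came from (hypothetical shape)."""
    node: str
    instance: str
    dimension: str
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


def contributions(samples: Iterable[Sample], key: Callable[[Sample], Hashable]) -> Dict[Hashable, float]:
    """Return each group's share of the chart total, in percent."""
    totals: Counter = Counter()
    for sample in samples:
        totals[key(sample)] += sample.value
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {group: round(100 * value / grand_total, 1) for group, value in totals.items()}


chart = [
    Sample("web1", "eth0", "received", 30.0, {"env": "prod"}),
    Sample("web1", "eth0", "sent", 10.0, {"env": "prod"}),
    Sample("web2", "eth0", "received", 60.0, {"env": "staging"}),
]
by_node = contributions(chart, lambda s: s.node)
by_dimension = contributions(chart, lambda s: s.dimension)
by_env = contributions(chart, lambda s: s.labels.get("env"))
```

The same helper answers "which nodes", "which dimensions" and "which labels" simply by changing the grouping key, which is the general idea behind slicing one chart several ways.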

13:15

Once you have this, then we started discussing: okay, now that we have really a lot of metrics and everything is automated and everything is visualized by default, can we come up, you know, it's the troubleshooting. We say that this is a troubleshooting tool, because we try to solve the problem of being efficient at troubleshooting time. So how it works

13:41

for most monitoring systems: you have an infrastructure and you face a problem. There is a dive in your sales or your users or whatever, you see a dive on a chart. Okay, what do you do next? You start speculating. Oh, probably it's the database. Let's go to the database; if you don't have charts, let's figure out how to build charts for that. Let's validate the assumption. Oh, it's not the database, it should be the network. I think we have a problem in the network.

14:13

Right. This is how it works. You're speculating all the time and you hope that your experience will help you pinpoint the right cause, because the monitoring itself cannot tell you. So what we did is that we added unsupervised machine learning to Netdata. So we train multiple machine learning models for every metric, multiple for every metric. And then Netdata is able to detect anomalies in real time based on the past of each metric,

14:48

so it is the past of each metric that trains the models. And then during data collection, it decides if the just-collected value is an anomaly or not. Okay. And we store this in the database. So together with every sample, we say, oh, this was an anomaly, or no, no,

15:05

this was not anomalous. Now the beauty of this is that we created a tool, we call it the Anomaly Advisor, where you highlight a spike or a dive or whatever is interesting, and we have built a scoring engine inside Netdata, so it goes over that time frame that you highlighted, across all the metrics, it doesn't matter how many metrics are there, and scores them based on their anomaly rate. So your aha moment, oh, the disk did that, is within the list, so you don't need to

15:41

speculate. You just go there, press a button, and Netdata comes up with a list. You see the list. You say, okay, if this happened, this happened, I know what this is, what happened here, how the database did that, or why the disk did this, or why the network did that. So the idea is to simplify troubleshooting tremendously, to allow users to be a lot more efficient in the resolution of problems. So not only do you have access to the data points from the metrics,

16:15

but you're also putting them in context. So when you look at the graph or the dashboard, you see the numbers, but you're providing context to say, is this a good number or is this a bad number? Mm-hmm. And in our visualization we have added an anomaly ribbon where you can see in real time what the anomaly detection,

16:41

the machine learning, does, how the machine learning detects anomalies in real time. What we found also is, when we had all this infrastructure that was training all this kind of stuff and detecting anomalies in real time, we realized that anomalies happen in clusters. So you go there and you see that there are anomalies

17:06

across nodes, happening in clusters within a node. So a lot of metrics get anomalous together when there is an anomaly within a node, but also a lot of nodes together, in a very short time, one after the other, get anomalous at a high percentage. We didn't know that. We realized this by reviewing the data, and then we built a tool

17:33

to actually allow people to review anomalies across the infrastructure. So now we have a chart, for example, that gives you a line for every node that you have, and it's the percentage of metrics being anomalous concurrently, so you can see the strength of the anomaly and the spread of the anomaly at the same time. So this is the

18:02

story. This is the short story of Netdata. Let's say we are trying to make observability a lot easier for people. To tell you the truth, when DevOps started, there was this little diagram from the consultants or the consulting firms. They were saying that, you know, DevOps is a joint between data science, software engineering, and IT infrastructure, and they were saying, at the three bubbles, at the point they

18:41

join, all three of them, you have DevOps. My understanding is that in theory this is okay, but in practice, to have an extremely good data scientist that is also a software engineer and knows about IT architecture and the depth of the IT technology that exists, come on, guys, this guy does not exist. I don't know, the world may have a couple of them, three of them. For sure, you cannot have one next to you. The idea is that monitoring needs to be simpler,

19:18

and it can be simpler. With Netdata, you don't need to learn a query language. You can filter, slice and dice the data. You know, it's like a cube. You can change the cube the way you see fit by point and click, and create

19:36

dashboards by drag and drop. So the idea is that, let's say, we try to bring the monitoring technology that the best organizations of this world have, so real time, per second, high resolution, machine learning everywhere, and bring it to everyone in a very simple, affordable package. Because Netdata also, mainly because of its distributed design,

20:08

is the most cost-efficient solution. Oh right, yeah, because you're not stuck a couple of years from now having to run these monster servers just to maintain the amount of data you've got. Actually, we use resources, compute resources, that are available and spare. It's your servers. They have two percent CPU to spare and, I don't know, two hundred megabytes of RAM. That's easy. And this is what we use. Two percent CPU of a single

20:37

core, three percent CPU of a single core. This is what Netdata needs in resources, and two hundred megabytes of RAM and, I don't know, one gigabyte or two gigabytes of disk. That's it. Wow. Not to mention that if you compare with commercial offerings, all of them require a tremendous amount of egress bandwidth. Netdata does not stream anywhere. It's inside there, so egress bandwidth will be used only when you view the dashboard. If

21:14

you don't view, there's nothing there, there's no egress. So that's the whole point. Simplify. Take something that is the best out there. We integrated technology into Netdata; for example, we use a variation of Gorilla compression, which Facebook developed. Facebook has developed a real-time, high-resolution monitoring database, a time-series database, and it's called Gorilla. We took the concept, we adapted it to Netdata, and now Gorilla compression is in Netdata.
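
For readers unfamiliar with Gorilla, the core trick for values is to XOR each sample's 64-bit floating point pattern with the previous one, so that repeated or slowly changing values leave very few bits to store. The sketch below shows only that idea. It is not Netdata's variation (the conversation does not describe it) and it omits the actual bit-packing; the function names are made up for this example.

```python
import struct
from typing import List, Sequence


def float_bits(x: float) -> int:
    """Return the IEEE 754 double-precision bit pattern of x as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_deltas(values: Sequence[float]) -> List[int]:
    """XOR each value's bit pattern with the previous value's bit pattern."""
    previous = float_bits(values[0])
    deltas = []
    for value in values[1:]:
        current = float_bits(value)
        deltas.append(previous ^ current)
        previous = current
    return deltas


def meaningful_bits(delta: int) -> int:
    """Count the bits between the first and last set bit; 0 means the value repeated."""
    if delta == 0:
        return 0
    trailing_zeros = (delta & -delta).bit_length() - 1
    return delta.bit_length() - trailing_zeros


deltas = xor_deltas([12.0, 12.0, 24.0])
sizes = [meaningful_bits(d) for d in deltas]
```

A repeated sample produces a delta of zero, and a nearby value produces a delta with only a short run of meaningful bits, which is why per-second metrics that change slowly can be stored in far fewer than eight bytes per sample.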

21:49

So that's what we do, across the board. We're trying to bring the best and give it to everyone. This is why we say that we democratize monitoring. So when you get to this level where you're bringing in this granular data from all across your infrastructure, how do you determine what to surface for the user? Because that seems like an avenue

22:18

where you can get to information overload really, really quickly. Yeah, so the idea is that we group everything into meaningful sections, so you go and it says, here are the network interfaces, the top information about network interfaces, packets and bandwidth, for example, and errors. But then there is everything else that you may need in order to explore what's happening. The same happens everywhere, so your database,

22:48

your server, even your tables. We go down to the index level. Now, with all this information, mainly because we have this scoring engine, when you don't know what to do, so you are on a dashboard and it has five hundred charts on it. Okay, what do I do now? Here the first thing is that there is a button that says, okay, identify for me the ones that are, currently, in the visible time frame, the

23:22

most anomalous, for example, to give you something to look at. You know, when you are looking for something, you don't know where to start, but it will identify the dashboard sections. Okay, here I have a twenty percent anomaly rate, this is bad, go look here. So the idea is, one, we've developed tools to help people with the information overload, let's say. But the most important thing is that there are people that are afraid of this information

23:56

overload. There are other people that enjoy the depth and the detail. What we hear from users is that when you use Netdata for some time to explore your infrastructure, let's assume that you're not troubleshooting anything, you just want to understand, it's like feeling the pulse and the breath of the infrastructure. You feel it because you see it every second. It's extremely high resolution, so you can understand what is really happening there. And I think

24:32

that this is the most important. Of course, it's a tool for people that want to learn. If someone just wants traffic lights, oh, it's healthy, it's not healthy, if someone wants just this, it's overwhelming for them. But if someone wants to learn, to dive, to understand, to fix the problem, this is where Netdata steps in. Wow,

25:00

that's wild, that's wild. I mean, you know, because I feel like, from my own personal perspective, we've been approaching monitoring and observability the same way for so long, and then this just completely flips all of that upside down. But that's the whole point, and I think this is the monitoring system that is missing today. So Prometheus, for example, is amazing for customizability. You can build whatever is imaginable. You imagine it, you

25:34

build it. Great. The big guys, Datadog, Dynatrace and the like, try to give you a helicopter view of the most important things. Although they value high resolution, they charge for high resolution; it is not the default. So I think that there was no tool to cover this area of real-time, deep-dive monitoring where you can go and see everything in high resolution and very detailed. To tell you the truth, the simplicity that we added to the tool is one thing that our

26:22

users love. So when I go and speak about Netdata, I see in many, I was at FOSDEM, for example, a month ago, and it was amazing, because you see people that are skeptical about observability. There are some that are at the point of observability denial. They have an argument. What they say is, come on, this is too complex,

26:52

too expensive for what I get. For sure. So with what we tried, the fact that we solved the cardinality and granularity problem and we can scale infinitely without it actually becoming a problem, this allowed us to become a lot simpler. So it does a lot more, but the thing is simple. You don't have to do anything, you just have to use it. So I think that this is the booming factor: it is higher resolution and at the same time extremely easy. You don't have to do anything.

27:30

It doesn't even require resources from you. Just give it the resources that you already have and that are spare. So I think this is the combination of things that makes Netdata so appealing to you. Yeah, for sure, especially the ease-of-use thing, because a lot of the other observability tools, and this is just my own personal experience with them, they rely on me too much, and they rely on me to know what questions to ask. And I'm like, man, if I knew a question

28:07

to ask, I probably wouldn't be asking. You wouldn't be, yeah. Yeah, you know, the interesting part is that some people, when they hear, for example, about a distributed monitoring solution where everything is at the edge, the first thing that they say is, wait a moment, man, this is going to be heavier than the other agents, right? Yeah. We did a comparison on our site. There is a blog post where we compared the agents of all the monitoring solutions that we could find.

28:41

Netdata is one of the lightest. So the core of Netdata is written in C. It's severely optimized to be performant. Of course we have plugins and the like that are high level and written in Go or whatever, but the core is extremely optimized to be very efficient. For example, we did a stress test against Prometheus, mainly because Prometheus is the

29:10

industry standard. So, same load: we gave them almost three million metrics, and we said per second, everything per second, and we said, let's see the resources that both systems need, Netdata and Prometheus. Netdata used one third less CPU, half the memory, ten percent less bandwidth, ninety-eight percent less disk I/O, the disk was idle almost all the time with Netdata, and seventy,

29:44

it managed to fit seventy-five percent more data into the same storage. So, for example, Prometheus has either uncompressed data on disk at two point something, two point ten if I remember correctly, bytes per sample on disk, or compressed with Gorilla, which goes to one point three bytes per sample. Netdata has zero point five, half a byte per sample on disk. It's extremely efficient. This is why we did it in C

30:23

to make it extremely efficient. Yeah, for sure, for sure. And you know, Netdata traditionally was only metrics. We had built our own logs management thing, a logs database that was integrated into Netdata. But then last year I said, wait a moment, I think we're doing it wrong. I realized that all of us already have a system, the systemd journal. The systemd journal is part of systemd, for the logs, et cetera. This thing is amazing. Probably we

31:00

don't know it, but it is amazing. All the logs management systems suffer from cardinality: how many streams of data are there, right? Right. The systemd journal does not care. So every log line can have its own fields and its own values on these fields, and all of them will be indexed, all of them. So cardinality is not a problem. Full indexing on everything. So even if you centralize, you put a lot of logs, web server logs, for example, into it, the ingestion process,

31:41

et cetera, are extremely optimized for the systemd journal. The only problem that it has is that it needs disk space, so it requires more disk space than the rest, but it does more with the disk space, and at the same time it is secure. It's the only logs management solution that has fault tolerance and high availability and even forward secure sealing,

32:08

so it ensures that the logs cannot be tampered with. So the way we managed to use the systemd journal is exactly the same way as with Netdata: we use it in a distributed way. You don't need to centralize your logs. The whole world tries to build a logs processing pipeline, and they try to install real-time tools to do analysis and the like, and all this kind of stuff, you know, in order to get insights from the logs. And what we said is, why do we need the pipeline

32:46

in the first place? We have an agent next to the logs that can query the logs in real time, extract all the information needed and report it. So why move the logs from there? Let them be there. Yes, that's the same approach as you were talking about with the

33:07

metrics collection. You can just take a little bit of the CPU and the memory that's not being used on a system and just do the work there, rather than the transport method of shipping it off someplace and dealing with this stuff in volume. Of course, we have centralization points, so if you have ephemeral nodes, you have a Kubernetes cluster, you can have a Netdata

33:30

parent. This is how we call it. It's the same software, but you have a parent now, so all the ephemeral nodes push metrics to it in real time. But these are small and only to the degree needed, so you don't need to centralize your entire infrastructure to this. And you can have multiple centralization points, one here, one there. If you use a hybrid cloud, you can have one on AWS, one on GCP, one on Azure, one on

33:55

prem, whatever. So that's the flexibility that you have. You can do the same with the systemd journal: you can have centralization points with the systemd journal's own methodology, which is not in Netdata, systemd-journal-remote and systemd-journal-upload, but it's exactly the same philosophy: centralize to the degree required for your operational needs. Yeah, and not because the system, the monitoring system, requires it. So

34:22

it's your operational needs that require some centralization of some kind. Gotcha. Yeah, so that puts the decision back on you to meet the needs of your business, rather than putting the burden on you because of the tool that you chose. And also you can centralize for high availability. Do you need high availability? Of course, push the metrics somewhere, push the logs somewhere, to have two copies of them, of course. But

34:49

this is the idea. You centralize for operational needs, not because the monitoring system mandates it. Right. Yeah, yeah, I got you. That's wild. I feel bad saying this, but it really is mind-blowing to me because it just makes so much sense. And I'm just shocked that it's so obvious. But here we are. That's the beauty of it. You know, the monitoring systems evolved. Initially we had

35:29

Nagios and the like, which were check based. So they run a check, they take a result. Together with the result, probably they have one or two metrics or some log line or something, a string to say what's wrong, and the check has a status, one or more statuses; it may be healthy, unhealthy, warning or whatever. All the systems, Nagios, Zabbix, Sensu, Icinga, SolarWinds, PRTG, all of them are

36:00

in this first generation of check-based systems. Right. Then the world went to a technology that is metrics and logs, so metrics databases, logs databases. It's not checks anymore. We gather the information, we put it in the database, and then go analyze the database. The problem with this is that it struggles at scale. It is generally expensive to run, so you need

36:29

to filter, you need to lower the resolution, cherry-pick metrics, lose some metrics, some information, in order to have a performant system, a performant centralized system. Of course, the guys at Datadog and Dynatrace and the like took this philosophy and built integrated environments, very nice integrated environments. So what we do is, I think

36:58

it is the next generation. It's the next evolution. So we take this philosophy of metrics, logs, et cetera, where you don't do checks, right, but the checks are on the logs and the metrics and the data, and we made it distributed. So we did it in a way that it's still integrated, it's still one infrastructure, but we eliminated all the problems that forced people so far to provide low-resolution insights or eliminate several insights from it, et cetera. And I think that we did

37:37

it in a very efficient way. So the way I think of it is that Netdata is the next evolution of monitoring systems, and it's nice to see. You know, one aspect that we didn't speak about so far is the following. You know, monitoring systems are metrics, logs, traces. We don't do traces yet. I think we're going to start soon, and it's going to be also distributed. But metrics, logs, let's say,

38:05

and traces when you have microservices, when developing microservices. What we found out is that there is a lot of information that is neither. It's not metrics, it's not logs, it's not traces. It's all the sockets, for example, all the network connections that the system has, it's all the processes that run, it's all the files that are open. So the idea is that we created a mechanism where

38:37

our collectors, the plugins that collect data, so they connect to Postgres, for example, to collect some metrics, also expose a function that allows the dashboard to say, okay, show me the slow queries of Postgres. So query Postgres for the slow queries and give me a list of the slow queries that are currently running. Similarly, give me all the network connections, the outbound connections or the inbound connections, or the listening

39:07

sockets. Show me all the processes that are running. So the idea is that at the end we created a monitoring tool. It's the original idea, when you're trying to kill all the consoles. Who still does the consoles, right?
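
The "function" mechanism described here, where a collector that already gathers metrics also answers on-demand questions from the dashboard, could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Collector` class, the `expose` and `call` methods, the `slow-queries` name, and the in-memory `RUNNING_QUERIES` list are all hypothetical stand-ins and are not Netdata's actual API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Collector:
    """Hypothetical collector that can also answer on-demand questions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._functions: Dict[str, Callable[[], List[Row]]] = {}

    def expose(self, function_name: str, handler: Callable[[], List[Row]]) -> None:
        # Register a named function that a dashboard may call later.
        self._functions[function_name] = handler

    def call(self, function_name: str) -> List[Row]:
        # Run the function at request time; nothing is stored in advance.
        return self._functions[function_name]()


# Assumption: stand-in for what a live database would report right now.
RUNNING_QUERIES: List[Row] = [
    {"pid": 101, "seconds": 0.2, "query": "SELECT 1"},
    {"pid": 102, "seconds": 12.5, "query": "SELECT * FROM orders"},
    {"pid": 103, "seconds": 3.1, "query": "UPDATE stock SET qty = 0"},
]


def slow_queries(threshold_seconds: float = 1.0) -> List[Row]:
    # Keep only queries at or above the threshold, slowest first.
    slow = [q for q in RUNNING_QUERIES if q["seconds"] >= threshold_seconds]
    return sorted(slow, key=lambda q: q["seconds"], reverse=True)


postgres = Collector("postgres")
postgres.expose("slow-queries", slow_queries)

for row in postgres.call("slow-queries"):
    print(row["pid"], row["seconds"], row["query"])
```

The point of the design, as described in the conversation, is that this kind of detail is fetched only when someone asks for it, so it costs nothing while nobody is looking.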

39:27

Yeah, because that's, again, that's additional context. You know, you get your list of slow queries from Postgres, and eventually, if you don't quickly see, oh well, this is a query problem because we're missing an index or whatever, you do end up, you know, SSHing into that system and

39:45

running top and checking the open file handles and all of that stuff. Because no one can afford to run full-time monitoring at that level of detail using some of the other monitoring tools. Like, you would bankrupt your company trying to collect and store all of that data. Exactly, and this is exactly what Netdata changes. So it's more cost-efficient to run Netdata than running any other. Commercially, it's more cost-efficient because we don't need resources,

40:20

you don't need skills. It just works. Yeah, and the information is just there when you need it. But it's not hurting when you don't need it. It's not costing you when you don't need it. Exactly. Wow. So what does the onboarding process look like to start using Netdata? So today we have, so there is the open source agent. We have two paths for how people can join the community. Two paths. One is you are a GitHub fan, an open source fan or whatever, an enthusiast.

41:01

You go and install Netdata, the agent. The agent is monitoring in a box, so by itself, on a single node, it will do everything: dashboards, alerts, everything, it will do them for you. That's one path. And then you do the second installation, the third, and then you realize that you can build parents or you

41:21

can use Netdata Cloud to unify an infrastructure. The other path is that people, companies that are looking to replace their monitoring system, they go to the first five through Google or whatever, they find our site. So they go to our site. There is a trial where they sign up to Netdata Cloud this time, and Netdata Cloud instructs them to install agents with certain keys, et cetera, in order to link them with their own space,

41:52

their own account. So in both cases you start for free. Netdata Cloud is a thing that does not centralize data. So the data are still distributed, but Netdata Cloud is a layer that builds a map of your topology, so it knows where the nodes are and what retention they have and what metrics they have, without centralizing the metrics themselves, the values. And then when you go to query data, Netdata Cloud says, okay, I'm going to query this, this and that node to get the data, merges the data, and presents them to you.
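
The idea of a layer that keeps only a topology map and merges results at query time could be sketched like this. It is a minimal illustration under assumptions, not Netdata Cloud's implementation: the `Node` and `CloudLayer` classes, the metric names, and the choice to merge by summing values per timestamp are all hypothetical.

```python
from typing import Dict, List, Tuple

Series = List[Tuple[int, float]]  # (timestamp, value) pairs


class Node:
    """Hypothetical agent that keeps its own metric values locally."""

    def __init__(self, name: str, metrics: Dict[str, Series]) -> None:
        self.name = name
        self.metrics = metrics

    def query(self, metric: str, start: int, end: int) -> Series:
        # Answer from local storage only.
        return [(t, v) for t, v in self.metrics.get(metric, []) if start <= t <= end]


class CloudLayer:
    """Hypothetical layer that stores which node has which metric, not the values."""

    def __init__(self) -> None:
        self.topology: Dict[str, List[Node]] = {}

    def register(self, node: Node) -> None:
        # Only metric names are recorded in the topology map.
        for metric in node.metrics:
            self.topology.setdefault(metric, []).append(node)

    def query(self, metric: str, start: int, end: int) -> Dict[int, float]:
        # Fan out to the nodes that hold the metric and merge at query time.
        # Assumption: values are merged by summing per timestamp.
        merged: Dict[int, float] = {}
        for node in self.topology.get(metric, []):
            for t, v in node.query(metric, start, end):
                merged[t] = merged.get(t, 0.0) + v
        return merged


cloud = CloudLayer()
cloud.register(Node("web-1", {"net.in": [(1, 10.0), (2, 20.0)]}))
cloud.register(Node("web-2", {"net.in": [(2, 5.0), (3, 7.0)], "cpu": [(1, 1.0)]}))

print(cloud.query("net.in", 1, 3))
```

In this sketch a query for `cpu` touches only `web-2`, because the topology map already says which nodes hold that metric.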

42:25

Okay, all of these are in real time. All these, yeah, there are no latencies there. All these are very, very quick. So the idea is that people can start either from the open source world, go to GitHub, download the software, install it, or they can, from the commercial world, go to Netdata Cloud, sign up for a trial. Again, you're going to download the agents, et cetera. So Netdata Cloud, our commercial offering, let's say, uses the agents as a distributed base, so we don't have

43:00

different enterprise agents. It's the same thing as the open source software. But what it provides is the following. First, it allows you to scale your infrastructure horizontally. So if you don't use Netdata Cloud, the only thing you can do is build a bigger parent, bigger and bigger and bigger, to aggregate all the infrastructure there. Like the old traditional centralized systems; it scales better than them, but still it's one system. This

43:30

is one way. With Netdata Cloud, you can have as many independent centralization points or individual standalone servers as you want, and all of them will become one at query time. The second is access from anywhere. So with the agents, you need to hit the IP of the server to access the dashboard. With Netdata Cloud, it doesn't matter. You log in to app.netdata.cloud and

43:51

then you access everything. Netdata Cloud gives you a mobile app for push notifications, so all the alerts across the infrastructure will be pushed to your mobile app, Android and iOS. And also it dispatches alerts centrally. So instead of all the agents dispatching alerts to Slack or PagerDuty or whatever, email, whatever you use, Netdata Cloud receives all the transitions, deduplicates them, and then dispatches the alerts centrally. There is a free tier too, for

44:25

home users, et cetera. There is a small free tier on Netdata Cloud. But overall, you know, mainly because we decoupled the cost of observability from the monitoring itself, it is significantly more cost-efficient. So if you go for the commercial offering, if you go to Datadog and the like, you start at twenty or thirty dollars a month per node, right? Three. Oh wow. And even with Datadog, when you start at that level, you quickly learn that that's only the monitoring cost.

45:13

Like, once you start bringing in the data, there are data charges as well, and those others. I actually have it here, because we did this comparison of the resources that Datadog, for example, needs versus Netdata. CPU usage of the Datadog agent: fourteen percent; Netdata: three point six. Memory usage: Datadog, almost a gigabyte, nine hundred and seventy-two megabytes; Netdata, one hundred and eighty-one. Egress per node: eleven gigabytes per month per node. Netdata, nothing, it doesn't need any

45:54

egress bandwidth. So it's more expensive, and you put more resources into it, compared to... Yeah, wow. So, you know, we mentioned early on in the podcast that your background started with retail transactions, point-of-sale systems, which are very high throughput, and if you've ever worked in retail with a customer, they're not the easiest things to deal with. So you come from a high-stress environment.

46:38

Is that indicative of the type of customers who really embrace Netdata, or is there a specific industry that you work really, really well with? I think we have people, we have businesses from all over, all industries. So we have people from health care, we have people from manufacturing, we have people from technology, a lot of technology of course, right? So I think that the key point here is, look, if you go to a DevOps guy that wakes up at three am...

47:15

We have another like this that says, I love three am. If you have a guy that wakes up at three am, you have to understand that at three am he's pissed off. So you have to be real time. We have many big companies, really, among the Fortune five hundred, say, companies that don't accept the latency that others provide. Yeah, a minute of latency is what? No, at three am, I will not wait one or two or three minutes to see if I fixed it or not.

48:00

I want to see it now. Yeah, okay. So that's the idea. People have to value the fidelity, the insights. It's like people that want to learn more, to understand more, to feel the infrastructure more. These are our best clients, best customers, best users. On the other hand, people that are afraid of this thing: if it works, don't touch it; let's reboot it. That's the funny thing, because many people do that. Oh, it doesn't work,

48:37

let's reboot it. It's okay. Then Netdata is not that useful for you. You actually need the first kind of monitoring, the check-based systems, where you just have the light or not. So another thing that strikes me as being really cool about this is that you've got all of the observability data in one spot, because that's been one of the other challenges I've had over the years: you get the alert,

49:15

but now it's just an alert. You don't really have the context, so you have to go somewhere else to get the context of what, why did this thing alert? What does this mean? And there's another spot. For every alert that we see, we have a community site that people can go to and see what others did about this. So we have a CTA on the alert that goes to the forum about this alert, where people have this alert. We have an introduction that we wrote ourselves. This is what

49:51

the alert means. This is what you should do: if you have this system, do it like this; if you have the other system, do it like that. But then people discuss in the forum. That's wild, because, you know, you get alerts for all kinds of things, and then you add to that that you're getting paged at three am. You know, you might need a little help there to understand the context. So the thing with Netdata is that the community is vast. So far we count more

50:22

than ten million users. The community grows by about five to ten thousand new users a day. Even Docker Hub downloads, for example: we have about one hundred and fifty to two hundred thousand Docker Hub downloads every day. Even on our SaaS offering, we have about one hundred and fifty to two hundred sign-ups every day, businesses. That's a lot for a lot of projects. And not only that, for example, the love that we see from users is

51:00

extreme. Netdata, in terms of user love, stars, for example, we lead the CNCF observability category. We surpassed Elastic in October. Now we are leading the observability category, although we are not incubated; CNCF does not endorse Netdata. But it's the most loved project, let's say. Yeah, that's just crazy. I mean, and just talking with you, I can see why you're getting those kinds of numbers. I'm still just trying to wrap my head around it. It

51:38

almost seems too good to be true. I'll just be honest with you. It's like, what's the catch? Yes, what's the catch here? I will tell you. Netdata is not mature yet, so we're building. The dashboards are pretty new, less than a year old. Even our database, it's the third version of the database, it's from last year. So we have not built that many tools on top of the monitoring infrastructure yet. So the baseline is there.

52:10

You have high fidelity, you know, unlimited metrics. You have all the building blocks to actually do the work. We are lacking in a lot of high-level tools, right, that we're building now. That's okay, yeah, for sure. The good thing is that the foundation is very good. Yeah, and it's open source. What does the open source community look like? Do you get a lot of pull requests and input? I will tell you, we have a lot.

52:45

We have about four hundred fifty to five hundred contributors. Some of them, very few, are very skillful. Because we write in C. If you remember, the core is in C. It's not easy. In the early days, a lot of the community helped, because, come on, I used to be an engineer in the nineties. Then suddenly in twenty fourteen, fifteen, I became an engineer again, so I was rusty. You can understand that. So I've been in technology all these years, but I was a COO

53:23

all my time, it was CTO stuff like this. But suddenly I had to write code myself, and people helped me. So a lot of people stepped in and showed me how this is done, and, you know, pushed me a little bit beyond my limits. All this happened. But I think that today Netdata is a mature open source project. It's very robust. We also monitor crashes and the like. It's very nice, it's very reliable software. So I think as time passes it

54:04

becomes incrementally more difficult for people to contribute code. Yeah, and actually, what happens now: we're using systemd-journald. We submitted patches to the systemd repositories to make the systemd journal fourteen times faster. So we are also a community contributor to that thing. So the systemd journal that you're going to have in the next version is going to have patches from Netdata inside, to be fourteen times faster than it was.

54:39

See, and that's cool. That part of open source really gets me excited. You know, where you're consuming an open source thing and you're like, oh, this can be better, and rather than forking it or creating your own, contributing back to that, I think, is a much better idea. You want them to. You want the software to be maintained and high quality. And actually, the work of applying

55:09

patches to your version, come on, that's not good. Yeah. It's better to have it there, and they maintain it from now on. For sure. Yeah, because, yeah, I've seen that trend over the last few years of people patching or forking or creating a competing product, and I think we would all be so much better off if you just contribute back upstream, because you get all of the benefits of your work, plus the benefits that

55:39

everyone else in the community is contributing as well. If you think of it, Netdata, at the end of the day, because it's monitoring out of the box, is opinionated monitoring. So what happens is that when you install Netdata in your infrastructure, your monitoring team is Netdata. We are your monitoring team. You're just a consumer, right, using it. The same way we want systemd to be there for us. Exactly. It's utilizing the whole community

56:12

in order to provide high quality to end users. Yeah. And that's really key, because very few of us, well, none of us really, except for the people who work on systemd, are getting paid to create, you know, the journal. Like, our customers aren't interested in our journaling, they're interested in the product. I think for projects like systemd, this is easy, because most of the engineers that work there are paid by other companies,

56:45

by their company, and work on that thing. So other companies, Red Hat or Ubuntu, this or that, they contribute resources that are working dedicated on this stuff. Yep, that's okay. So their job is to work on systemd, but they work for some other company that uses systemd. Right. For a startup like us, because we are a startup still, we just started, we started monetizing five months ago. Oh wow, okay, this is pretty new. Yeah, very. So for us, we cannot dedicate so many resources. But

57:23

for sure, the whole point is the community. The whole point is to aggregate, to unify the community, to gather together a lot of value for the users. Yeah. So, early-stage startup. What's the future? What's on your roadmap for Netdata? Oh, you know, monitoring is endless, endless, and there is no way for this to finish. What we are trying to do currently is cover as much as we can, because we want to grow our sales. We are not sustainable yet, so we still need

58:08

to prove that Netdata is sustainable and it deserves more. So in this process, the idea is that currently we are looking to help the users that have a budget. This is our main thing, because this will allow us to survive. Once we pass through this, I think the next steps will be to address tracing, provide the high-level tools that people need, become more

58:40

contextual in the user interface, and the like. I think that Netdata has the winning mix, the winning combination in product design: easy, higher resolution, higher insights, higher fidelity. The overwhelming part that you mentioned, we already know how to fix it, so people will not feel overwhelmed by the amount of information if we switch to a more contextual approach at the

59:07

presentation level. So we're going to switch from overwhelming to very deep. Right. So these are changes that we need to make as we progress. But I think the first thing is for people to try Netdata, see Netdata, and if they use Netdata in production environments for commercial purposes, to buy a license, because this will allow us to continue. Guys, we are here to provide value. And for sure, if people don't buy the value, then what are we doing here? Yeah.

59:44

No, and it's a proven business model, you know. Datadog and Dynatrace and those other companies have proven that there is a market for this, and so if you're able to tap into that market and provide better service at a cheaper price... Yes, it's not only cheap, it's the overall cost that is better, all of it, across the board, so even cost optimization. We see a trend toward the unification of

01:00:15

the monitoring tools. Mostly because monitoring tools are not that comprehensive. Let's say they have certain aspects that are very good, but they're very bad at others. Mostly for this reason, people use a lot of tools, right? So people are trying to consolidate tools. They are trying to save money, become more efficient in troubleshooting. This is what we do. This is

01:00:42

exactly what we're trying to do with Netdata. Simplify monitoring, make it accessible to everyone, let them focus on their job, not on a monitoring tool; nothing to learn, just some familiarity with the tool and you're done. Stuff like this. For sure. Yeah, simplification of the tool is huge, because we're all getting increased responsibilities and increased scope of work. And technology becomes more complex. Really, absolutely. As time passes, the complexity skyrockets.

01:01:17

You cannot fit it in your head anymore. So you need the tool to be smarter. Instead, look, get smarter tools, tools that do stuff by themselves. Yeah, and to me, that's a huge, huge win here. Just providing the information in context, what do I need to know at this given time, rather than me having to go find it, something that presents it is huge. Well, this has been an eye

01:01:47

opening conversation. I'm excited. I mean, monitoring, let's be honest, is one of those hard things to get excited about, but this is kind of exciting. It changes the dynamics, it changes everything. Yeah, it does. I mean, I think back to, like, a few decades ago, configuring Nagios and Zabbix, and thinking about how that just led to this here.

01:02:16

It's pretty cool. It's definitely something worth checking out. So thank you for joining me today. Thank you for being here. Thank you for inviting me. It was great. I hope you enjoyed it. I did. I thoroughly enjoyed it, because this is a perspective on monitoring and observability that I never had. I feel like I'm walking away from this conversation with a completely different set

01:02:45

of goals for how I'm going to approach this problem in the future. Like, you know, especially in the startup space, you hear the term disruptive thrown around a lot. But I mean, this one kind of fits that category, and it's just such an odd place, like, who thought, you know, you could disrupt observability? But here we are. Yeah, let's hope it works. It remains to be seen, because, you know, the devil is in the details. For sure. We work hard to smooth everything out,

01:03:20

to make it as perfect as possible. Yeah, but it's a big... only the integrations that we have, you know, it's eight hundred integrations. Come on. And we're a small team. We're thirty people. Wow, that's impressive. That's impressive. So for all of our listeners, if they want to find out more about Netdata, where can they go? netdata.cloud, or Netdata monitoring on Google, or search on GitHub. Netdata is at github.com/netdata/netdata. I think it's

01:03:54

quite popular, so it's easy to find. There's a big community, they speak about it, and we have Reddit. Also there is an r/netdata. So yeah. Right on, sounds good. Costa, thank you so much for joining me today. This has been eye-opening and I really appreciate it. And to all the listeners, I hope y'all found this useful. I hope you go check out Netdata. I know I'm going to, and I will see y'all next week. Bye.

Transcript source: Provided by creator in RSS feed

The NetData Impact: Transforming Real-Time Monitoring and Observability in Modern Tech Operations - DevOps 203
