It's a hook. Oh, my lady, I am ready. Okay, and action. No, wait, wait, props, props. Okay, I've got my hands ready. Okay, so we've started. Hello everyone and welcome to the 11th... well, the 11th, well, yeah, because we skipped, you know, the last one. No secret, we didn't skip anything, don't tell anyone. So we don't have a YouTube episode for it, it got screwed up.
Okay, so we are on the 12th episode. We've done so many episodes that we lost count, it's more like the 112th, we just started counting at a hundred, something like that. Yeah, something like that. So today we are going to talk about observability. Yeah, absolutely, it's a beautiful topic. Right, so that's today's topic, and of course the same question each time. Oh man, oh man, what time is it right now?
What's the first thing that comes to your mind when you think of DevOps and observability? What do you think about that? Okay. Actually, when it comes to observability, since it's one of these buzzwords, we often choose topics like this, like platform engineering, and it's not obvious, right? Because when you say platform engineering, that's another topic:
what's the first thing that comes into your mind? I don't know, maybe a platform team or something. When you say observability, for pretty much anyone, I think the first thing that comes to mind is monitoring, right? Because a big aspect of our day-to-day job, wherever you are, whether you're ops, DevOps,
SRE, whatever you call yourself, is that you need to monitor services, applications, servers, machines running in the cloud, any kind of resource you want to monitor. And the reason you do that is to inform yourself and the people around you of the system's status.
And the system's status is not only about understanding what's going on, it also affects the business, right? Because if the systems are down, the business is down; if the systems are up, the business is functioning. Sometimes you also want a status board that tells your customers whether everything's fine and whether the application can be worked with or not. So that's the first thing that comes to mind.
But I think observability, or actually I know that observability, is a bit more than that. So monitoring is basically just a piece of it? Yeah.
Just a bit. Just a bit. Okay. Yeah. Okay, so are you going to talk about what monitoring is now, or more than that? So I'm saying monitoring, first of all, is just the notion of having something visual that I can look at, and maybe connecting some kind of alerts to it, so I'm actually monitoring it like any other kind of monitoring: I know the health of the system and I can be alerted, automatically or not, if something goes wrong.
If something goes wrong, I think usually that's a certain threshold on a metric, CPU above 60–70%, or my memory is skyrocketing, or anything like that, and then I'm alerting someone, some people, I don't know. Observability, however, is a bigger-picture view of everything, and it's beyond that. It's going through the entire chain of knowing what's going on in the system, and by that I mean, first of all, you need to start somewhere, and that's
collecting, right? I need to collect metrics and logs and everything in order to do something with them afterwards. After you collect them, you transfer them into something, that can be your monitoring system, and once you have the metrics and logs and everything collected in the monitoring system, you probably want to analyze them, right, and get some insights. So beyond just alerts like "60% CPU, do something, save me", you maybe want to extract certain things from the logs.
You want to monitor performance. Maybe you have some kind of tracing ability, to know whether you have bottlenecks in your system. So it's a bird's-eye view of your entire system, and it requires you to handle the entire chain: collect my stuff, store it, monitor it, extract metrics, but then also analyze it and extract insights from that. So that's the big notion of observability. Let's double-click. We like double-clicking. Are you ready for the double-click?
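(For reference, this is roughly what the kind of threshold alert described above looks like: a minimal sketch assuming Prometheus-style alerting rules; the metric selector, namespace, and alert names are illustrative, not from the episode.)

```yaml
groups:
  - name: example-cpu-alerts            # illustrative group name
    rules:
      - alert: HighCpuUsage
        # Fires when average CPU usage for the app stays above 70% for 5 minutes
        expr: avg(rate(container_cpu_usage_seconds_total{namespace="my-app"}[5m])) > 0.7
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average CPU usage above 70% for 5 minutes"
```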
I'm ready. Okay, you said extracting. All right, so when you say extracting data, I'm thinking, well, let's say my application sends requests, or receives requests, and everything, and I want to measure my application. I just want you to answer this because I think it's a good question-and-answer, so people will be aware that they can create something special for their application. So what if I want to extract data and measure my application?
What should I do? How can I monitor my application when it comes to metrics, logs, and so on? What can I do? Okay, so, like with every question we've ever asked: it depends. I think the obvious place to go is a container orchestration system, usually that would be Kubernetes, and in Kubernetes the standards are pretty much the same. It's usually something like having a collector, maybe Filebeat or something like that, that collects the logs. With containers
it's relatively easy because you don't have to actually know the application; it's a container, so everything that goes to stdout, to your output, is just stored as a text file generated by the container, and then Filebeat, or whatever collector you have running on the node, is collecting those logs and shipping them somewhere. Sometimes you'll have something in the middle, like Logstash, which is a transformation component that can
make changes to your text logs on the way, on the stream, and then you'll store them somewhere. Maybe you use Redis in the middle to aggregate the logs and take some of the load off your storage system, and then you store them in a place like Elasticsearch. By the way, everything I'm mentioning is the standard ELK or EFK today, which stands for Elasticsearch, then Logstash or Fluentd/Filebeat or whatever, and then what we have at the end,
Kibana, for visualizing and probably producing metrics and other things. Sorry. So that's the standard, right? If you want to do something like that, you probably want to have
ELK or EFK in your system; in Kubernetes it's pretty standard. In Kubernetes, along comes a system like, sorry, Prometheus. Prometheus can monitor your applications on the cluster and can actually read a set of metrics from text endpoints by crawling them and analyzing them later on in a time-series manner. So for example, I can monitor a certain application every minute, or every second, or every ten seconds, and grab a set of metrics,
and then I'm holding them with their timestamps in a time-series database, in that kind of model. And then I can visualize it in a graph and tell you, okay, this metric, maybe the number of registered users you have in your application, went up this way and then that way, etc., etc. So that's another way of doing things: monitoring application metrics. I hope, I think you're going where I want you to go. Exactly, so when you speak about Prometheus,
you know, the time-series database, and we spoke about extracting data from applications, I just want to make sure: are you referring to writing your own exporter for your application, so it will expose the metrics to Prometheus? Or are you saying there's maybe a generic exporter that can extract data from your application? Because I'm only aware of writing your own, you know, like you need to instrument your
application, then dump data to some endpoint, and then Prometheus will scrape it, you know. Yeah, so I think we shouldn't make this entire thing about Prometheus, but specifically, if I'm not mistaken, I think Prometheus was developed at SoundCloud and they kind of made up this standard of having /metrics on your application.
It doesn't have to be /metrics, that's just the convention, and then you just use a key-value system of text, where the key is the metric name and the value is the numeric, or whatever, value you use. And that's being collected by Prometheus; the standard way is that it crawls you for /metrics. In Kubernetes
it's done by adding a certain annotation, I think, on the application, and then it knows that it needs to scan it, look for /metrics or another endpoint, and just collect everything that's being exported. On the other hand, yes, like you said, we can write an exporter, maybe have something that, instead of being crawled,
intentionally pushes into Prometheus, as another option. That's just the way of handling custom metrics on a cluster. So we touched on collecting, monitoring, analyzing, kind of, because we talked about visualizing with Kibana, or adding Grafana to make it even nicer. By the way, did you know that Grafana is a fork of Kibana? No. I found that out a few months ago, kind of funny; I don't know why I remember it, I just do, so there you go.
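(To make the /metrics convention concrete, here's a minimal sketch of an app exposing Prometheus-style metrics, assuming a Node/TypeScript service with Express and the prom-client library; the metric name, route, and port are illustrative.)

```typescript
import express from "express";
import client from "prom-client";

// Collect default process metrics (CPU, memory, event loop, etc.)
client.collectDefaultMetrics();

// An illustrative custom metric: count of registered users
const registeredUsers = new client.Counter({
  name: "app_registered_users_total",
  help: "Total number of registered users",
});

const app = express();

app.post("/register", (_req, res) => {
  registeredUsers.inc(); // increment on every registration
  res.sendStatus(201);
});

// The conventional scrape endpoint: plain-text key/value exposition format
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);
```

(On the Kubernetes side, annotations like prometheus.io/scrape and prometheus.io/port are a common convention, but whether they're honored depends on how your Prometheus scrape configuration or operator is set up.)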
That's new to me. I also want to add, about those exporters, you know, in case we don't want to write custom exporters: let's say my application is using MongoDB or any other, you know, cool database. There are native exporters, maybe provided by the vendor itself, you know, for the database, like the Redis exporter. For any other application you want to monitor, you can use their exporter instead of writing your own.
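(As an example of an off-the-shelf exporter, a hedged sketch of running one, assuming the commonly used oliver006/redis_exporter image; the container and Redis addresses are illustrative.)

```bash
# Run the community Redis exporter next to your Redis instance;
# it exposes Prometheus metrics on port 9121 by default.
docker run -d --name redis-exporter -p 9121:9121 \
  oliver006/redis_exporter --redis.addr=redis://my-redis:6379

# Quick check that metrics are being exposed
curl -s http://localhost:9121/metrics | head
```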
You know, so that's also something cool you can do if you want to add a monitoring layer to your application. Correct. Okay, let's talk about logs, because I'm scared of logs. Okay. Because if you make a mistake with logs, yeah, remember what we talked about last week, like debug trace is set to true and then you get gigabytes of logs and then you pay tons of money. Okay, here's a question for you. Yeah, okay. It's a tough one. Do you manage or... no, I'll just give up. Okay, so it's an easy one.
Okay, I'll ask you an easy question. It's a question. I'm good with any question. Yeah. Okay, so it's easy. Do you prefer to manage all your logs in one system, or do you separate the logs into, let's say, two, three, four different systems depending on the needs? Like, do you push to have a one-stop shop for all of your logs, or maybe split it between CloudWatch and, you know, Grafana or Loki, trying to split everything into
a one-stop shop, or splitting it per environment, per application? I would like to have one central system to use for everything, but I would like to have some kind of segregation within it. With CloudWatch that's a bit harder, although if you have separation between different accounts that actually monitor different environments, it's relatively simple. If you use a service provider, we can name names, but there are tons of them, like I can name
five companies just off the top of my head that manage logs. They all know how to manage different environments; that's pretty easy. You can filter them with tags, prefixes, depending on the convention you're using in the system, but
to answer your question: yes, I'd like to have it in one system so I can query everything. And another layer on top of that: sometimes there are things I'm not even aware of; when I'm querying something, I might stumble upon things I didn't even mean to find, maybe I see exceptions I didn't even mean to search for,
maybe on another application, and because I'm using the same aggregated system I might become aware of them, whereas if I had a different system monitoring a different set of applications or a different set of environments, I would not see that. So I have a follow-up question about that. Since you prefer, you know, having one system to hold them all,
okay, to hold all the logs, yeah: if, let's say, QA or some other team decides to do some, you know, load testing, stress testing, something like that, the logs are going to get crazy and then you're going to pay tons of money. If you chose, you know, a cloud provider that provides you logs, usually they charge a lot for each push; the storage is usually very cheap, but pushing logs costs a lot of money.
So let's say I just bombard your application with tons of logs because of, you know, massive usage, stress tests, load tests, and since you're using the same system for staging, development, and production, you're going to pay a lot of money just because you're testing. So how can we avoid that, if we can?
Okay, yeah, I've been smiling ever since you started asking the question, because all I have in my mind hearing it is that I want to use this platform, and this entire crowd that is surely with us today, and say: please. Yeah, yeah, developers of the world, please stop pushing debug equals true to every production application that ever existed. It's just...
Please, please, please don't set it hard-coded in your configuration files. Please don't do it. Even if you use some kind of config system that you pull it from, don't do it hard-coded. Please, please, please do it based on your environment. Debug equals true is the one flag that you think doesn't matter much, "but I want to debug my application, I want to see what's going on." It's okay to switch it on for a little while.
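(A minimal sketch of the "do it based on your environment" idea, assuming a Node/TypeScript app and the pino logger; the variable names are illustrative.)

```typescript
import pino from "pino";

// Pick the log level from the environment instead of hard-coding debug=true.
// An explicit LOG_LEVEL wins; otherwise default to "debug" only outside production.
const level =
  process.env.LOG_LEVEL ??
  (process.env.NODE_ENV === "production" ? "info" : "debug");

const logger = pino({ level });

logger.debug("only shows up outside production (or with LOG_LEVEL=debug)");
logger.info("shows up everywhere");
```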
But nine times out of ten, when you do that, it kind of perpetuates into every single part of the application, and exactly as you said, everything just starts blowing up. And that's even without debug, you know. Let's even leave debug alone. I'm thinking about, let's say, a stress test, or maybe one of the QA members, and you know, I think the developers and QA members should work freely; if they want to work, they can do whatever they want.
I don't want to limit them just because the logging system costs a lot of money, right? So to me, my solution for that is: yeah, sometimes I will separate, you know, production, maybe using some expensive cloud provider for logs, and for my staging and development I will maybe use Loki and Grafana instead. You know, it's a good point, but it's a problematic point.
And that's a philosophical discussion. Why? Yeah, that's why we are here. Yeah. Okay. Let's say you do exactly that, and I am...
We all do that; I'm a sinner myself, literally today that's what we do. We have something switched on only for production, exactly like you said, and it's not a preference, it's only because of costs. But having that in place means I don't have the same quality of visualization or logging system on my staging, and that means a lot of bugs are probably finding their way through staging into production solely because I don't have that system in place.
And then you can say, oh yeah, I didn't switch it on because it costs money, but what about the cost of having a bug in production, and maybe a customer churns over it, or maybe engineering hours are now stretched into the night and someone has to sit and fix it for hours, days, weeks? So I think the best practice, as with everything, is having exactly the same setup at least for staging and production. Right, if you have other environments, like
QA or UI environments and other things like that, it doesn't always make sense to replicate everything, but at least for one environment that's not production, I think it's wise to switch things on and have it exactly the same, because that's the one gatekeeping process, let's say, before getting to your production.
Okay, I'm buying that solution. So assuming I'm quite a big startup, I'd say medium to big, or a company: I have, as you said, development, QA, and whatever other development environments, and for those I'll use an on-prem or self-hosted, probably not on-prem but self-hosted, logging system, and the cloud provider for production and staging. Okay, cool. What if I'm so, so, so, so small and I only have two environments, staging and production?
Would you recommend putting them both with the same cloud provider? Let's say, you know, some startups, because you know how it is: we can talk about how we should have separate environments, we should have whatever, but some people are listening to this and saying, dude, I have staging, which sometimes also serves for development and stuff, and I only have two environments. What should I do then? I also want to help those people. So again,
I don't think it matters how big you are; it only matters how crucial production is for you. And that's basically a byproduct of how many customers you have, or how critical it is in the lifecycle of your company. If you have one customer and it's super critical, if that customer will be the breaking point of whether this company is going to succeed or not,
if that's the make-or-break point of the company, you probably do want to follow best practices. That said, it might not be worth your time separating the environments completely if it takes, I don't know, two weeks of work when you have one engineer working on the project. So it very much depends. Look, the best practice is to separate environments,
but that probably won't be the thing that kills you, so you make your decisions. One thing I did want to say, though, in terms of having the same systems, the thing you asked before and we talked about: if you're just one engineer, or have one customer, or no customers at all, it's probably likely you're going to use the free tiers of everything. Just to name names: New Relic, Datadog, Elastic, all of these, you can probably use their free tier and enjoy it and get really extreme value out of it.
So I'm just putting it out there, and I'm building on top of that point to add another layer I think is worth discussing under the umbrella of observability, which we haven't touched yet, and that's APM. I was reminded just because I named New Relic, so that's just one example. So APM stands for application, or something, performance...
Performance monitoring, yeah, something like that. Anyway, it's a small component you install inside your application that can yield a ton of metrics and insights telling you what's going on with the application. It can let you know what's going on with the CPU cores and the memory, but that's just a small part of it, and that's the relatively easy part to understand. It will find bottlenecks, do tracing, and let you know how things perform down to the granularity of a single function.
Depending on the solution, it's immense. I mean, I can't recommend it enough. It's very hard to implement in serverless, and probably costly, but if you're working in containers, it's relatively easy, I think, depending on the language or the runtime you're working in, right? But there are tons of solutions out there that can help with probably most of them.
So that's one layer that's important to mention: APM is crucial to your success and your ability to, let's use the term we're discussing, observe. It will help you observe what's actually going on, because if you have a problem that you can't find in the logs or the metrics, they don't tell you the entire story; you can actually drill down into the metrics of a specific application and see what's going on.
Maybe you have a memory leak, maybe one of the functions is killing you, maybe, real story, you have a huge regex that's stopping your application from working. So that's it, APM, APM for the win. APM. Do you know the terms... Okay, thank you for that. Do you know the terms, maybe,
black box monitoring and white box monitoring? I'm telling you, you've heard of them. Nope. Okay, the funny thing is, you know the terms, like you know what they both are, but you just don't know that these are their names, you know? Okay, I'm sure you know them both. Okay, so white box labeling...
White box labeling, white box monitoring, you know, I was thinking "white label", I got all mixed up. So white box monitoring means that you know what you are monitoring, like you just said: the CPU, memory, errors, everything that you know about and monitor, and then you get alerts according to that. Black box monitoring, and we'll probably do a double-click on that: can you guess what it is? I'll give you a clue: do you know the blackbox exporter?
No, I don't think so. Okay, so black box. Okay, I think it's even easier to explain with an example, right? When you check an endpoint, it's like saying: I know that if I query my application at /health, a real health check, not just a /health that returns "okay", it will return something, and I know that this check does a lot of things behind the scenes, you know,
this whole black box, and then I know everything is okay. Okay, so the blackbox exporter is one of those Prometheus exporters where you can just provide the URLs of your application and then mimic, maybe, your end users' usage of the application against some of the APIs, to see whether your APIs are healthy or not. So what do you say about black box monitoring? Do you use it? Do you love it?
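(For reference, this is roughly how the blackbox exporter is wired into Prometheus, a sketch following the exporter's documented relabeling pattern; the target URL and exporter address are illustrative.)

```yaml
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]                 # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com/health   # the user-facing endpoint to probe
    relabel_configs:
      # Pass the real target as the ?target= parameter to the exporter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the blackbox exporter itself
      - target_label: __address__
        replacement: blackbox-exporter:9115
```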
What should I say about that? It's a great idea, and, well, I think you brought up a point: in my experience, I can't even say 99% because it's really a hundred percent, I've actually never seen someone implement /health like they should. It's always "return 200, okay", which means okay, yeah, good, everything's good. If the process is up, I'm good. Yeah, it doesn't say anything about the system; everything can be on fire, but yeah, I'm good, 200, no worries, go ahead.
Yeah, that's how it feels, right? Everything is okay. Yeah, yeah, I'm good, I'm good, don't worry about it. So, another call to everyone: it's hard, right, we all build applications that just throw back 200, it's the standard convention, but if you have the time and you really want to
understand what's going on, please do implement these checks. By the way, it can also be an attack vector, because if you do a lot of checks, and, I think, /health is usually something that's publicly exposed, unless you're not exposing it publicly and that's another story, but if it is, it means that every time someone queries it you're not only returning 200, you're doing some
backend operations, and if someone attacks you and sends tons of requests, you're probably going down. So it's an attack vector for a DDoS. Okay, how do I avoid that? Right, so instead of doing the checks with every call, just do them periodically and expose the result on the /health endpoint.
You should do it periodically on your own, maybe every second, maybe every ten seconds, but not with every request. That's my take. I thought you'd say something like make it available only internally, or something like that, so maybe /health is public and the major checks are checked internally, you know, on another endpoint, or something like that. Yeah, yeah, no, you're totally right.
But if you're using something external like Pingdom, for example, you know Pingdom? Yeah. Or not even Pingdom, like the status
page of your company, or UptimeRobot. Yeah, yeah, exactly. They all query the /health of your application, so I think you usually do want to have it public, because if you're monitoring something from within your system and the system's down, it doesn't mean anything, right? It doesn't mean crap. So you want something external monitoring you, and for that you need to expose /health. What you can do is use different endpoints:
you can use one for an internal checkup and one for the external,
which can always return 200, right? Like whenever AWS is burning, you go to their status dashboard: yeah, all good, no worries, mate. So, yes, yes. And for that you can also use a security group or NACLs if you want to whitelist specific IP addresses, or maybe CIDR ranges, ranges of subnets, you know, because usually those third-party providers tell you: these are our subnet ranges, you can whitelist those, then give us your
endpoints and we will check them, and blah blah. Not my preference, but yes, you sure can. Okay, just an option, just an option. So black box monitoring, we also touched on that. So we touched on monitoring, we touched on logs, we touched on collection, monitoring, analyzing, APM, black box and white box, the concepts you mentioned for monitoring. We also stopped a little on some tools and applications you can use, and how to build a health endpoint.
Pretty much anything else you want to add? No, I think we could write a book about everything we said.
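(Before moving on, tying the health-endpoint thread together: a minimal sketch of a /health that runs real dependency checks on a timer and serves a cached result, so the check itself can't be used to DDoS the backend. Assuming Node/TypeScript with Express; checkDatabase is a hypothetical stand-in for your real dependency checks.)

```typescript
import express from "express";

// Hypothetical dependency check; replace with real pings to your DB, cache, etc.
async function checkDatabase(): Promise<boolean> {
  try {
    // e.g. await db.query("SELECT 1");
    return true;
  } catch {
    return false;
  }
}

let healthy = false;

// Run the expensive checks on a schedule, not on every incoming request.
async function refreshHealth(): Promise<void> {
  healthy = await checkDatabase();
}
refreshHealth();
setInterval(refreshHealth, 10_000); // every 10 seconds

const app = express();

// Public endpoint: cheap, just reports the cached status.
app.get("/health", (_req, res) => {
  res.status(healthy ? 200 : 503).json({ status: healthy ? "ok" : "degraded" });
});

app.listen(3000);
```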
Okay, so are we ready for the experience corner of the week? The corner, yeah, yeah, yeah. Okay, do you want me to start, or do you want to start? You start. Okay, I'll give you a very, very weird one, something that happened to me this week that I didn't know about, in your favorite language, TypeScript, Node. Okay. So I use Makefiles to automate stuff, even for DevOps-y things, actually, even though it's not even C or C++.
I just love Makefiles, you know. We'll talk about that, oh, you'll find a new love, I have a new love. So, in that Makefile, when you use the production environment, or when it's production,
the NODE_ENV environment variable, you know, the all-capitals NODE_ENV, is set to production. Now, apparently, and I didn't know this, when you run yarn install while NODE_ENV is set to production, it will only install your dependencies and not your devDependencies, so you can't actually build the app. So I didn't understand why my CI/CD was failing, because, you know, the Makefile set this environment variable.
I was like, okay, it's working on feature branches, it's working on development, it's working on everything, so, well, when I say production I mean staging, okay, so when we move to staging, which is like the same as production, why isn't it working, you know? So we were stuck on staging for nothing, you know, for no reason at all. I have to say, it's a good thing
that you do that, because it means you're managing devDependencies properly, it means you're working correctly. So that's a good failure. By the way, it was so annoying, because to figure out why yarn is only installing part of it, you have to go to the Yarn docs and realize that if you set this environment variable it only installs blah blah. So that was my great experience this week. What was yours, an experience or a tool or whatever? Okay, first of all, a comment about what you said: make is great, I love make. There's a better,
maybe, alternative written in Rust. It's called just, so just look for just on GitHub later, I would look at it. It has an easier syntax and it makes more sense, without the weird shenanigans of make.
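(Going back to the yarn story, a quick sketch of the behavior described, assuming Yarn classic (v1), which, like npm, skips devDependencies when NODE_ENV=production; double-check the docs for the Yarn version you're on.)

```bash
# In CI, if the Makefile exports NODE_ENV=production before installing...
NODE_ENV=production yarn install   # devDependencies are skipped, so build tooling is missing

# ...be explicit when you still need devDependencies for the build step:
yarn install --production=false    # install everything
yarn build                         # now tsc/webpack/etc. are available
```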
Two things. One: I recently discovered that if you want to query your CloudTrail logs on AWS, up until now, if you didn't like the standard way they present them and you obviously want to run some complex SQL queries, you'd export the entire thing to an Athena table and run them there. AWS released a new service, I think this month, January 2023, and they call it
CloudTrail Lake, right? And CloudTrail Lake is kind of a pre-built, Athena-like table with your data source; everything is ready for you. All you have to do is put the SQL query in, whereas before you had to create a database, then define a schema, you know, and build the table and so on. Anyway, CloudTrail Lake is amazing. So that's the first thing. The other thing: we all write bash scripts all day long, you do?
No way around it, you know, if you build CI processes and work with containers. Okay, we do, we do, okay. So I found something cool. You have frameworks for building CLIs: in Go, for example, you have Cobra; Python has its own thing, I don't remember the name; bash didn't have anything, and I recently found that it has a project called Bashly. Bashly is a framework for building CLIs in bash, and it's very easy: you just get a YAML, you state whatever commands you want, the flags, the arguments, maybe you have some
common library, etc., etc., and Bashly does it for you. You write bashly generate, it generates the code, you can make changes and rerun the generate command, and it's incredible. So you got me shocked there. I don't know if you noticed, this is the first time I'm googling while we talk. Okay, you know why I'm shocked? Because I built something like that, I don't know, two years ago, something called bargs.
I don't know if you remember, I think I talked with you about it. Yeah, yeah, yeah, remember? Yeah. And now I see Bashly and I'm like, wow, wow. It looks amazing, by the way. I just figured out it's an Israeli dude; we have a thing for writing this kind of stuff, you know. Yeah, yeah, I saw that just now. It looks amazing. So, Bashly, okay, yeah, it's like "Bash" and "ly", and I said we'll take a look into it. Yeah, super cool.
Okay, okay, you got me shocked over there, sorry for that, you know, for my shock phase; I was sitting there like, what, there's another framework, what's going on, you know. But, surprise, surprise. Have you used it? Yeah, I did. And just so I understand, is it something that you install on a machine, you install the CLI,
you install this? Yeah, you write the YAML, and every time you make changes to the YAML or your code, you just run bashly generate and it updates the final script. You can add things like common libraries, like utils, you can add tests, you can add all kinds of stuff, and it runs everything for you, like the checks, normalization, input validation. It basically saves your life from tons of bugs you'd have in bash, because you never run tests or validate anything.
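(A rough sketch, from memory, of what a Bashly setup looks like; the command, argument, and flag names are illustrative, and the exact schema should be checked against the Bashly docs.)

```yaml
# bashly.yml - source definition that `bashly generate` turns into a single bash script
name: mycli
help: Example CLI generated with Bashly
version: 0.1.0

commands:
  - name: deploy
    help: Deploy the application
    args:
      - name: environment
        required: true
        help: Target environment (staging or production)
    flags:
      - long: --dry-run
        help: Print what would happen without doing it
```

(Running `bashly generate` then produces the final script with argument parsing, input validation, and help text already wired in.)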
Okay, so I think the main difference from what I wrote is that what I wrote is written in pure bash, you know, so you just load my file and then you can use it. So I think what you found basically helps you ramp up to the stage where you have arguments and flags, which are very annoying to write. Exactly, and it actually brings you to the next level with the tests and
standards and everything you have around input validation. And yeah, mine is more of a wrapper script and this sounds like more of a real framework, you know, which is cool, so I'll definitely check it out. Okay, we talked for about ten minutes about Bashly. Anyway, thank you very much, see you next week. Thank you. And thank you everyone. Yeah. Bye. Bye. Bye.