The Sounds of Silence: Lessons From an API Outage with Paul Zaich - RUBY 652 | Ruby Rogues podcast

Speaker 1

00:05

Hey everybody, and welcome to another episode of the Ruby Rogues podcast. This week on our panel, we have Luke Stutters. Hello. We have Dave Kimura. Hey everyone. I'm Charles Max Wood from devchat.tv.

Speaker 2

00:17

Quick shout-out about mostvaluable.dev. Go check it out. We have a special guest this week, and that is Paul Zaich.

Speaker 3

00:25

Zaich. Well done, thank you.

Speaker 2

00:27

Now, you're here from Checkr.

Speaker 1

00:29

You gave a talk at RailsConf about how you broke stuff, or somebody broke stuff.

Speaker 2

00:35

Do you want to just kind of give us a quick intro to who you are

Speaker 1

00:37

and what you do, and then we'll dive in and talk about what broke and how you figured it out?

Speaker 3

00:42

Sure.

Speaker 4

00:43

So, I've been a software engineer for about ten years. Recently, in the last year or so, I transitioned into an engineering management role. But I've worked at a number of different small startups, and I joined Checkr in twenty seventeen, when the company was at about one hundred employees and thirty engineers. I contributed as an engineer for a couple of years to our team, and then have recently transitioned, like I said, into an engineering management role at the company.

Speaker 2

01:09

Very cool.

Speaker 1

01:10

I actually have a Checkr T-shirt in my closet that I never wear. It's spelled C-H-E-C-K-R, for those who are listening and not reading it. So why don't you kind of tee us up for this: what happened, what broke? Give us a sort of preliminary timeline and explain what Checkr does and why that matters.

Speaker 3

01:29

Sure.

Speaker 4

01:30

So Checkr was founded in twenty fourteen. Daniel and Jonathan, our founders, had worked in the on-demand space at another company and had discovered that it was very difficult to integrate background checks into their onboarding process. Background checks tend to be a very important final safety step for a lot of these companies to make sure that their platform is going to be safe and secure for their customers, and so in twenty fourteen they started an automated background check company.

02:00

And initially the biggest selling point was that Checkr abstracted away a lot of the complexity of the background check process, collecting candidate information and then executing that flow, and exposing that via an API that was developed in a Sinatra app. Three years later, in twenty seventeen, I had joined just four or five months before this particular incident happened. Fast forward to that point: we were running, I'd say, a few million checks a year for

02:32

a variety of different customers. Most of those customers use our API, like I said before, to manage that process, and they do most of the collection and interface with the candidate on their side in their own application.

Speaker 2

02:46

Oh that's interesting.

Speaker 1

02:47

Yeah, I think a lot of the background check portals that I've seen are like the fully baked portal instead of being a background service that somebody else can integrate into their own app.

Speaker 5

03:00

Did you run a background check on me?

Speaker 4

03:02

Before this episode? I did not. There are a lot of very important guidelines and stipulations governed by the Fair Credit Reporting Act that make sure that you have a permissible purpose for running a background check. So in this case, most of our customers are using the permissible purpose around employment as the reason for actually running that check.

Speaker 1

03:25

Well, that's no fun. I know, right? I want to know everybody's dirty secrets. Interesting. So, yeah, why don't you tell us a little bit about what went down with the app?

Speaker 2

03:36

Right?

Speaker 4

03:37

So, like I said, in twenty seventeen Checkr was a pretty important component of a number of customers' onboarding processes. But we'd started off small and things grew quickly. In a lot of ways, we were just trying to keep the lights on and scale the system along with our customers as they continued to grow. On-demand was growing a lot at this time as well. So in twenty seventeen, we were doing some fairly routine changes

04:06

to a data model. I wasn't directly involved with that, but we were changing something from an integer ID to a UUID, along with the references, and there were some backfills that needed to happen. And so an engineer executed a script on a Friday afternoon, which is always a great idea. They executed the script at about four thirty pm, probably went and grabbed something, had a little happy hour,

04:33

and then headed home. About an hour later, we started to receive a few different pages to completely unrelated teams that didn't really know what was going on in terms of this backfill, and it didn't look like anything too serious. It was just an elevated number of exceptions in our client application that does some of the candidate PII collection, and so that team decided to snooze it and just kind of ignore it.

Speaker 1

05:01

Yeah, so for people that aren't aware, PII is an acronym for personally identifiable information, and it is usually protected by law.

Speaker 2

05:08

Thank you anyway, go ahead.

Speaker 4

05:10

So come Saturday morning, this had been going on for about twelve hours. This exception came in again, and at that point someone on our team actually decided to escalate it and get more stakeholders involved. We had a variety of other issues going on. We had just migrated from one deployment platform to Kubernetes, and so we had some

05:32

issues getting onto the cluster. There were too many of us trying to get on at the same time, so we all ended up having to actually go into the office, onto the physical internet, to finally get in and debug the issue. So we had a couple of other confounding issues come up at the same time that made the

05:48

process of response even worse. So finally, at maybe ten or eleven o'clock in the morning, after being able to take a look at that, we identified what the issue was: we were successfully responding to only about fifty to sixty percent of requests to one of the most critical endpoints on our system, which

06:08

is to actually create requests to make a report. So after you've collected the candidate's information, you say, please execute this report so we can get that back, and that's a synchronous request that you make using our API. And that request was failing about forty to fifty percent of the time with a 404

06:26

response, which isn't really expected. So at that point we were finally able to pin down the issue, and it came back to this script. It turned out that when you went to create this report, we would look for or create these additional sub-objects called screenings, and due to the script, we had actually created an issue where validation would cause

Speaker 3

06:46

the reports to fail to create in this edge

Speaker 4

06:48

case. So there were some confounding issues with the way that we had set up the data modeling to begin with that we were trying to work around, and this exception happened.

Speaker 3

06:57

But when we

Speaker 4

06:58

finally fixed the issue, that's where we shifted more into what actually went wrong and what the real issues were that caused this outage.

Speaker 1

07:08

Gotcha. So I'm curious, as you worked through this, what did you add to your workflow to make sure that this doesn't happen again? Because, I mean, some of it's going to be technical, right? It's testing, or, you know, maybe you set up a staging environment or something like that. And some of it is going to be, hey, when this kind of alert comes up, do this thing, right? Because it sounded like you did have some early indication that this happened, right?

Speaker 4

07:31

So I think the first, most important thing was that, really from the beginning, we have had what we call a blameless culture. I think

07:39

it's a common term now in the industry. But the idea there is to really focus on learning from issues, not trying to find who made the particular mistake, and trying to look at what processes you're missing and what changes you need to make in your code base as well that would have prevented the problem from happening. So, not trying to focus on the individual mistake

Speaker 3

08:03

that was made. So as part of that, we did a

Speaker 4

08:05

post-mortem doc, and we went through and identified things like, one, we should really have a dedicated script repository that goes through a code review process. So that's one thing we implemented, and we made some safeguards to address this particular issue with the data models as well. But I think for everyone, the bigger issue was really the fact that we missed the outage

08:30

for so long. And we did actually have some monitoring in place for this particular issue that should have paged for the downtime that we were experiencing for report creation. But it turned out that our monitors were really just not set up in the most effective way to trigger for that particular type of outage. In this case it was a partial outage, and that requires a much more sensitive monitor in order for us

08:58

to detect it. Everything we had designed beforehand was much more targeted towards a complete failure of our system.

Speaker 6

09:04

And so was this something that could have been caught by automated tests?

Speaker 4

09:09

This particular issue most likely could not have been caught by an automated test, because it was so far outside of the norm of what we expected the data to look like. I mean, we had, of course, unit tests for everything that we were running, and we had request specs as well. We did not have an end-to-end environment set up, like a staging environment where you could run these tests end to end.

Speaker 3

09:36

But again, the data in this particular

Speaker 4

09:40

case was very old, and it was essentially doing a migration where that data was in a state that wasn't anywhere in our code base at this point. So I'm not sure we could have anticipated this particular issue.

Speaker 7

09:52

What was the fallout? Did everyone, like, phone up and get really angry?

Speaker 3

09:57

Oh, from a customer's perspective?

Speaker 7

09:58

Yeah, yeah. The best bit of outage stories is the human cost, whoever has to answer the phone the next week. Cue the drama.

Speaker 4

10:06

Right. Especially as your application and your service become more important to customers, the impact to customers is more and more extreme. In this case, because it was a Friday night, it wasn't something that a lot of our customers were actively monitoring on their end either. Fortunately, we were able to see that retries were happening, and many of our customers use a retry fallback mechanism, so they were able

10:35

to just allow those to run through. But this was particularly tricky because there wasn't actually a record ID for many of these particular responses. Fortunately, we do keep API logs, so we were able to see exactly which requests failed for each of our customers. We were then able to reach out to our customer success team, and they were able to start to share the impact with each

11:01

of those customers pretty quickly. I will say that we've done a lot of work to make our customer communication a lot more polished since then, and that's something that we're really focusing on now as well, just being able to give more visibility to customers sooner. And one of the most important things when it comes to monitoring is that you really want to be able to find the issue and start to investigate it yourself. You don't want a customer to

11:27

identify it first. You should really understand what's happening in your system before anyone else detects that issue.

Speaker 7

11:33

And I guess with this kind of product, where your customers are consuming your API, you're also at the mercy of their implementation. They're making a call against you, and if that call is failing, you've got to hope that their system can cope with that as well.

Speaker 4

11:55

Exactly. If some of these requests were happening in the browser, or weren't set up to automatically retry, that could have been a much worse impact on the customers.
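The retry fallback mentioned here can be sketched in a few lines of Ruby. This is only an illustration of the general pattern, not Checkr's client code or any customer's integration: `TransientApiError`, `with_retries`, the attempt limit, and the delays are all hypothetical names and values chosen for the example.

```ruby
# Hypothetical error class standing in for a failed API call
# (for example, an unexpected 404 on report creation).
class TransientApiError < StandardError; end

# Runs the block, retrying on TransientApiError up to max_attempts times.
# Waits base_delay * 2**(attempt - 1) seconds between attempts (exponential backoff).
# The sleeper is injectable so the sketch can be exercised without real waiting.
def with_retries(max_attempts: 5, base_delay: 0.5, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientApiError
    raise if attempt >= max_attempts # give up and surface the original error
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage: a fake call that fails twice, then succeeds on the third attempt.
delays = []
result = with_retries(sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise TransientApiError, "report creation failed" if attempt < 3
  "report created on attempt #{attempt}"
end
```

A server-side caller with a loop like this rides out a partial outage on its own, which matches what the API logs showed. A one-shot request from a browser has no such safety net.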

Speaker 7

12:05

Can we talk about the blameless culture for a bit? This is a new idea. When I was managing engineering teams, I used to have what I called the finger of blame. I would hold up my finger in the meeting and I'd introduce the finger as the finger of blame, and then we'd work out who the finger of blame should be pointing to. Now, more often than not, of course,

12:29

it was me. So the finger of blame was a double-edged finger. But it was a kind of way of, you know... people take it very seriously when they mess up this kind of stuff, so you kind of have to get your team back on board. So it's a way of kind of lightening the mood after that week's disaster. But a blameless culture, as you said, is a more sophisticated way of doing it, instead of pointing a jovial finger at the person who messed up. What does that

12:58

look like? I mean, you know, do you just go around telling people it's not their fault? How do you implement a blameless culture in what sounds like quite a big engineering team?

Speaker 4

13:09

I think for us it really started with our CTO and co-founder, Jonathan, making that a priority from pretty much day one. Basically from the beginning, when we've had issues or incidents, we've done a post-mortem doc, we've had a process around that, and it's always been very forward-facing, very much about what could we have done better, what can we improve, what are the things we should be doing

13:38

going forward. So I think having that first touch point and really having that emphasis from the beginning was really important, and it cascades down. I think as you're building out a bigger engineering team, it's critical to be able to keep that culture going. That's a challenge to continue doing as we grow, but we've been able to do that so far. So I think that was

14:03

step number one. I think a second piece of it is trying to understand when it's more of a process issue versus something that someone in particular did wrong. I think a lot of incidents occur because you're trying to make different prioritization decisions, and you're trying to make sure that you anticipate things in advance, or

14:26

failure scenarios, and sometimes you just miss those. Those are particular cases where I think the management team needs to really take responsibility. It's not an individual issue that caused that particular downtime, or necessarily that one piece of code. I mean, this is an example, I think, where we had some technical debt.

14:50

We were trying to clean it up, and that was a good thing, but I think we didn't necessarily have everything in place to be able to address that technical debt effectively, and that's not necessarily one engineer's responsibility to be out in front of.

Speaker 1

15:03

Yeah, one thing I just want to add is that I like the blameless culture just from the sense of, unless somebody is either malicious, which I have never ever ever encountered, or is chronically reckless, which I've also never encountered, right, everybody is usually trying to pull along in the same way. You know, if somebody has that issue, you identify it pretty fast and you usually are able

15:26

to counter it before it becomes a real problem. But yeah, just to put that together, then, the rest of it is, hey, look, we're on the same team. We're all trying to get to the same place. So let's talk about how we can do this better so that it doesn't happen again, because next time it might be me, right, that misses a critical step. And I don't want you all pointing the finger at me either. I mean, I want to learn from it, but, you know, we don't want people

15:54

walking around in fear. Instead, if somebody screws up, we want them to come forward and say, hey, I might have messed this up, before it becomes an issue next time.

Speaker 3

16:02

Absolutely.

Speaker 4

16:03

And I think one other thing to highlight here is that when you don't have a blameless culture, folks are going to be very afraid to speak out when they do see an issue. Whether they think it was their mistake or someone else's, they're not going to want to escalate that issue and make sure that it gets attention. And so one of the best side effects of having a blameless culture is that you get a really engaged response, and everyone's going to work together

16:31

to try to address the issue. I think that even cascades down to customer communication as well, because when you're really engaged in trying to do that, then you're doing the best thing for the customers as well, because you're trying to address these issues head on and not try to find ways to kind of smooth them out under the surface.

Speaker 1

16:50

Yeah, and this is important, and sometimes I think people hear this and they're going to go, that sounds a little scary. But you want people to take chances sometimes, right? You want people to kind of take a shot at making things better. That opens it up to them to do that, right? It's, oh, well, you know, I tried this tweak on the Jenkinsfile, or I tried this tweak on the Kubernetes setup, or I tried this tweak on this other thing, and a lot of

17:18

times those things pay off. But if you don't give people the freedom to go for it, a lot of times you're going to miss out on a lot of those benefits. And again, as long as they're not being reckless about it, right, so they're taking the steps, they're verifying it on their own system and things like that. Then you benefit much much more from people being willing to take a shot. So yeah, so with the blameless culture, I'm curious. So you get together and you start identifying

17:41

what the issue is. So what does that look like then, as far as figuring out what's going on? Because you're not pointing fingers, but you are looking for the commit that made the problem, right?

Speaker 4

17:53

You are. I think at the end of the day, you're going to try to find the root cause, right? You're going to look for that commit. You're going to look for the log. Maybe it was a script that was logged into your logging system. Whatever it is, you're going to

18:06

look for that and look for the root cause. So honestly, a lot of times you know what caused the issue, maybe even that it was something run by a specific person, and they probably feel a little bit of guilt there, but there's no reason

Speaker 3

18:18

to lay on more there.

Speaker 4

18:20

And I think everyone, like you said, feels a lot of responsibility around the work that they're doing already, so there's no reason to overemphasize that. So what that looks like is, typically, the team that is impacted is really going to own that post-mortem, and that's one way for them to feel like they're resolving the incident, or the issue that caused the incident. So this has definitely become a bit of a different process as the team has grown.

Speaker 3

18:48

When we were at thirty,

Speaker 4

18:50

I think it was a little bit easier just to know exactly who should work on those types of mitigations.

18:55

Typically it was pretty isolated to a specific team. As the team has grown and the system has grown, that's definitely become more of a challenge, because sometimes incidents happen because of different issues that multiple teams have introduced, or maybe there are multiple teams that need to be involved in the mitigation. In that case, we've definitely been trying to evolve our post-mortem process and the

19:17

action items. So we have a program manager, and one of her responsibilities is specifically around making sure that we are coordinating some of those efforts and meeting some SLAs. So we had to add some additional roles and coordination around the process as we've started to grow. A lot of it was just on the individual teams initially, and now, as we've grown, there's more process involved. I think that's a pretty common thing that you have to introduce as teams grow.

Speaker 7

19:48

I will say that if you've got relatives who are in the medical profession, especially if they're pathologists, even the use of the term post-mortem makes me uncomfortable, because those are no fun at all. But yeah, it's also a word that we use. It just makes me... oh, it's creepy. It's all zombies.

Speaker 2

20:11

I don't know.

Speaker 7

20:11

Yeah, the post-mortem brings me flashbacks to episodes of The X-Files in the nineties when Dana Scully was taking an alien apart.

Speaker 1

20:20

Yeah, but it does give you a little perspective too, right? Because usually in our post-mortems, we're talking about what went wrong with the system, not that somebody actually died because of this, right?

Speaker 5

20:31

I just got a weird brain, right? That's what my brain thinks.

Speaker 1

20:34

Well, some software is life-supporting, you know, a lot of the medical equipment and stuff out there. But in this case, yeah, we all want to keep our jobs as well, so it's not like we can just blow it off either. So I want to get back to the topic at hand, though, and talk a little bit about what kind of monitoring you had before and what kind of monitoring you have now in order to catch this kind of thing.

Speaker 4

21:01

So we use a number of different types of monitoring. At the time, we were pretty heavily reliant on exception tracking, and we also had some application performance monitoring as well, commonly called APM. A couple of examples of that would be something like New Relic, or Datadog has a product as well. And then we did also use a StatsD cluster that sent metrics over to Datadog, and I think we had just started using that maybe a few months

21:33

before this particular incident occurred. So, like I alluded to before, we had some monitors for this particular issue, but they were pretty simplistic. They basically just looked for a minimum threshold of the number of reports that we were creating, and we had to set that threshold

Speaker 3

21:50

to be very low, over like an hour period,

Speaker 4

21:53

because traffic is variable. You never know exactly how many reports are going to get created. There are times of day when we receive very few requests, and then there are other

22:02

times where we see large spikes. So we just had very simplistic monitoring in place for some of these key metrics at that point, and we were still very heavily reliant on, like I said, exception tracking, using bug trackers like Sentry that could then alert if you had certain thresholds of number of errors

22:23

over a period of time. In this particular case, exception tracking wasn't very useful because we were responding with a 404. There was an exception in the system,

Speaker 2

22:34

it was just

Speaker 4

22:35

an ActiveRecord::RecordNotFound, something like that, that was then handled automatically, responding with the 404. So it wasn't expected behavior, but there wasn't an unhandled exception that the tracker could have caught.
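The gap between the two styles of monitor can be made concrete with a small sketch. Everything below is an assumption for illustration: the traffic numbers are invented, and `volume_alert?` and `failure_rate_alert?` are hypothetical helpers, not Datadog monitor definitions or Checkr's actual thresholds.

```ruby
# One hour of traffic for a single endpoint. The numbers used below are invented.
Window = Struct.new(:successes, :not_found, keyword_init: true) do
  def total
    successes + not_found
  end

  def failure_rate
    total.zero? ? 0.0 : not_found.to_f / total
  end
end

# Monitor in the style described above: alert only when the number of
# created reports in the window drops below a very low floor.
def volume_alert?(window, min_created: 10)
  window.successes < min_created
end

# Alternative: alert when the share of failed requests is too high,
# but only once there is enough traffic for the ratio to mean something.
def failure_rate_alert?(window, max_failure_rate: 0.05, min_requests: 20)
  window.total >= min_requests && window.failure_rate > max_failure_rate
end

# A partial outage: 45 percent of requests come back as 404.
partial_outage = Window.new(successes: 550, not_found: 450)

volume_fired = volume_alert?(partial_outage)       # false: plenty of reports still get created
rate_fired   = failure_rate_alert?(partial_outage) # true: 0.45 is well above 0.05
```

The volume floor has to stay low because traffic is variable, so it only fires on a near-total failure. A ratio over the same window is insensitive to traffic level and still fires when roughly half of the requests fail.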

Speaker 1

22:47

Yeah, that makes sense. Somebody typed this question in; it was one of the panelists. Did you get that answered? I don't know if it was Luke or Dave.

Speaker 5

22:54

It was me. Just to be clear, was this incident a monitoring problem or an alerting problem? Because it sounds like an alert did go off at some point.

Speaker 6

23:06

Sounds like it was a people problem, because they snoozed the alert.

Speaker 4

23:09

I think this was more of a monitoring problem overall. As Dave mentioned, there was a component where

Speaker 3

23:18

a page was snoozed, but I

Speaker 4

23:21

think that was still a failure of our monitoring, because in this case that was just a signal of what the true issue was. It was a downstream client application that had had a page earlier on, and it wasn't clear at all what the issue was. And I think when you're developing a system

23:45

for alerting, you need to have clear action items. That's where custom metrics, building application metrics as you grow, become really important: having a clear signal of what went wrong so that someone knows where to investigate. In this case, it was a client application in the browser. There's a lot of noise there, and I can easily understand why someone would just

24:12

snooze something like that. In my opinion, it wasn't really a people issue in this particular case.

Speaker 6

24:18

Yeah, I think we've all been there before, where we get an alert from whatever monitoring we're doing and the error looks serious, but you kind of read it and think, oh, you know what, this is probably just a one-off situation, and then it turns out it is actually a big deal that needs to be addressed as soon as possible. So I know, I've been there before, and it's hard at times to really track this. I use Sentry for my error tracking, and so I

24:53

get email and text notifications with that. And one of the nice things about it is that it'll show the number of occurrences and whether they are unique or not, so I can see if, okay, this particular error is only coming from one user, or I can see we're getting one hundred errors that are coming from one hundred different users, so

25:16

there's a more widespread problem. So I think, you know, definitely getting the notifications, but then having proper analytics on your errors so you can actually see the scope of how big this is can really kind of weigh in on the importance.
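The distinction between occurrences and unique users is easy to show with a toy aggregation. This is a sketch of the idea only, not how Sentry computes or stores it: the `ErrorEvent` struct, the `error_scope` helper, and the error names are hypothetical.

```ruby
# Hypothetical shape of an error event: which error it was and who hit it.
ErrorEvent = Struct.new(:fingerprint, :user_id, keyword_init: true)

# For each distinct error, count total occurrences and distinct affected users.
def error_scope(events)
  events.group_by(&:fingerprint).transform_values do |group|
    { occurrences: group.size, unique_users: group.map(&:user_id).uniq.size }
  end
end

# One user hammering the same failure 100 times...
single_user = Array.new(100) { ErrorEvent.new(fingerprint: "TimeoutError", user_id: 7) }
# ...versus 100 different users each hitting a failure once.
many_users = (1..100).map { |id| ErrorEvent.new(fingerprint: "ReportCreateFailed", user_id: id) }

scope = error_scope(single_user + many_users)
scope["TimeoutError"]       # 100 occurrences from 1 unique user
scope["ReportCreateFailed"] # 100 occurrences from 100 unique users
```

Both errors have the same raw count, but the second one touches a hundred times as many users, which is the signal that separates a one-off from a widespread problem.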

Speaker 5

25:33

Yeah, makes sense. I imagine, Dave,

Speaker 7

25:36

you've been through, like me, many different monitoring platforms: Datadog, you said, New Relic. Which are the good monitoring platforms, and for which situations? Is there one where you're like, this is the platform that works really well for this API situation?

Speaker 6

25:54

I think it all depends on what you're doing. So if you have a heavy JavaScript front-end kind of deal, and you also have a lot of Ruby back-end code, I know Sentry can handle both of those situations. Other people will go with another solution. I personally found Sentry to be my flavor of choice, but, you know, mileage will vary based on what other people have.

Speaker 4

26:22

It also depends on where you are in terms of your application's use cases, what the customer profile looks like, how large the company has grown, how many people are supporting it. When you're early on, when you're building a new application or new product, by definition the developers on that are going to really understand the full

26:45

system very well. So exception tracking probably is going to be able to give you most of what you need to know in terms of understanding what's going on. As the system starts to grow, and especially as you have more teams, I think that's where things like StatsD become more useful, because you need to be able to set up specific use cases for core parts

27:10

of your application. And I would maybe say that the bar there is when you start to hit the point where you have a significant number of paying customers using specific features. Maybe you need to start to home in on one or two key processes where, if they break, it's absolutely critical that you know immediately. That's kind of the point that Checkr was at in twenty seventeen.

27:30

We really needed to have very clear intelligence and visibility into specific parts of our system, and we were trying to move in that direction when this incident happened. We've continued to invest in that area going forward. I think it's become even more important as we're getting larger, because there are just so many different systems interacting together that no one really understands the whole system at

27:55

this point. And the only way to really know how the different systems are working together, and make sure everything's working properly, is to have some of these custom metrics defined for specific key processes.

Speaker 5

28:08

Do you find that putting really large screens on the office wall helps make your application more reliable?

Speaker 3

28:15

That's a good question. We don't.

Speaker 4

28:16

We're all remote now, so at this point I haven't had a chance to experiment with that. We did have some of those in our office. I've been trying to find ways to make metrics more visible to our team as we've shifted to

28:31

one hundred percent remote due to the pandemic. There's also a challenge for our business in particular, where many of our processes are very asynchronous and could take hours to days to fully execute, so finding ways to short-circuit and know that those things are broken can be challenging at times. So one of the things we have to do is look at the data over time as well, and

28:56

not just look at real time metrics. So one thing I've been experimenting with is trying to create more automated reports that go into sort of a Slack channel that we can look at and so people can review that.

29:07

And we've also implemented basically a biweekly review during our retro where we just look at our metrics and some of the longer-running trends so that we can see whether those look correct. Is there anything that's wrong? We can talk about it and see if there are things that we want to actually take action on based on that review. So we're trying to find some ways to do check-ins that don't require us to all be in the office.

Speaker 7

29:30

The Slack channel truly is the giant performance monitor of twenty twenty. That is literally what tells me whether stuff is working at the moment, and I'm thinking a lot of people are in the same boat. So it sounds like you were saying that once you get to a certain stage, the off-the-shelf monitoring isn't really going to cut it. So you have written custom monitoring for your application.

Speaker 5

29:55

Is that correct?

Speaker 4

29:56

We have implemented what I'd consider custom metrics. We use Datadog, so a lot of this is out of the box; you can use their implementation. But you're adding some code to specific parts of your application. Maybe it's a callback on your ActiveRecord model: when something is created, you send a message to a queue, and then that triggers a message into StatsD that goes to Datadog. Anyway, it's a pretty lightweight implementation in terms of what

30:29

you can do. But you're adding specific events that you want to track, and then you can create your own monitors and alerting around those, or correlations between different events in your system. So you could potentially look at a custom metric and compare it to the HTTP statuses that are coming through or the latency of an endpoint, and then you could correlate those

30:52

two metrics as well. So there are some more advanced things you can do there if you need to. But again, it's not really a lot of custom work. It's just adding some specific points in your codebase that you feel are really important to track. And one example of this for Rails users is, I believe, there's something like this already set up in Datadog for Sidekiq.

Speaker 3

31:13

So we instrument on a lot of our.

Speaker 4

31:15

Sidekiq jobs, and we can see when the backlog is growing on one of those queues, and we can see what the average completion time is, or the p90 completion time, for different types of jobs. So you get a lot of visibility into your Sidekiq workers and processes very easily, basically for free.
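
As a rough illustration of the custom-metric idea described above, the following Ruby sketch formats a counter in the DogStatsD datagram format and sends it over UDP to a local agent. It is a minimal sketch, not Checkr's implementation: the metric name `reports.created`, the tag, and the helper names `statsd_counter` and `emit` are hypothetical, and a real application would more likely use an existing StatsD client library and call it from a callback or background job.

```ruby
require "socket"

# Format one counter datagram. The "|#key:value" suffix is the
# DogStatsD tag extension; plain StatsD servers may not accept it.
def statsd_counter(name, value = 1, tags: {})
  line = "#{name}:#{value}|c"
  line += "|#" + tags.map { |k, v| "#{k}:#{v}" }.join(",") unless tags.empty?
  line
end

# Fire-and-forget UDP send. Metrics should never break the request,
# so network errors are swallowed here. Host and port are assumptions
# (8125 is the conventional StatsD port).
def emit(datagram, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(datagram, 0, host, port)
rescue SystemCallError, SocketError
  nil
ensure
  socket&.close
end

# Hypothetical usage: record that a report was created via the API.
emit(statsd_counter("reports.created", 1, tags: { source: "api" }))
```

Once an event like this is flowing, a monitor can alert when its rate drops, which is the kind of signal a silent failure would otherwise hide.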

Speaker 6

31:33

And if you're going to use Slack for your error notification, now, I'm not dissing that at all. I have a few applications that actually do that. It just triggers a Slack notification. But if you're only capturing the error message and not a stack trace along with it, then that error message is pretty much useless, because it tells you that you have a problem somewhere in your millions of lines of code, but it's not going to tell you where.

Speaker 4

32:00

Just to be clear, we capture all of our errors in Sentry. We do have some alerting via Slack. But I would also want to emphasize that anything that truly has any chance of being a serious issue should never be just an email or a Sentry alert, or sorry, a Slack alert. You really should have some kind of escalation. Maybe it's text, maybe it's an actual incident response system like PagerDuty, where you can have an escalation policy.

Speaker 3

32:33

For us, that's what we're using.

Speaker 4

32:35

It should have this synchronous alerting that really forces someone to look at it. You can't rely on something asynchronous like Slack in this case for serious response on issues.
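
To make the idea of an escalation policy concrete, here is a small Ruby sketch of how such a policy might be modeled: a list of steps, each naming who gets paged after the alert has gone unacknowledged for a given number of minutes. The step timings, the targets, and the names `EscalationStep`, `POLICY`, and `paged_so_far` are all hypothetical, and this does not use or imitate the PagerDuty API.

```ruby
# One rung of the ladder: page `target` once the alert has been
# unacknowledged for `after_minutes`.
EscalationStep = Struct.new(:after_minutes, :target)

# Hypothetical policy; real timings depend on the team.
POLICY = [
  EscalationStep.new(0,  "primary on-call"),
  EscalationStep.new(10, "secondary on-call"),
  EscalationStep.new(20, "engineering manager"),
].freeze

# Everyone who should have been paged by now if nobody has
# acknowledged the alert.
def paged_so_far(policy, minutes_unacknowledged)
  policy
    .select { |step| step.after_minutes <= minutes_unacknowledged }
    .map(&:target)
end

puts paged_so_far(POLICY, 12).inspect
```

The acknowledgement is what separates this from a chat message: the alert keeps climbing the list until someone confirms they have seen it.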

Speaker 2

32:46

This is a little off topic.

Speaker 6

32:48

But you know what issue I found with that is I use my cell phone for everything. It's where I have my email, get my text messages, phone calls and all that stuff. So I would like to keep it on full volume late at night when I'm sleeping, so if a critical alert does arrive, then I can get notified.

33:08

But my issue is that I would never get any sleep because my phone would just go off. So I need to figure out some way that I can set up a particular phone number or something to override any kind of sleep mode or whatever that I have on my phone right now, or get a different phone for that purpose. That seems a bit overkill.

Speaker 4

33:30

You can actually do that. I believe, at least with iOS, you can set up an override where you snooze everything else, and then you just have to put whatever numbers you think you're going to receive critical notifications from in your personal contacts, and then that'll actually ring through.

Speaker 6

33:47

All right, I need to quit being lazy then and just do that.

Speaker 7

33:50

Back in twenty fifteen, I was working in the States and, due to various issues, I was still responsible, thankfully, for a bunch of servers in the UK. I'd gone to see a film and put my phone on silent, and of course all the servers melted halfway through Skyfall or whatever movie it was. Tom Cruise did not alert me of the impending server disaster while he was dealing with the aliens. So I came out and everyone was very upset. So I ended up writing custom alerting with

34:20

a custom app, using the Android Automator, that when it received a text message with the magic string in it would actually turn up the volume and then play the Beatles' "Help!" at full volume. And that worked very well. But what it didn't have, which I like on the PagerDuty system, is the acknowledgement, so you can see, you know, yeah, I've sent the message, has that person seen that message and tapped the "yes, I am aware the server is melting" button.

Speaker 1

34:55

Yeah, I've got, I think it's the bedtime settings in iOS, and I've just told it: if it's a number in my contacts, then ring, and if it's not, then don't. So it'll go off, but it'll only go off if it's in my contacts. So then I just add whoever or whatever to my contacts and I'm set.

Speaker 6

35:13

Yeah, that should work well for my use case because no one ever calls me.

Speaker 2

35:17

Yeah.

Speaker 5

35:18

Right, So that's a tragic thing to say to you.

Speaker 3

35:22

Now.

Speaker 6

35:22

I had the Verizon call filter, which actually works pretty well. It's reduced the fifteen to twenty phone calls I would get a day down to, like, one.

Speaker 2

35:32

Yeah, the iPhone has that feature.

Speaker 1

35:33

To where you can essentially tell it, don't ring unless the number's in my contacts.

Speaker 2

35:37

Yeah, I got.

Speaker 6

35:39

Burned by that pretty bad one time. My wife was over at the polls. She had forgotten her phone, or she had lost her phone there, and because that random person wasn't in my contacts, I never got her phone call.

Speaker 2

35:54

My phone just stayed silent.

Speaker 6

35:55

So I had to disable that pretty quick.

Speaker 2

35:57

That'll teach you.

Speaker 7

35:58

Can I ask you about composite monitors, because that is a phrase I haven't heard before. I'm familiar with a rate monitor, and my understanding of that is if it drops really quickly, it goes off, but if it drops slowly, it doesn't go off. But what is this composite monitor?

Speaker 4

36:16

So a composite monitor is basically a combination of several different metrics that you're measuring, chained together with AND

Speaker 3

36:26

or OR statements.

Speaker 4

36:27

So maybe referencing what I was talking about before, you might want to have a custom metric that you're looking at, and you want to look at how many of those events are coming through.

Speaker 3

36:38

And then you might also want to look at, in.

Speaker 4

36:40

This case, the error rate for HTTP statuses, maybe how many 400 errors you're getting relative to 200s. You could basically do something where you have an AND statement between those two different measures and those Boolean evaluations, or you could do something where you have an OR, so you can say, these are basically signaling the same type of issue that I want to alert on, but I'm going to look for these different conditions all in the same monitor.
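
The AND/OR combination described here can be sketched in a few lines of Ruby. This is only an illustration of the Boolean logic, not Datadog's monitor syntax: the metric names, the thresholds (fewer than 50 events, more than 10% 4xx responses relative to 2xx), and the lambda names are all hypothetical.

```ruby
# Two independent conditions over a snapshot of recent metrics.
# Thresholds are made up for illustration.
low_volume = ->(m) { m[:reports_created] < 50 }
high_4xx   = ->(m) { m[:http_4xx].to_f / [m[:http_2xx], 1].max > 0.10 }

# AND: fire only when both signals agree (fewer false alarms).
composite_and = ->(m) { low_volume.call(m) && high_4xx.call(m) }

# OR: fire when either signal points at the same kind of problem
# (more sensitive).
composite_or = ->(m) { low_volume.call(m) || high_4xx.call(m) }

# Volume looks normal, but a third as many 4xx as 2xx responses.
snapshot = { reports_created: 80, http_2xx: 900, http_4xx: 300 }

puts "AND fires: #{composite_and.call(snapshot)}"
puts "OR fires:  #{composite_or.call(snapshot)}"
```

With this snapshot the volume check alone stays quiet, so the AND monitor does not fire, while the OR monitor does because the 4xx ratio is elevated.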

Speaker 7

37:09

So you look at multiple different things at once. Is that so that you could combine those to effectively set a much lower threshold and get a higher signal-to-noise ratio? So you say something like, you know, we'll allow some number of 404s, this amount of server load, this number of other errors. But if you get all three at the same time, then it triggers something different, or does it use a lower number?

Speaker 5

37:36

What's the what's the result of that?

Speaker 7

37:39

The advantage of using that logic instead of just saying, here is the minimum number of 404s.

Speaker 5

37:44

Here's the minimum, here's

Speaker 7

37:47

the maximum number of 404s, here's the maximum number of errors. How does that actually translate into a better metric?

Speaker 3

37:55

Right?

Speaker 4

37:56

So, I think it gives you the ability to tune things, to potentially make something have a higher fidelity of when it alerts. You can actually set the thresholds higher. It depends how you want to use it,

38:10

but in this case, you could set the thresholds higher, but you could have something where it's like, well, if there aren't any errors coming through, then maybe we're okay with that even though the numbers are a little bit lower. Or you can do things where you can be more, and again, you can also tune

38:25

this to be more sensitive. In this particular incident, if we had had some error monitoring around 400s in addition to the threshold that we had, which was pretty low, I think we would have been alerted on that within maybe an hour. So you can do things there that give you more sensitivity without necessarily causing a lot more false alarms.

Speaker 3

38:47

And that's something that.

Speaker 4

38:49

You have to just be really careful with for any kind of monitor on a team: you really need to make sure that you are not creating false alarms. I'd say it's almost as important as, or equally important to, the sensitivity of the alarm, because if you're creating false alarms all the time, it's just human nature to basically start to ignore those or not really give them

39:12

the review that they need. So if you're doing that all the time, you're probably going to miss something inevitably when there's actually a real issue.

Speaker 1

39:19

Makes sense. All right, we're getting close to the end of our time. Are there any other stories or examples or lessons that you want to make sure somebody listening to this gets?

Speaker 4

39:30

I just want to emphasize that this is a growing process that I think every team should go through. It's something that is going to evolve over time, and as your product becomes more important to customers and continues to grow, you need to just be constantly reevaluating what your approach is to this. What's going to work for a brand new product, brand new startup, brand new company isn't necessarily going to be the right fit

Speaker 3

39:58

As you continue to.

Speaker 4

39:59

Grow, and it's something that you need to evaluate. And as your product starts to be something that's really a critical service for your customers or for other teams at your company, you just need to continually set the bar higher and make sure that you're continuing to grow observability across the stack.

Speaker 1

40:17

All right, well, one more thing before we go to picks, and that is, if people want to get in contact with you, how do they find you on the internet?

Speaker 4

40:23

You're welcome to reach out to me on Twitter at Kyzeitch, or you can reach out to me on LinkedIn as well.

Speaker 2

40:31

Awesome.

Speaker 1

40:31

Yeah, we'll get links to those and we'll put them in the show notes. Let's go ahead and do some picks then. Dave, do you want to start us off with the picks?

Speaker 2

40:37

Yeah?

Speaker 6

40:37

Sure. So I went to the doctor the other week and they said I had high blood pressure, which I attribute to raising kids and them stressing me out. So I got this blood pressure monitor that syncs up with my iPhone so it keeps a historical track of it. And it's been really nice, and I guess it's accurate. I don't know. It says it's high, so I guess it's doing something. It is the Withings, and it's a wireless rechargeable blood pressure monitor.

Speaker 2

41:09

Cool, Luke, how about you.

Speaker 7

41:11

That's really interesting. Is this something you wear all the time, Dave?

Speaker 6

41:17

No, it's just like the doctor's one where you roll up your sleeve, put it on your arm, and, you know, it starts to squeeze your arm. It's not like a wristwatch or anything. So I do it a couple of times a day.

Speaker 7

41:29

Blood pressure, just kidding, yeah, just checking it, just obsessing about it. I suppose it's good that it's not real time. Otherwise, that'd be even more stressful, because you'd be sitting here and it'd go off and say, yeah, blood pressure is going up. You'd get caught in the feedback loop.

Speaker 2

41:44

Cool? How about you, Luke? What are your picks?

Speaker 5

41:46

I've been fighting the code this week, Chuck.

Speaker 7

41:49

I've been building strange command-line interfaces in Ruby, and I've been using a little application which is installed by default on most Ubuntu-based systems called whiptail. This is an old-school text-style interface for when you can't put a GUI on it for various reasons, so it kind of makes it look more professional, you know, it makes

42:13

it like a real piece of software. And using this from Ruby has been a real pain because you need to do funny things with file descriptors to get the user data out. So it turns out a very nice man by the name of Felix C. Stegerman has written a gem to do it all for you. So yeah, you know, all of that work I did was totally unnecessary, and you too can build amazing old-school ASCII-looking interfaces using

42:43

the gem. It's called eft, and it's on GitHub under obfusk, and there's loads of really interesting utilities on the obfusk GitHub. If you dig in, there's interesting low-level stuff for when you want to kind of rudy yourself off on the command line, say well, well look awesome.
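
For a sense of the file-descriptor juggling mentioned above, here is a minimal Ruby sketch that calls the `whiptail` command directly using only the standard library. It does not use or reproduce the gem's API. It assumes `whiptail` is installed and the script runs in a terminal; the helper names `whiptail_inputbox_argv` and `ask` and the default box size are hypothetical.

```ruby
# Build the argv for a whiptail input box: text, height, width.
def whiptail_inputbox_argv(prompt, height: 8, width: 40)
  ["whiptail", "--inputbox", prompt, height.to_s, width.to_s]
end

# whiptail draws its UI on the terminal and writes the user's entry
# to stderr, so stderr is redirected into a pipe and read back.
# Returns nil if the user cancels or whiptail is not available.
def ask(prompt)
  reader, writer = IO.pipe
  ok = system(*whiptail_inputbox_argv(prompt), err: writer)
  writer.close
  answer = reader.read
  reader.close
  ok ? answer : nil
end

# Hypothetical usage (needs a real terminal):
#   name = ask("What is your name?")
#   puts "Hello, #{name}" if name
```

Leaving stdout attached to the terminal while capturing only stderr is the awkward part a wrapper library would hide.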

Speaker 1

43:03

All right, I'm gonna throw out a couple of picks. The first one is I'm still working on this, so keep checking in: mostvaluable.dev and summit.mostvaluable.dev. I think I've mentioned it on the show before, but I'm talking to folks out there in the community. We've talked to a number of people that you've heard of, that you know well, that you're excited to hear from. But yeah, I'm going to be interviewing them and asking them what they would do if

43:24

they woke up tomorrow as a mid-level developer and felt like they didn't quite know where to go from there. So for a lot of folks, that's where they kind of end up, right? They get to junior or mid-level developer, and then it's, okay, I'm proficient.

Speaker 2

43:38

Now what.

Speaker 1

43:39

Yeah, there are a lot of options, a lot of ways you can go. I'm hoping to have people come talk about blogging, podcasting, speaking at conferences and all the other stuff, and then just how to stay current, you know, how they keep up on what's going on out there.

Speaker 2

43:50

So I'm going to pick that.

Speaker 1

43:52

I've been playing a game on my phone just when I have a minute, and you know, I want to sink a little bit of time into it. It's called Mushroom Wars 2. It's on the iPhone. I don't know if it's on Android. Yeah, liking that. And then I'm also putting on a podcasting summit, so if you're interested in that, you can go to podcastgrowthsummit.co and we'll have all the information up there. If you listen to the Freelancers' Show,

Speaker 2

44:16

the first interview I did was with Petra Manos.

Speaker 1

44:19

She's in Australia, so I was talking to her in the evening here in the morning there, which is always fun with all the time zone stuff. But she talked about basically how to measure your growth and then how to use Google's tools not just to measure your growth, but then to figure out where to double down on it and get more traffic.

Speaker 2

44:35

So it was awesome.

Speaker 1

44:37

I'm talking to a bunch of other people that I've known for years and years in the podcasting space, and I'm super excited about it too. And I should probably throw out one more pick. So I'm gonna throw out Gmelius, that's G-M-E-L-I-U-S. And what it is is it's a tool. It's a CRM, but it also has scheduling, so like ScheduleOnce or, what's the other one. It allows you to set

45:03

up a series of emails. It'll do automatic follow-up for you and stuff like that, and so it just does a whole bunch of email automation, but it runs out of your email account, your Gmail account. That's the big nice thing about it: you don't get downgraded by SendGrid or something if your emails aren't landing.

Speaker 2

45:23

And so that's another thing that I'm just really digging. So I'm going to shout out about that, Paul, what are your picks?

Speaker 4

45:29

I really enjoyed something that was in the Ruby Weekly newsletter this last week. There's a Ruby one-liner cookbook, so it has a bunch of different one-liners. You can actually just shell out and make those calls, and it explains how you can do a lot of the things you'd do with a shell script very easily with Ruby.

Speaker 2

45:50

Awesome. Have to check that out.

Speaker 1

45:52

Sounds like a decent episode too, whether we just go through some of those and pick our favorites or whether we get whoever compiled it on. Thanks for coming, Paul. This was really helpful, and I think some folks are probably gonna either encounter this and go, yeah, I wish we were doing that, because the last time we encountered something like this it was painful, or some folks

46:10

hopefully will be proactive and go out there and set things up so that they're watching things and communicating about the way that they handle issues and the way that they avoid them in the first place.

Speaker 3

46:20

It's a pleasure, all right.

Speaker 1

46:21

We'll go ahead and wrap this up and we will be back next week. Until next time, max out, everybody.

Transcript source: Provided by creator in RSS feed
