Error Reporting and Bug Monitoring with James Smith - RRU 278 | React Round Up podcast

Speaker 1

00:06

Welcome to another episode of React Round Up. My name is TJ VanToll, and I'm flying solo here on the panel today, but that is all right because I have James Smith with me today. James, why don't you go ahead and introduce yourself? Tell everybody a bit about why you're famous.

Speaker 2

00:20

Thanks, TJ.

Speaker 3

00:21

Yeah, I'm James Smith. I'm the CEO and co-founder of a company called Bugsnag, and Bugsnag detects when software breaks. But prior to running a company and being a founder, I built software for web and mobile applications in various industries for quite a while. I like to think of myself these days as a retired software engineer, and just

Speaker 1

00:43

Enough to be dangerous. Awesome. So, bug monitoring software. I'm going to start with a softball, an easy question: why do developers need bug monitoring software? Why isn't it enough to just throw it out there and rely on, you know, user reports and QA and that sort of stuff to find these bugs?

Speaker 3

01:00

Well, a surprising number of companies that we work with still do that, and they're coming to us to rehabilitate. But I think that modern software development has changed hugely. I joke a lot about how, you know, twenty-five, thirty years ago, when you delivered software, you printed it onto a CD or floppy disks and that software was done. And these days most software is running in an environment where it can be updated and fixed and patched.

01:26

And also I think that people adopted more principles like agile and lean, where sometimes you're going to build something that's not ready, intentionally, and you're going to say, look, we're going to release this to customers early because we don't even know if customers are going to like this yet. And so the concept of keeping and working on something after it's shipped to the customers is now the default

01:46

in most companies, in most cases. So you can't have this perfect five-month QA process, print it to a gold master CD, and ship it out to Best Buy

Speaker 2

01:57

When it's done.

Speaker 3

01:58

You now have this living, breathing piece of software. So I think that's the main thing that's caused this evolution. I think that most people have now taken to squishing down that QA period and replacing it from both the left-hand side and the right-hand side. Probably almost everyone, from the left-hand side, is adding in really nice automated testing, unit testing, integration testing, linting, and things like that.

02:21

And from the right-hand side, QA is getting pushed down by production awareness and production monitoring, where things like error monitoring products are key.

Speaker 1

02:29

Yeah, it's interesting stuff. Maybe as the next thing, could you just paint me a picture? When you talk about bug monitoring or bug reporting software, what does that actually look like?

Speaker 2

02:39

Like?

Speaker 1

02:39

So suppose I'm working at a big company, I've got a giant React app. Maybe I've also got a React mobile app. What are my steps? What is my experience actually like? How do I install this? And then what sort of thing am I looking at once the deploy is out to production?

Speaker 2

02:53

Yes, it depends on the type of software you're running.

Speaker 3

02:55

But yeah, in a React application or a web app stack, for example, you want to be able to monitor runtime errors and bugs that are affecting your end users, your customers out in the wild. And in order to do that, the way that our product works is we have SDKs or libraries that you install with your package manager, so our software actually runs as part of your code, is linked in as part of your code, rather than being something that ingests log

Speaker 2

03:21

Files or anything like that.

Speaker 3

03:23

And so yeah, if you're writing a React app, you can just npm or yarn install Bugsnag. If you've got a Rails API powering your back end, you add it to your Gemfile and do bundle install.

Speaker 2

03:33

And the same is true of pretty much every single platform that we run in.

Speaker 3

03:36

So once you've installed that SDK, you set an API key in code or configuration, depending on the platform, and then Bugsnag basically sits in the background, taking up almost zero resources until we detect a problem. What counts as a problem differs on each platform. In a React app, for example, we will detect any exception that bubbles up to the window error handler in the browser, we will detect any unhandled promise rejections, and then, for React specifically, we'll look

04:05

into React error boundaries as well. So you can use a Bugsnag-provided error boundary and wrap parts of your code base in a Bugsnag wrapper, and we'll then automatically report them off to bugsnag.com and send diagnostics along with it. But the process is pretty similar on every platform, mobile, desktop, web browser, just with slight differences in the types of error that we catch.
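
As a rough illustration of the three detection paths just described (the window error handler, unhandled promise rejections, and React error boundaries), here is a minimal sketch. It is not the Bugsnag SDK: `ErrorReport`, `ListenerHost`, `installGlobalHandlers` and `reportFromBoundary` are hypothetical names, and `ListenerHost` is a stand-in for the small part of `window` the sketch needs.

```typescript
// Hypothetical names throughout: a sketch of the idea, not the Bugsnag SDK.
type DetectionSource = "window.onerror" | "unhandledrejection" | "errorBoundary";

interface ErrorReport {
  source: DetectionSource;
  message: string;
}

type ReportFn = (report: ErrorReport) => void;

// The minimal shape of `window` this sketch relies on, so that it can also
// be exercised outside a browser with a stand-in object.
interface ListenerHost {
  addEventListener(type: string, listener: (event: unknown) => void): void;
}

function messageOf(value: unknown): string {
  return value instanceof Error ? value.message : String(value);
}

function installGlobalHandlers(host: ListenerHost, report: ReportFn): void {
  // Exceptions that bubble all the way up to the window error handler.
  host.addEventListener("error", (event: unknown): void => {
    const { error, message } = event as { error?: unknown; message?: string };
    report({ source: "window.onerror", message: messageOf(error ?? message) });
  });
  // Promise rejections that nothing handled.
  host.addEventListener("unhandledrejection", (event: unknown): void => {
    const { reason } = event as { reason?: unknown };
    report({ source: "unhandledrejection", message: messageOf(reason) });
  });
}

// A React error boundary could call this from its componentDidCatch method.
function reportFromBoundary(error: unknown, report: ReportFn): void {
  report({ source: "errorBoundary", message: messageOf(error) });
}
```

In a browser the host would be `window`; keeping it as a parameter is only a convenience for trying the sketch elsewhere.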

Speaker 1

04:26

Yeah, it's interesting. I know we were talking before that the last time I used some sort of error reporting software was quite a few years ago, and I remember the first time I did it, I was absolutely astonished at what it was spitting out. Because I think when you work on a big piece of software, you know there are some bugs out there, right? You've got some Jira tickets that have been open for a while. You're like, yeah, yeah, yeah, we'll get

04:48

to that. That's sort of a hard problem to solve. But I remember actually putting this stuff in, and you get stuff where you had absolutely no clue what it meant. And I guess one thing I'm curious about, because my one problem with using it, and this was years ago when these things were probably a lot less refined, was that it got hard to make sense of all the errors, because there was just this huge mess of errors. So

05:13

I'm curious about the sort of things you do. I imagine you have some sort of aggregation algorithms that try to make sense of, okay, these bugs are all the same. Or do you try to help developers get at the root cause, like maybe this is browser specific, or that sort of thing?

Speaker 3

05:28

Yeah, it's funny. We've been chucking out log files for ages, and a lot of the time you don't think actively to look at log files and you just go in there when you absolutely have to, and they're generating gigabytes and gigabytes of data that maybe you never look at.

Speaker 2

05:44

And then there was this leap from reactive.

Speaker 3

05:46

Error monitoring to proactive error monitoring, and there were really early players in the space. There was a product called Hoptoad, which rebranded to something else, which was super early in the Rails space, and people were like, wow, this is really cool, but holy moly, do I have a lot of stuff coming in. And I think the leap that products like Hoptoad initially made, and that we've kind of refined over the years, is aggregation, number one.

06:08

As you say, first off, can we say that, look, we've had ten thousand bugs, ten thousand exceptions or crashes, but actually all of these ten thousand exceptions or crashes came from the exact same bug, the same underlying line of code? And so at the most basic level,

Speaker 2

06:23

We have these grouping algorithms.

Speaker 3

06:24

We've worked on grouping algorithms that look at the line of code where the bug originated, and it differs depending on the platform. We can be a lot more sophisticated in some areas, where we'll look at how similar the code is: we'll take a snapshot of, like, seven lines before and after where the crash happened and look at code similarity heuristics. And in other platforms we keep it

06:43

very simple. We don't have to go to that level of complexity; we'll say this was this type of exception, so a runtime error on line fifty-nine of User.java, and that's enough for us to say with pretty high confidence that this is a unique bug in this version, for example. Yeah, getting that aggregation and grouping in place is step one. But even then, you said earlier, how do people move from customer reports and customer feedback

07:09

to having a proactive system like this? Well, we have a T-shirt with this on it: not all bugs are created equal. And if you just went through this list and said, I'm going to fix every single bug that my Bugsnag tool is reporting to me, then you're going to waste your life away. You're going to be spending time on stuff that really doesn't matter, especially on the client side, especially when it comes to

07:29

JavaScript and React applications, because it's the Wild West. You've got browsers all over the place. You've got Chrome extensions causing problems and injecting content into the DOM.
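
The simple end of the grouping step described above (same exception type on the same line of the same file counts as one bug) can be sketched as follows. The key format and the `CrashEvent` and `ErrorGroup` shapes are assumptions made for illustration, not Bugsnag's actual algorithm, which the speaker says also uses code similarity heuristics on some platforms.

```typescript
// Illustrative shapes only; field names are assumptions.
interface CrashEvent {
  errorClass: string; // e.g. "RuntimeException"
  file: string;       // e.g. "User.java"
  line: number;       // line of the top stack frame
  userId: string;
}

interface ErrorGroup {
  key: string;
  count: number;
  users: Set<string>;
}

// Hypothetical grouping key: error class plus the originating file and line.
function groupingKey(event: CrashEvent): string {
  return `${event.errorClass}@${event.file}:${event.line}`;
}

// Collapses many raw events into one group per underlying bug.
function groupEvents(events: CrashEvent[]): Map<string, ErrorGroup> {
  const groups = new Map<string, ErrorGroup>();
  for (const event of events) {
    const key = groupingKey(event);
    let group = groups.get(key);
    if (group === undefined) {
      group = { key, count: 0, users: new Set<string>() };
      groups.set(key, group);
    }
    group.count += 1;
    group.users.add(event.userId);
  }
  return groups;
}
```

Tracking the set of affected users per group is what makes the later prioritization questions (how many customers did this hit?) cheap to answer.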

07:39

And so the next layer on top of that aggregation is sophisticated prioritization tools, so figuring out things like: which one affected the most customers; which one affected the customers that are paying us the most money, if you're going to keep it straightforward; or which one affected customers that are in key states or key flows, like a login

07:56

or sign-up flow, for example. And so we try to capture as much information as we can at runtime and then allow you to create filters, bookmarks and prioritization rules inside of Bugsnag.
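
One way to picture the prioritization layer is as a scoring function over grouped errors, using the signals just mentioned: customers affected, paying customers affected, and whether the error hit a key flow. The weights and the `GroupStats` shape below are purely illustrative assumptions, the kind of thing a team would tune for itself.

```typescript
// Illustrative shape; field names and weights are assumptions.
interface GroupStats {
  key: string;
  usersAffected: number;
  payingUsersAffected: number;
  inKeyFlow: boolean; // e.g. the error happened during login or sign-up
}

// Higher score means "look at this first".
function priorityScore(group: GroupStats): number {
  const keyFlowBonus = group.inKeyFlow ? 100 : 0;
  return group.usersAffected + 5 * group.payingUsersAffected + keyFlowBonus;
}

// Returns a new array sorted from most to least urgent.
function prioritize(groups: GroupStats[]): GroupStats[] {
  return groups.slice().sort((a, b) => priorityScore(b) - priorityScore(a));
}
```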

Speaker 1

08:07

It's funny, I didn't even really think about that, but you're right, because someone could be just using some garbage Chrome extension, or maybe they're even developing their own Chrome extension that's just, you know, totally screwing things up. And if you try to debug that, my god, you're going to be spending days and weeks. Can you even know, for example, is there any way you can

08:29

tell how badly it affected the user? Is there a way of knowing that this is just an error but it didn't actually affect the user experience, versus this is actually, I don't know, forcing the UI to be unresponsive or maybe even crashing the tab or something? Are there ways that you can even tell at that level of detail?

Speaker 3

08:46

The simplest way to do that is to look at what we call the error handler. I kind of mentioned this. Let's use JavaScript as an example, or React as an example here. There are various ways that we automatically detect that a bug has happened, and in some of them, for example, we wrap event handlers,

09:02

so we wrap the callbacks to the event handlers. So if an exception happens in an event handler, that doesn't necessarily always bubble up to your window.onerror. It might just mean that your click failed to do what you expected your click to do because the callback crashed halfway through. That is almost always less bad than

09:19

a bug that bubbles up. It's still bad, but it's less bad than a bug that bubbles all the way up to window.onerror, which basically means no JavaScript is executing in that script tag anymore. And especially because most people are using bundlers these days and bundling all their JavaScript up into application.js, if your JavaScript stops executing in that script tag, you're boned.

Speaker 2

09:38

That's it.

Speaker 3

09:39

The whole page stops responding. There are other things as well, like a promise rejection handler. Again, if it's in a promise, it happened asynchronously, so it's probably not the

Speaker 2

09:48

End of the world.

Speaker 3

09:49

So that's the most straightforward way to look at it, and to say, look, if it was a click event, that's bad, but not as bad as if the entire page locked up. In terms of the performance aspects, though, they're a lot more subtle. We have small code snippets to detect certain things like freezes, and my favorite one is our frustration detection snippet, which detects rage clicks.
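
The idea of wrapping event handler callbacks, so that a report records which handler caught the error, might look roughly like this. `wrapCallback`, `TaggedError` and `pageLikelyBroken` are hypothetical names, and rethrowing after reporting is a choice made in this sketch to leave the application's behavior unchanged; it is not a statement about what Bugsnag's SDK does.

```typescript
// Which mechanism caught the error; used as a rough severity signal.
type CaughtBy = "eventHandler" | "unhandledRejection" | "windowOnError";

interface TaggedError {
  caughtBy: CaughtBy;
  error: unknown;
}

// Wraps a callback so that an exception inside it is reported with a tag
// saying it came from an event handler, then rethrown unchanged.
function wrapCallback<Args extends unknown[]>(
  callback: (...args: Args) => void,
  report: (tagged: TaggedError) => void
): (...args: Args) => void {
  return (...args: Args): void => {
    try {
      callback(...args);
    } catch (error) {
      report({ caughtBy: "eventHandler", error });
      throw error;
    }
  };
}

// Following the discussion: an error that reached window.onerror is treated
// as the case where the page may have stopped responding.
function pageLikelyBroken(tagged: TaggedError): boolean {
  return tagged.caughtBy === "windowOnError";
}
```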

Speaker 2

10:10

So, as I said earlier, if you've got code that

Speaker 3

10:13

Made an onclick handler fail, fine, but how can you detect if your developers forgot to hook up onClick at all to a button? So you've just got a button that looks like it's clickable, but there's literally nothing. And so we've got some snippets you can drop in that will detect things like when you click on the same DOM element multiple times within a particular time window, and then it will send a message to Bugsnag saying someone's rage clicking this button.
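
A rage-click detector of the kind described (the same DOM element clicked several times within a time window) can be sketched in a few lines. This is not Bugsnag's snippet: the class name, the default threshold of three clicks and the one-second window are all assumptions, and elements are identified here by a plain string id.

```typescript
class RageClickDetector {
  private readonly clicksByElement = new Map<string, number[]>();
  private readonly threshold: number;
  private readonly windowMs: number;

  // Defaults are illustrative; both values are meant to be tuned.
  constructor(threshold: number = 3, windowMs: number = 1000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one click and returns true when it completes a burst of
  // `threshold` clicks on the same element within `windowMs` milliseconds.
  recordClick(elementId: string, timestampMs: number): boolean {
    const previous = this.clicksByElement.get(elementId) ?? [];
    const recent = previous.filter((t) => timestampMs - t < this.windowMs);
    recent.push(timestampMs);
    if (recent.length >= this.threshold) {
      // Reset so that a single burst is reported only once.
      this.clicksByElement.delete(elementId);
      return true;
    }
    this.clicksByElement.set(elementId, recent);
    return false;
  }
}
```

In a page, a click listener would call `recordClick` with some stable identifier for the clicked element and the event timestamp, and send a report when it returns true.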

Speaker 2

10:37

And so things like.

Speaker 3

10:37

That are still as frustrating, maybe more frustrating, than a full page freeze, but it's kind of up to the developer to decide, yeah, this is the one that's causing customers the most pain.

Speaker 1

10:48

Yeah, because you said snippet, so is your model basically that there's some default handling, and then there are extra things you can add on that you may not want to give everybody because, I'm assuming, it works by attaching to the event handler, so there's some small performance hit, so you might not want to go nuts with it, sort of thing?

Speaker 2

11:04

Yeah, it's more.

Speaker 3

11:05

It's more that we have an opinion that our product is opinionated yet extensible, and so pretty much all of our SDKs are plugin based. I mean, for our JavaScript one, we just released a new version two days ago, and the whole thing's built around plugins, even internally. So things start off as snippets sometimes and then graduate into official plugins that we put in as defaults inside of the application.

Speaker 2

11:29

Some of them, like the rage click.

Speaker 3

11:30

One, they're more interesting than actionable in a lot of cases. If we ever evolve that to be one of these ones where we are confident that something is going wrong based on these rage clicks, then we'll put it in as a default plugin inside of our JavaScript SDK. But actually it's one of those things that people want to tune. How many clicks is a rage click? What time period should I measure? And all

11:51

that kind of stuff. So the stuff that's on by default, the opinionated stuff, is what we think are the most important negative signals inside of your application, typically. But yeah, it's all plugin based, and we try to expose as much as possible in terms of API so that you can hook in your own plugins and

12:09

do your own stuff. Like I said earlier about reporting handled versus unhandled exceptions, sometimes you've got your try/catch, and my favorite piece of code to read ever is when it says try, catch, and all there is in the catch is a comment that says "should never get here", and inevitably it's going to get there.

Speaker 2

12:27

So what a lot of our customers do,

Speaker 3

12:29

Most of our customers, what they'll do is put a Bugsnag.notify(e), or error, whatever it is, in there, and so that way you know if it's got there, and then you can decide if that's a problem or not. But yeah, we try to be opinionated yet extensible as our product philosophy.
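
The "should never get here" catch block, with a notification added, might look like the sketch below. The speaker's call is `Bugsnag.notify(e)`; to keep the example self-contained, a hypothetical `Notifier` interface stands in for the SDK client, and `parseSettings` is an invented example function.

```typescript
// Stand-in for an error monitoring client; in the Bugsnag JavaScript SDK
// the corresponding call would be Bugsnag.notify(error).
interface Notifier {
  notify(error: unknown): void;
}

// Hypothetical example: parse stored settings, falling back to defaults.
function parseSettings(raw: string, notifier: Notifier): Record<string, unknown> {
  try {
    return JSON.parse(raw) as Record<string, unknown>;
  } catch (e) {
    // "Should never get here", so report it rather than leave only a comment.
    notifier.notify(e);
    return {};
  }
}
```

The code still recovers with an empty settings object, but the team now finds out how often the supposedly impossible branch actually runs.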

Speaker 1

12:42

I kind of like that for the catch block, because I've totally been that person where you go into the catch and you think to yourself, I don't even know how in the world this would happen, but I feel like I can't just leave this empty, right, so I have to put something in it.

Speaker 2

12:54

It will, it will happen.

Speaker 3

12:57

The ones that crack me up all the time, and I think this is because I'm getting to be a crusty old programmer now, are try/catch blocks where the catch block has just a comment and nothing else in it, and then switch/case statements where the default case says "should never get here".

Speaker 2

13:12

It's just like cool, let's make sure it doesn't.

Speaker 1

13:15

What's funny too, because for part of my life I did Java code, and Java had, I think it

13:20

was like assert false or something like that. There's some way that you could put in your code, almost at the compiler level, that if this code ever executed, it would have a way of informing you, right? Whereas in JavaScript, outside of some tool like you're describing, there's no built-in way of doing that, right? There's no way of saying, hey, just let me know if this code ever runs. JavaScript will just merrily go ahead, ignore that comment and go

13:43

right on its way, and who knows what's going to happen.
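
For the switch statement whose default case "should never get here", TypeScript can cover part of this at compile time, and a notification can cover the rest at runtime. The sketch below is one possible pattern, not something described in the episode: `Plan` and `seatLimit` are invented names, and `notify` stands in for whatever error reporter is in use.

```typescript
type Plan = "free" | "pro";

// Returns the seat limit for a plan, reporting any value the type system
// did not anticipate instead of silently carrying on.
function seatLimit(plan: Plan, notify: (message: string) => void): number {
  switch (plan) {
    case "free":
      return 1;
    case "pro":
      return 10;
    default: {
      // Compile time: if a new Plan is added without a case above, this
      // assignment stops type-checking.
      const unexpected: never = plan;
      // Runtime: data from outside the program can still end up here.
      notify(`seatLimit: unexpected plan ${String(unexpected)}`);
      return 1;
    }
  }
}
```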

Speaker 3

13:45

Well, I've seen people throw their own custom exceptions in those cases. But would you rather kill JavaScript execution completely and completely screw your app up if it ends up in that case, or have it run and then know about it, because maybe it was okay that it hit that case? And I think that in JavaScript land, and client-side land in particular, you don't have the luxury of being in a controlled environment. You can't just

14:09

open the log file. The logs are on someone else's machine, and that environment is completely out of your control. So yeah, like you said, when you add these solutions in, sometimes you're surprised at how many bugs appear. I mean, maybe that's because a lot of people don't think about it during development, and then when they do turn something on eventually, they're like, oh, look at all these edge cases that happened to loads of customers.

Speaker 1

14:31

Yeah, it's actually in a way a testament to JavaScript, because when I think back to Java and saying, well, the code would completely stop if this happened, in a way that's kind of a bad thing too, right? Because if a customer hits it, then the app just says, oh, compiler, you know, or runtime error, and just totally dies.

14:49

It's kind of nice in JavaScript that some of these errors can exist and things are mostly okay, right? You still want to know about it, but the user might still be able to do the task they were trying to do, so you don't necessarily want to just completely crash in these situations. So I kind of like the notification approach.

Speaker 3

15:07

It's good and bad. You end up with a world like, you know, it's almost a dirty word these days.

Speaker 2

15:13

Actually, it's having a bit of a renaissance.

Speaker 3

15:14

But in PHP, in original PHP, not cool new PHP, you could have an error and then the code would keep executing. It would say, oh, we had an error, okay, let's just keep going. Let's keep going. So you end up sometimes having, you know, twenty compounding errors on a page because this one variable wasn't initialized, and then it just kept on trying all the way down the page. And I think that we found pretty quickly that resilience

15:36

was not helpful in that case. You ended up getting into worse and worse situations as the code kept on trying to execute down the page. I think the trade-off is, you know, it's okay to have a click handler fail in some particular case while the rest of the app continues to work. But I think that it's much harder to diagnose and debug and reproduce problems, and so you end up with, you know, you're getting a report from a customer

16:00

saying I got into this state. Then how the heck is the developer or the support person going to reproduce that to get back into that state. That's the hard part with allowing code to continue to execute.

Speaker 1

16:10

For sure. I want to turn the conversation here in a second over to mobile, because I know that

16:14

opens a whole other can of worms. But I think I have one last question on the web side. Since you're sort of the aggregator of bugs, in a sense, are there any really common things that you find people do, or things that your average developer should be aware of, common mistakes that people overlook, to just look out for and be cognizant of?

Speaker 3

16:36

Yeah, and it sounds so obvious, but it's just by far and away the highest order of magnitude type of bug

Speaker 2

16:42

That we see, and that is uninitialized variables.

Speaker 3

16:45

Still, in 2020, null pointer exceptions and uninitialized variables are the number one cause of bugs. And it's no surprise that languages like Swift try to come in and say, right, let's force things not to be null or uninitialized at the compile level. The other thing I think we see all the time, in JavaScript especially, and again no surprises here, is type errors and problems caused by unclear

17:09

typing or coercion of types. And so a lot of these things, I think, can be solved by having really nice linting in place or using a typed variant of JavaScript. All of the new code in Bugsnag's React app is now in TypeScript, and we have a ton of linting in place. I think we use Airbnb's ESLint rules off the bat. We're trying to keep things very tight before they even

Speaker 2

17:33

Get merged in a PR, because we see all these problems that come up.

Speaker 3

17:37

But yeah, null pointer exceptions, uninitialized variables, type errors: still, in 2020, the biggest problems.
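
As a small illustration of how a typed variant of JavaScript pushes the null or uninitialized case to compile time, consider the sketch below. It assumes TypeScript's `strictNullChecks` option is enabled; `Profile`, `findProfile` and `greeting` are invented for the example.

```typescript
interface Profile {
  displayName: string;
}

// A lookup that can legitimately come back empty.
function findProfile(profiles: Map<string, Profile>, id: string): Profile | undefined {
  return profiles.get(id);
}

function greeting(profiles: Map<string, Profile>, id: string): string {
  const profile = findProfile(profiles, id);
  // With strictNullChecks on, reading profile.displayName without this
  // check is a compile-time error rather than a runtime TypeError.
  if (profile === undefined) {
    return "Hello, guest";
  }
  return `Hello, ${profile.displayName}`;
}
```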

Speaker 1

17:43

Yeah, it's funny. It's amazing how simple linting tools can catch so many of these things. I'm curious, when you say uninitialized variable, what the specific scenario is. So I declare a variable, I don't know, x, right, that I'm going to use. How is this an error that's not caught by the developer during testing? Is it a different scenario, like some if check or something, where there's some case that they're not accounting for?

Speaker 3

18:08

It's almost always... One of the things is, we see all these bugs coming in, but we can't see the full source code of our customers. You know, it's a sensitive area. We don't want them to have to give us access to that, so

18:21

we keep it isolated. But what we see in our code, and what I've seen in my career at least, is, yeah, when the developer has overconfidence in the order or structure of code execution. And so you're like, well, it's going to go off into this function over here, it's going to fill in all this data, and then we'll run

Speaker 2

18:37

The next thing.

Speaker 3

18:38

But there are about fifty ways that that function that's meant to fill in all these variables could fail. And I mean, this is, I think, a really straightforward one. I wrote a blog post about it years ago. One of the most common bugs that we've seen in JavaScript land, for legacy applications, is "jQuery is not defined". And "jQuery is not defined" is a bug because most people historically would pull in jQuery from a CDN and then they would run their code afterwards, so they would

19:04

expect jQuery or the dollar symbol to be defined. But because of the way that JavaScript engines run, if one script tag fails, the next script tag will continue to run and try to run. So if your next script tag relies on there being a dollar symbol or jQuery defined, but it's not, you're kind of boned. And so if you're using some kind of module system or bundle system that has interdependencies, that can

19:28

be the case as well. But it's true of any code that expects something else to be available and to be set up. If that fails, you're out of luck. So yeah, it's really just being overconfident about code paths running and not failing.
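
One defensive pattern against the "jQuery is not defined" class of bug is to check that a dependency loaded by an earlier script actually exists before using it. The helper below is a hypothetical sketch (`withGlobal` is not a real library function), and it takes the scope as a parameter so that it can be tried outside a browser; in a page the scope would be `window`.

```typescript
// Runs `use` with the named global if it exists, otherwise `onMissing`.
function withGlobal<T, R>(
  scope: Record<string, unknown>,
  name: string,
  use: (dependency: T) => R,
  onMissing: (name: string) => R
): R {
  const dependency = scope[name];
  if (dependency === undefined || dependency === null) {
    // The script that should have defined it probably failed to load.
    return onMissing(name);
  }
  return use(dependency as T);
}
```

The `onMissing` branch is a natural place to report the failure and degrade gracefully instead of letting the rest of the script crash on its first use of the dependency.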

Speaker 1

19:40

Who would do that, though? Luckily I'm not guilty of that, so we're good to go. So I do want to get into mobile, because this seems like even hairier territory. I imagine, from the web perspective, your code is still going to run for the mobile web, with a very similar sort of workflow and such, but you work with React Native as well, is that correct?

Speaker 2

20:03

That's right. Yeah, it's mobile.

Speaker 3

20:04

I talk about the client side being the Wild West, and people who are JavaScript developers for the browser know that the browser environment differences are complicated. But if you think it's a pain in the bum developing for three or four different major browsers, there are twenty to forty thousand different Android devices out there in the wild, and every single manufacturer of Android devices, LG, Samsung, whoever, puts their own little flavor and

20:33

spice on Android. And what I mean by that is they can do things such as actually edit the core operating system of Android. There were really crazy bugs back in the day on Android, when it was a bit more uncontrolled, before Google stepped in and said, hey, stop messing with this. I forget who it was now, so I don't want to shame the wrong vendor, but a manufacturer of Android devices edited JSON parsing code in core

20:57

Android to do something different. So if you're an Android developer expecting JSON to be parsed and handled in a particular way, it would work absolutely fine on every vendor apart from this one, and then your code wouldn't work. And if you didn't have that in your test farm of devices, or you didn't have something like Bugsnag in production, you wouldn't know about it, apart from someone saying, hey, I've got an LG phone, or whatever it is, and it's not working.

21:22

Your app isn't working. And so yeah, it's not as easy as spinning up a couple of VMs. You can't have all twenty to forty thousand Android devices out at your desk from all these different vendors. So mobile is hairy, and the more hairy the environment is, the harder it is to build really good, high-quality SDKs for these platforms. But yeah, we do support React Native, and we have to deal with that, plus all the other layers of React Native.

Speaker 1

21:45

And what does the actual high-level implementation look like? Because on the web, the way I imagine it, and this is over-trivializing, it's like a window.onerror handler and then a load of logic around that. In native, does React Native provide hooks into this, or do you have native code that gets into this and finds all the errors? How does that work?

Speaker 3

22:04

The latter. React Native is one of these environments where I actually think that when React Native first came out, people were like, oh great, this is write once, deploy in multiple places, this is going to be awesome. I mean, that's kind of what Expo is for these days. But I think in reality a lot of people are using React Native to do retrofit work.

22:25

They're taking existing native applications and they're replacing some chunks of it, or some components of it, with React Native. Now, because there is JavaScript code and there is iOS and Android code running inside of most of these applications, we need to make sure that we catch bugs in every single layer. So it's a layer cake. At the top, you've got the JavaScript runtime, and even there you've got different types of JavaScript runtime running, because you've got JavaScriptCore versus whatever else is

Speaker 2

22:53

Being distributed, So you've got that JS runtime difference.

Speaker 3

22:57

Then you've got the operating system layer differences, iOS or Android, and we have to capture Objective-C errors and Swift errors and all sorts of stuff with iOS, and then

Speaker 2

23:07

JVM errors on the Android side.

Speaker 3

23:10

And then one layer down you've got things like NDK errors, C and C++ errors, happening in Android, and then the equivalent happening on iOS.

Speaker 2

23:18

So there are tons of layers that all have to communicate.

Speaker 3

23:21

Recently, and I actually don't know how recently it was, React Native started supporting React error handlers, which is great. Error boundaries, sorry. And so that's something that Bugsnag has always supported, and now you can use those in React Native.

Speaker 2

23:33

But effectively, what we have to do is we have to.

Speaker 3

23:35

Be able to capture bugs at every single layer and reliably report them. And sometimes we have to be able to report bugs before the React native engine has even initialized, because there might be some Android code that ran before the React native code initialized.

Speaker 2

23:49

So yeah, we just recently.

Speaker 3

23:51

Released a new major version of our React Native notifier and put it all under one monorepo, bugsnag-js, to make sure that this initialization logic is buttoned up.

Speaker 1

24:01

Yeah, that's crazy. I have some background in NativeScript, so similar technology to React Native, JavaScript running on mobile, and I remember one thing we struggled with was getting really good JavaScript stack traces. Are you able, when you catch the error in native land, to give people the, like, this is the line of

24:22

code where there was a problem? Is that something React Native exposes for you, or do you have to do some magic to try to access that?

Speaker 3

24:31

There's a lot of magic in every layer. The stack traces are almost certainly obfuscated in some way, either intentionally or unintentionally. In the JavaScript layer, we rely on the source maps standard, which isn't as standard as you'd think. It's implemented wildly differently on each platform. But what we do is, if we have an obfuscated JavaScript stack trace, we're not going to give that to the developer, because they're going to be like, what the heck

Speaker 2

24:56

Is, you know, function xy2? That's not what I wrote.

Speaker 3

25:01

We want to show people the line of code that they wrote rather than what it ended up being obfuscated into. So we automatically ingest and apply source maps to the JavaScript layer so that we can present the original stack trace to the developer.
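
Conceptually, applying a source map to a stack frame is a lookup from a position in the minified bundle back to a position in the original source. The sketch below uses an already-decoded, simplified mapping table; real source maps encode this data compactly and carry more fields, and `PositionMapping` and `originalPositionFor` are names invented for this illustration.

```typescript
// One decoded mapping entry: a position in the generated (minified) file
// and the original position it came from. A simplified, assumed shape.
interface PositionMapping {
  generatedLine: number;
  generatedColumn: number;
  originalFile: string;
  originalLine: number;
}

// Finds the closest mapping that starts at or before the given column on
// the same generated line, which is how a frame is usually resolved.
function originalPositionFor(
  mappings: PositionMapping[],
  line: number,
  column: number
): PositionMapping | undefined {
  let best: PositionMapping | undefined;
  for (const mapping of mappings) {
    if (mapping.generatedLine !== line || mapping.generatedColumn > column) {
      continue;
    }
    if (best === undefined || mapping.generatedColumn > best.generatedColumn) {
      best = mapping;
    }
  }
  return best;
}
```

A linear scan keeps the sketch short; a production implementation would typically sort the mappings and use a binary search.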

Speaker 2

25:13

But like as I.

Speaker 3

25:14

Said, because it's a multi-layer system, we also have to do that in iOS and Android. On Android, a lot of people use something called ProGuard, and ProGuard obfuscates the Java stack traces. We then have to reapply the mapping to get the original stack traces back, and then the same is true on

Speaker 2

25:28

iOS.

Speaker 3

25:29

iOS is almost even worse because, and I'm not trying to offend iOS developers here, it's like C, it's low level, and so if a crash happens, what we get is a memory address. It looks more like a classic core dump, and so we have to take that memory address and then apply something called a dSYM file to it to produce an original stack trace.
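
Turning a raw crash address back into a function name can be pictured as: subtract the address the binary was loaded at, then find the symbol whose range contains the resulting offset. The sketch below uses a made-up, flattened symbol table; a real dSYM bundle holds much richer debug information, and `SymbolEntry` and `symbolicate` are names invented for this example.

```typescript
// Simplified, assumed symbol table entry: where a function starts within
// the binary, how long it is, and its name.
interface SymbolEntry {
  startOffset: number;
  size: number;
  name: string;
}

// Resolves a runtime address to "symbol + byte offset", or undefined when
// no symbol covers it.
function symbolicate(
  address: number,
  loadAddress: number,
  table: SymbolEntry[]
): string | undefined {
  const offset = address - loadAddress;
  for (const symbol of table) {
    if (offset >= symbol.startOffset && offset < symbol.startOffset + symbol.size) {
      return `${symbol.name} + ${offset - symbol.startOffset}`;
    }
  }
  return undefined;
}
```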

Speaker 2

25:51

So yeah, magic at all layers of the stack.

Speaker 1

25:53

Yeah, I didn't even think about the ProGuard thing. Actually, I've run across this in my NativeScript experience as well, because a lot of people don't think about the fact that when you think of iOS and Android apps, you think the source code is compiled and obfuscated and

26:07

you can't just download and use it. But since with React Native you're running JavaScript code, if you don't take any additional steps, your JavaScript code is just hanging out there, right in your bundle. And a lot of people, especially, I don't know, people dealing with sensitive work or company data, don't want to expose that, so they do some additional obfuscation on top of what you'd normally do on the web. So you end up

26:29

with some absolutely garbled nonsense. I'm actually pretty impressed that you're able to sort of undo that, because that's like reverse reverse engineering, in a sense, to get at the parts you're interested in.

Speaker 3

26:42

I'm glad that there are some standards here, and we've been kind of evolving the source map standard as well a little bit.

Speaker 2

26:48

But I'm glad there are some standards here.

Speaker 3

26:50

Because there's a bit of reverse engineering required, but mostly we're just trying to follow the rules and say, right, let's pick this back apart. But yeah, without the aggregation and grouping, and without the deobfuscation work we do with source maps, I think that a product like Bugsnag would be a lot less.

Speaker 1

27:08

Valuable for sure. The other thing I want to get into is, I know one of the key things you do is help protect against, like I think you say, erratic SDKs, right, or other SDKs you use. And I know you were telling me a story of Facebook, their SDK sort of going down. So why don't you share, I guess, what I'm talking about in terms of third-party SDKs in a React Native world and sort of what you can do about that?

Speaker 3

27:33

Well, I was joking about "jQuery is not defined" earlier and things like that, but you know, with modern software, you

27:39

don't write the whole thing yourself from scratch. You're relying on other people's open source packages and SDKs, and for anyone who is a React Native or iOS developer that has used the Facebook SDK recently, you'll be very familiar with the fact that there were two outages within two and a half months on the Facebook SDK that caused iOS applications to crash at boot if they were using Facebook's

28:04

authentication platform. And this is super frustrating because so first off, I like to say, I don't want to anger the ops gods, and so you know, Facebook had this issue.

Speaker 2

28:14

You know, it sucks, but like you know, I give them a break a little bit.

Speaker 3

28:18

It's a tricky one to deal with, but in reality it's a really difficult one for developers to deal with as well. If you're, say, Spotify, and you use the Facebook SDK to allow people to authenticate, then one day, without any code changes happening at all on your side, suddenly your app stops working and you get a ton of Bugsnags, or whatever you're using for error monitoring, coming in. And what happened was, in this particular case, Facebook's SDK reaches out to Facebook's API to say, hey, tell me

28:45

information about how I should initialize. Facebook's API responded differently to how the SDK was expecting: instead of giving back a structure, a dictionary, it came back with a boolean. And so the code that was reading that JSON payload basically just went

Speaker 2

29:03

round the bend, just like, what do I do here?

Speaker 3

29:06

So it kind of sucks because normally developers think about bugs that are introduced as part of a code change, but in this case, it wasn't a code change. The data changed, and it was data that wasn't even part of my application.
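To make the failure mode concrete, here is a minimal sketch of reading a remote settings payload defensively, so that a boolean arriving where a dictionary was expected falls back to defaults instead of crashing. The `SdkSettings` shape, the `loginEnabled` field, and `parseSdkSettings` are hypothetical names for illustration; this is not how the Facebook SDK is actually written.

```typescript
// Hypothetical shape of the settings an SDK expects from its backend.
interface SdkSettings {
  loginEnabled: boolean;
}

// Conservative defaults used whenever the payload is not what we expected.
const SAFE_DEFAULTS: SdkSettings = { loginEnabled: false };

// Parse a raw JSON body without trusting its shape: anything that is not an
// object carrying the expected field yields the defaults instead of throwing.
function parseSdkSettings(rawJson: string): SdkSettings {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return SAFE_DEFAULTS; // body was not valid JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return SAFE_DEFAULTS; // e.g. the server sent `false` instead of a dictionary
  }
  const candidate = (parsed as Record<string, unknown>)["loginEnabled"];
  return typeof candidate === "boolean"
    ? { loginEnabled: candidate }
    : SAFE_DEFAULTS;
}
```

The design choice is simply to treat data from the network as untyped until it has been checked, even when it comes from your own vendor's API.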

Speaker 2

29:19

It was data that was part of a third-party SDK.

Speaker 3

29:22

And so yeah, holy moly, we had all of these apps that use the Facebook SDK, which is almost every consumer mobile application, completely die on boot. Some of them didn't, and we found that quite interesting. And so because we're a crash monitoring solution and error monitoring solution, we saw all of the crashes coming in from all of these major consumer mobile.

Speaker 2

29:44

Applications that use our products. So we had a bit of a deluge.

Speaker 3

29:47

And luckily my infrastructure team built an architecture that was autoscaling, and we barely noticed the blip, which is fantastic. But yeah, we were like, well, why does this app have this volume of crashes, while this app's fine? In reality, it was really sensible defensive programming that some developers had done and others hadn't. So one approach was wrapping the SDK in their own error hooks, so if this crash happened, it couldn't bubble up to the top and crash the application.

30:16

That one's pretty straightforward, though easier said than done, because a lot of asynchronous work was happening in that SDK. The other one, which is a bit more aggressive, and which I actually think is a really good best practice in general for anyone using SDKs, is wrapping the SDK initialization in

30:30

a feature flag. So we saw some shapes of error chart coming in that were like, well, there's tons of crashes, and then immediately went down to zero because these customers of ours were able to turn off Facebook's SDK by updating a feature flag remotely.

Speaker 2

30:46

That then did not initialize that code for their customer base.
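The two defenses described here, an error boundary around the SDK and a remotely controlled feature flag around its initialization, can be sketched together as follows. `isEnabled`, `init`, and `report` are hypothetical callbacks standing in for a feature-flag lookup, the vendor's init call, and an error reporter; no real SDK's API is being shown.

```typescript
// Outcome of attempting to start a third-party SDK.
type InitResult = "initialized" | "disabled" | "failed";

// Guard a third-party SDK's startup.
// - `isEnabled` stands in for a (remotely updatable) feature-flag check.
// - `init` stands in for the vendor's initialization call.
// - `report` receives any error instead of letting it crash the app.
function initThirdPartySdk(
  isEnabled: () => boolean,
  init: () => void,
  report: (error: unknown) => void,
): InitResult {
  try {
    if (!isEnabled()) {
      return "disabled"; // flag is off: never touch the SDK at all
    }
    init();
    return "initialized";
  } catch (error) {
    report(error); // record the failure, keep the app running
    return "failed";
  }
}
```

As James notes, the try/catch half only covers synchronous failures; work the SDK schedules asynchronously can still throw outside this wrapper, which is part of why the feature flag is the stronger of the two protections.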

Speaker 3

30:49

I wish I could tell you which customers, because I'd give them a shout-out, but obviously.

Speaker 2

30:53

Obviously private stuff.

Speaker 3

30:54

But you know, there are all these ways now that, if you're relying on third-party code, you have to be super aware of all the ways that that code could change based on external dependencies, and protect against that.

Speaker 1

31:05

Yeah, I'm actually quite amazed that some people actually were that proactive to account for this, because I think in the NativeScript apps I wrote, I never once made that assumption. I mean, okay, so it's one thing if you call out to a third-party API or something, right? Those are the situations where usually you would have some error handling. Like I'm building a mapping app and I need to get locations to show

31:25

on markers, and I call some service. Well, I'm gonna have some error handling for that because this is a call.

31:30

But I had never accounted for the service itself, right, because almost all of these things in native have some sort of init call, right, you pass in an API key and you initialize it, and usually those things don't even have error handling hooks, right, at least in my experience. It's not like you call Facebook.SDK.init and you have to pass it an on-error handler.

31:54

It's just assumed you do it right, and normally, then later on in your code you just have to make sure it's there, sort of thing. But you know, you never account for the fact that, like, what if it did something erroneous or something that I totally didn't expect. So I'm absolutely amazed, because I would definitely be in the camp of people that hard crash for sure in this sort of situation, because I sort of assume these things are always going to be there.

Speaker 3

32:18

I've got to believe that the people who did that either have been bitten by this problem in the past and in their post mortem retrospective they were like, let's do this, or they were used to working.

Speaker 2

32:29

In an environment where SDKs are less reliable.

Speaker 3

32:32

One of those is, in my former life, I used to work for a company that did gaming SDKs, mobile gaming SDKs, and notoriously ad provider SDKs were the crashiest SDKs. Because you've got these companies where they're experts in monetization, they're experts at building relationships with developers and publishers, but maybe their SDKs aren't the hottest SDKs around, and people are swapping them in and out all the time to

32:57

get the best deal. The business team is saying, right, we need to swap out to use this ad provider because they're giving us a better deal. But no one's saying, are they well-known developers and do they have a good, high-quality SDK?

Speaker 2

33:07

So I know, at least in the gaming space mobile gaming space, people.

Speaker 3

33:11

Were very wary about adding in new ad SDKs and therefore probably more likely to protect against problems.

Speaker 1

33:17

Yeah. And the other thing too, is that in native land it's really hard, if not in some cases impossible, to actually fix these problems on the fly. I mean, there are some things you can do, like in a React Native world, to hot swap production code,

33:31

but it gets wonky at times. So I'd imagine too that a lot of these would require full app updates through the App Store, Google Play and such to actually be fixed. So yeah, a fairly significant business loss, I imagine, for some of these people.

Speaker 2

33:46

Yeah.

Speaker 3

33:46

I wouldn't even want to think about the actual dollar amount of that because it would stress me out too much. But I know that it was very rare that people used feature flags and turned these things off. It's very rare that people are using CodePush or something like that to hot patch these things. I know from looking at the data that a lot of our customers that saw these problems effectively just rode the wave and waited for

34:06

Facebook because they couldn't do anything else. They had to wait for Facebook to fix it because this was a multiple-hour issue.

Speaker 1

34:11

But yeah, Facebook's going to fix it faster than they're going to be able to.

Speaker 3

34:15

Yeah. And in the end, I think Facebook just rolled back the code changes on their side that made the data structure change.

Speaker 1

34:21

Gotcha. So one of the things I want to get into is like workflow from a company perspective and like sort of any recommendations you might have, So like, if we take this example, what is like what would you I guess, what's the ideal like developer experience because obviously, like you don't want to be notified every time like somebody gets jQuery is not defined on your page, but you might want to know if your entire iOS and

34:44

Android user base is suddenly hard crashing instantly. So I guess what sort of systems do you support, and what do you maybe recommend for notifications? Are people getting emails for this? How quickly are people getting notified and able to respond when something like this happens?

Speaker 3

35:01

So the way we've built Bugsnag at least is that we try to fit into existing workflows. So if you are using Jira or any other project management or issue tracking tool, we support that, and we don't just support it as a "create issue" in that tool. We typically have what we call a two-way synchronization. So if you link a Bugsnag bug to Jira and someone marks it as fixed in Jira, or whatever you're using, that will mark it as fixed inside of Bugsnag.

35:29

And so we support pretty much every major issue tracker and project management tool, and we try to have a two.

Speaker 2

35:33

Way sync on all of those platforms.

Speaker 3

35:35

We even do things like, if you've marked it as fixed in your Jira or whatever it is, and then Bugsnag detects it's still happening in a later release, we'll automatically reopen.

Speaker 2

35:43

It and mark it as a regression.
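The regression rule described here can be written down in a few lines. This is a sketch under assumptions: `TrackedError`, `fixedInBuild`, and the use of a monotonically increasing build number to order releases are all hypothetical, and Bugsnag's actual data model is not described in the conversation.

```typescript
type IssueStatus = "open" | "fixed" | "regressed";

interface TrackedError {
  status: IssueStatus;
  // Hypothetical: the build number in which the fix shipped, if any.
  fixedInBuild?: number;
}

// Apply a newly received occurrence to a tracked error. An error marked fixed
// that shows up again in the fixing build or any later one is reopened as a
// regression; occurrences from older builds leave the status untouched.
function applyOccurrence(error: TrackedError, seenInBuild: number): TrackedError {
  if (
    error.status === "fixed" &&
    error.fixedInBuild !== undefined &&
    seenInBuild >= error.fixedInBuild
  ) {
    return { ...error, status: "regressed" };
  }
  return error;
}
```

Comparing against the fixing build matters because users on old releases keep sending the old crash long after the fix has shipped, and those reports should not reopen the issue.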

Speaker 3

35:45

So we don't want to be a product that comes in and says you have to completely change the way you're doing things. We want to be a little nudge in the right direction. So integrations with the project management tools is a key way that we do that. We also integrate with alerting and chat tools, and so most people are using Slack these days, or Microsoft Teams or something like that. You can configure on an application-by-application basis.

Speaker 2

36:05

I want these types of errors to go into these types of chat rooms.

Speaker 3

36:08

So if you are a JavaScript developer, you might want to see all new errors that we haven't seen before pop up as a message. We recently launched something, about a month or a month and a half ago, that we call the alerting and workflow engine, and this is basically a more sophisticated way of routing those things. So you can say, look, I work on the payments platform in the React application, and that is defined as living under these URLs or having this package name in the code path.

36:35

So you can now set up alerts that match those patterns to go into a particular Slack channel, or to alert at a higher frequency because those are key code paths. So I've said this a couple of times now, but the client side is the Wild West. It's a little bit insane to turn on error alerts for every new error. So we have the option in Bugsnag to turn on

36:56

alerts for every occurrence of each error. That makes sense, again, if you've honed it down to say, I want to see on my Rails application every time a credit card fails to pass through because.

Speaker 2

37:07

Of a problem on our side.
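The routing idea above (match an error against patterns, send it to a channel, and choose between alerting on every occurrence or only on new errors) might be modeled like this. The `AlertRule` and `CapturedError` shapes, the URL-prefix matching, and the channel names in the usage note are illustrative assumptions, not Bugsnag's configuration format.

```typescript
// A captured error, reduced to the fields this sketch routes on.
interface CapturedError {
  url: string; // where the error happened
  isNew: boolean; // true the first time this error group is seen
}

// A hypothetical routing rule: errors under `urlPrefix` go to `channel`.
interface AlertRule {
  urlPrefix: string;
  channel: string;
  everyOccurrence: boolean; // false means "only alert on new error groups"
}

// Return the channels that should be notified for this error.
function routeAlert(event: CapturedError, rules: AlertRule[]): string[] {
  return rules
    .filter((rule) => event.url.startsWith(rule.urlPrefix))
    .filter((rule) => rule.everyOccurrence || event.isNew)
    .map((rule) => rule.channel);
}
```

For example, a rule with prefix `/payments` and `everyOccurrence: true` would notify a payments channel on each failure, while a catch-all `/` rule with `everyOccurrence: false` stays quiet until something genuinely new appears.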

Speaker 3

37:09

But on the web, your code is running on so many different devices in so many different environments that it's a bit.

Speaker 2

37:16

Much to have all of those coming in.

Speaker 3

37:18

So really the default that we have is to alert only on any new type of bug that hasn't happened before. And then, again, opinionated yet extensible, you can go in and fine-tune exactly what you want to see. We also integrate with everything else, like PagerDuty, and we integrate with webhooks.

Speaker 2

37:34

Want it to go to Splunk? You name it, we've got a connection to it.

Speaker 1

37:37

Actually, yeah, that makes sense. I didn't even think about that angle, but I could totally see, when I'm setting this up, saying, well, if something goes on here, I want it logged, but I don't necessarily want to know about it. But if a payment fails, or if a user registration, something highly valuable, goes wrong, I might want to ping someone immediately. Cool. Yeah, I could see the Slack bit of it being pretty nice.

Speaker 3

38:00

So you want to build trust in Slack. You don't want to be one of those tools, those products, that's a noisy bot, and so what we typically do is we tune it down by default. And the other thing we do is spike detection. So if we detect that there's an unusual increase in errors on a project, that will ping into Slack by default. But also that's the sort of thing that people hook up to PagerDuty or Opsgenie or whatever your on

38:22

call system is. So we're seeing more and more development teams, rather than just operations and infra teams, having on-call rotas, so you get woken up at four in the morning if there's an unusual spike in activity, if that Facebook bug happened, for example, in the middle of the night.
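Spike detection is only named here, not explained, so the following is one simple way it could work rather than a description of Bugsnag's algorithm: flag the current interval when its error count is well above the recent baseline. The function name, the three-standard-deviation default, and the `minCount` floor are all assumptions.

```typescript
// Flag a spike when the current error count exceeds the historical mean by
// more than `k` standard deviations. `minCount` avoids paging anyone over
// tiny absolute numbers on a quiet project.
function isSpike(
  history: number[], // error counts for previous intervals
  current: number, // error count for the interval being checked
  k: number = 3,
  minCount: number = 10,
): boolean {
  if (history.length === 0) {
    return false; // no baseline yet, so nothing to compare against
  }
  const mean = history.reduce((sum, n) => sum + n, 0) / history.length;
  const variance =
    history.reduce((sum, n) => sum + (n - mean) * (n - mean), 0) / history.length;
  const threshold = mean + k * Math.sqrt(variance);
  return current >= minCount && current > threshold;
}
```

A rule of this kind is what makes a four-in-the-morning page defensible: it stays silent for the usual background noise of client-side errors and fires only when the volume is clearly out of character.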

Speaker 1

38:35

Yeah, yeah, because that's like the one situation where you actually probably would want to be woken up at four in the morning. That's something that's worth looking into for sure. This has been sort of fascinating to me. Are there any topics that we have yet to cover that you'd like to get into, or do you have any other advice you'd like to give out there to people? We have a decent-size React audience. Anything that they should know just about bug reporting in general?

Speaker 3

38:58

I feel like the obvious one is, if you're not using some kind of production awareness, production error monitoring.

Speaker 2

39:05

You absolutely have to these days.

Speaker 3

39:07

It just feels like people who aren't doing it these days are just sticking their fingers

Speaker 2

39:11

in their ears and hoping for the best.

Speaker 3

39:13

And so I think most of the audience is already using something, even if they've homegrown their own thing that's a window.onerror handler that sends an email or something like that. So that's kind of the first thing. But I also think that, like you said before, error monitoring and stability management have come a really long way in the past five or ten years, and you don't need to declare bug bankruptcy anymore.

Speaker 2

39:33

You don't need to just turn around and say, oh, it's too much.

Speaker 3

39:36

We just give up. One thing I talk about all the time is that engineering and product teams actually should be really, really aligned on what a bad bug is and on the definition of when is the right time to work on bug fixes versus.

Speaker 2

39:47

Getting that new feature live.

Speaker 3

39:49

So have a tool that monitors this in production, and then have alignment inside your development team that says we are going to fix bugs that pass this threshold, and we're going to stop new product development if our stability drops below a particular percentage. So I genuinely think you need alignment between product and engineering, or some people call it business and engineering, but those two teams need to have alignment in order for you to know when the right time is to fix bugs and clean

40:17

up technical debt. Those are the two things that I evangelize.
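The "stop new product development if stability drops below a percentage" agreement can be expressed as a small calculation. This sketch assumes stability means the share of sessions that did not crash; the function names and the target in the usage note are hypothetical examples, not figures from the conversation.

```typescript
// Stability as the percentage of sessions that ended without a crash.
// Multiplying before dividing keeps simple cases free of rounding noise.
function stabilityPercent(totalSessions: number, crashedSessions: number): number {
  if (totalSessions <= 0) {
    return 100; // no sessions recorded, so nothing has crashed
  }
  return ((totalSessions - crashedSessions) * 100) / totalSessions;
}

// The team agreement, as code: pause feature work while stability is below
// the target that product and engineering signed up to.
function shouldPauseFeatureWork(
  totalSessions: number,
  crashedSessions: number,
  targetPercent: number,
): boolean {
  return stabilityPercent(totalSessions, crashedSessions) < targetPercent;
}
```

With a target of, say, 99.9 percent, 5 crashed sessions out of 1,000 gives 99.5 percent stability, which is under the target and would tip the team into bug-fixing mode.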

Speaker 1

40:20

Well, yeah, and I definitely agree with that from my experiences as well. I think, like you said, it's still far too common for, I mean, really huge organizations in some cases, with really important apps, to just be totally flying blind, right, just not doing anything. So I think it's a good note to end on. So why don't we go ahead and move into the picks? So I have just one

40:41

pick today. I don't even remember, with the weird way the internet works, what got me into this. But there's this guy. Have you ever heard of the guy named Wim Hof? Oh, yeah. I think he's this Dutch guy that's famous for going outside shirtless and climbing mountains in the snow, sort of thing. And for some reason this just fascinated me. So I've got a book on him called What Doesn't Kill Us, and I'm a few chapters into it.

41:07

I've been just listening to the audiobook and it's sort of fascinating. It talks through the science of, is it just this dude's genetic makeup that allows him to do it? Or is it just training, right, like getting more accustomed to this? And the answer seems to be a little bit of both. But it's interesting because it definitely takes more of a scientific perspective on, how in the heck is this possible?

41:26

So I'd recommend it if you're at all curious about that. I'd check that out.

Speaker 2

41:31

Is that not the guy who can mentally control his body temperature?

Speaker 1

41:36

Well, yeah, that's where it gets into a little bit of craziness, because you find yourself, as you read through this, going back and forth between, like, this guy is just Looney Tunes crazy a little bit. But at the same time, he subjects himself to scientific studies a lot, right? Like, unlike a lot of these people that are like your

41:54

clerics and crazy sort of nutcases, he actually submits himself to scientific studies a lot. He's been put through all sorts of rigorous tests and such as well, so some of what he claims has actually been proven. But then the part of, like, he's controlling this with his mind, that's the stuff that's sort of, eh. But it's interesting nevertheless, I guess.

Speaker 2

42:16

Yeah, on my picks.

Speaker 3

42:17

The thing that I've been playing with a lot recently is this new game that just came out just a few days ago called Fall Guys.

Speaker 2

42:24

I don't know if you've heard of this.

Speaker 3

42:28

It's on Steam and it's the PlayStation free game of the month, and it is ridiculous. It's like a Battle Royale game but crossed with Mario Party mini games, and it brings out the best and worst in people.

Speaker 2

42:41

But it's insane.

Speaker 3

42:43

It's so much fun, and it's a relatively small developer that's built it, and I imagine that their servers are getting hammered right now. But it's an insanely fun game. But yeah, apart from that, I don't get much time to do hobbies and side things at the moment. But my fun pick at the moment is this weird scene called the console portabilizing scene. I come from a gaming background, I'm a big gamer. But there's this huge scene of people who take games consoles and

43:10

then chop them up and make them portable. And so my little side hobby has been taking Nintendo Wiis, chopping them up and making them into a Game Boy form factor and things like that.

Speaker 2

43:21

So that's my other little pick, my little side hobby.

Speaker 1

43:24

That could be a lot of fun. I've wanted to before, because the virtual console type stuff is fun, but it's not quite the same as holding the actual thing in your hand. You get a nostalgia over time of like, man, there's something that kicks in when you just hold an old Sega Genesis or Super Nintendo or NES controller. I don't know if it's just nostalgia or what it is, but it's different than just

43:47

playing it virtually. So it's pretty cool.

Speaker 3

43:49

Yeah, you want to sit down and actually complete the games rather than just like I found with emulators and things, you'd pull them up and just go through ten games and say, oh, that was interesting, next game, rather than sitting down and having that arch experience.

Speaker 1

44:00

Yeah, yeah. Well, James, this has been a great chat. I think my last question for you is, where can people find you if they want to, you know, ask any further questions or follow all of what you do?

Speaker 3

44:10

I'm on Twitter at loopj, L-O-O-P-J. I'm not super active on there. I'm trying to get back into it, and then I'm relatively active on the conference and speaking scene as well, so check out my Twitter or Bugsnag's Twitter to see which conference I'm at next. But I do talk a lot about technical debt on the conference scene, so if you're around and you see me, drop in and I'll say.

Speaker 1

44:29

Hi. Awesome. Well, thanks again for joining us. It's been another episode of React Round Up, so have a good one, everyone.

Speaker 2

44:35

Thanks, TJ.

Transcript source: Provided by creator in RSS feed