Scaling and Shopify with Kir Shatrov - RUBY 633 | Ruby Rogues podcast

00:05

Hey everybody, and welcome to another episode of Ruby Rogues. This week on our panel, we have Nate Hopkins. Hello everybody. Andrew Mason. Hello. I'm Charles Max Wood from devchat.tv. And this week we have a special guest, and that's Kir Shatrov. Kir, do you want to say hi and let us know who you are? Hi, my name is Kir. I'm a production engineer at Shopify, where I work on the scalability of the

00:28

platform, and I'm based in London. Okay, nice. Now, Shopify doesn't have to deal with any scalability, right? I mean, they only run like half the shopping carts on the web and things like that, right? Oh yeah. So I'm curious as we dive into this. You gave us a couple of articles. One was on the state of background jobs. The other one was on capacity planning for web apps. I kind of want to start with this and dive mostly into: when should I start

00:56

caring about this? Right? Because if I have a small app, it matters a lot less for a while, and then eventually I'll get enough users or enough people using the capacity to actually go, all right, now I really need to start thinking about this. So, yeah, where do you find that the cutoff point is for this kind of thing? Definitely, there is a lot of talk and a lot of technologies, and it's natural for engineers to be super interested

01:23

in it. But the price of over-engineering things and choosing solutions that are maybe too complex for the stage where your project is right now, that price can be too high, and often the most resourceful thing you can do is just deploy on Heroku and let it run, and it will cost a few

01:44

hundred dollars for your Heroku bill. For me, I think the cutoff point is around the time when you start losing control of maybe your hosting costs, or you notice that whatever scalability problems you have start hurting your customers and you start losing money, either as a result of your customers being unhappy or as a result of the thing costing a lot more to run than the company can afford if it wants to run the business in a reliable way. Yeah, that makes sense.

02:22

It's interesting too that you've kind of tied it to those two practical breakpoints,

02:25

right? A lot of people try and tie it to, well, I have a certain number of users, or I have a certain size of an app, or I have, you know, a certain amount of server capacity, or stuff like that, and it's interesting to me that you've tied it back to, oh, it's impacting the customers, or oh, it's impacting my bottom line, and then it's like, oh, okay, how do I deal with

02:47

this? I also think it's interesting that you mentioned that it's easy to do if you just hand it off to Heroku and let them handle it. And I know that I haven't heard it as much from Nate, but I've definitely heard it from Eric over at CodeFund that that's kind of his approach. He doesn't want to deal with DevOps. He just wants to push it to the cloud and then, you know, let them handle it, and he's willing to pay for Heroku to do it. Yeah,

03:09

that's our philosophy right now. But I mean, we're also short staffed, right? Yeah, so we've got two, well really, we're just one and a half developers on the project. We've got plenty of contributors that help us fix bugs and things like that, but there's only two of us that are full time looking at code, and Eric's really only about half time looking at code, if that. So we don't have the time or the bandwidth to really delve deep into, you know, the ops

03:40

story. That makes a lot of sense. So I'm curious, Nate, at what point would you guys consider moving off of Heroku? I mean, would it be a cost thing or would it be something else? You know, we've found product market fit and we are trying to scale it now. We're trying to scale on the sales side. So as soon as we have enough customers and enough consistent revenue flowing in to allow us to kind of back off and look at our operations story, that's probably the time.

04:11

So I would say we're probably maybe six months away from, you know, having the luxury of being able to look at that. Yeah, that makes sense. So Kir, as somebody gets to that point, and I think this might be a relevant conversation then for Nate, when they get to that point and they're thinking, okay, we're going to scale this, maybe they move it off of Heroku and onto, you know, a Kubernetes cluster, or they move it onto a virtual private

04:36

server, something like DigitalOcean or something. What things should they be looking

04:41

at then to scale their stuff up? For any hosted services, like for instance, it's common to use a hosted database as a service, I think it's important to look at whatever limitations that service has, because any hosted service would have some of those. I remember reading a blog post where an app had a very specific requirement for some Postgres extension that they'd been using, and they switched, I think, three providers that gave them Postgres as a service, and they'd

05:15

been unhappy with each, and they obviously spent a lot of effort, and finally they got to run Postgres on their own, because having that very extension was a huge requirement for them when choosing a provider like that. It's important to understand any limitations. And from another angle, I think there are so many scalability-related problems that you can run into that usually

05:44

you start looking at the one that's most critical right now. Like, I've been part of projects where they've run into scalability issues with the database layer, with MySQL or with Postgres, and as they fixed it and iterated on it and their database could accept a lot more load, they came to another bottleneck, and that bottleneck is different every time, depending on the business, depending

06:14

on the usage patterns coming from your customers. So it's fixing one thing at a time, one by one, and sometimes that's a never-ending story, especially if the company grows large and there is a team that works just on scalability, which is currently the case for my team at Shopify.

06:34

Yeah, that's a terrific point in terms of, really, this is not a job that ever completes, right? It's something that you're always having to stay on top of, especially if the company is enjoying any level of success. One cool thing about CodeFund is, even though we're on Heroku, we're able to leverage some of the more advanced Postgres features like table partitioning and things like that, which has enabled us to continue to scale

06:58

on that platform. We're hosted on one hundred and sixty plus sites right now, and so we're seeing between two and a half million and three million requests a day pipe through the server. Now, we are paying a premium for Heroku, but I think we're still under eight hundred a month on our production setup, and we're probably a little over-provisioned in anticipation of spikes and things like that, and so we don't quite have the fine-tuned

07:26

control that we would like to have. To your point on Postgres, being able to customize that and install your own plugins and things like that into the database layer would be fantastic, because since we are using table partitioning, I know there are some plugins that are just not broadly available on the Heroku platform that would be kind of a luxury for us to use, and we've kind of had to work our way around some of those things.

07:54

I'm curious about your experience and time with Shopify. How long have you been with the team and what types of changes have happened since you've been at the company? I've been at Shopify for almost four years, and I've always been part of the production engineering department, which deals with the infrastructure and is less

08:16

exposed to the product. And just that department grew so much while I've been here, from maybe thirty people to now more than one hundred, and all of those people are working on the infrastructure and reliability, with the motto that our job is to keep the site up. There's another aspect of scaling here, going from forty to one hundred people. Like,

08:43

how has the team scaled? Like, what's the dynamic been like? Yeah, it's interesting to follow the dynamics of team scaling in every organization, and I imagine it's a different story each time. It affected so many things. Like for instance, at the time when I joined, Shopify is based in Canada, and most of the infrastructure engineers were in just one office. Now people who work on the infrastructure are based in three offices, and there are also a lot

09:16

of remote people like me. And then as you grow, you end up investing in some of the things that you would never have invested in before, and you have teams who work just on one part of the development environment, for instance, or just on background jobs infrastructure, something that I wouldn't have imagined three years ago. So what is the technical portfolio for Shopify, and how has it changed since

09:46

you joined? That's a great question. Obviously, there's been a lot of new tools and techniques and stuff that have come out just over the last four years, and so I'm curious what the evolution of tooling has looked like. Yeah, that's a great point of discussion. So I think first there is some context I wanted to give to our listeners.

10:07

First is that when Shopify was founded about twelve years ago by Tobi Lütke, Tobi was one of the first contributors to Rails, and he knew David, DHH, and they exchanged some emails. Around the time when he started the company, when he started Shopify on Rails, Rails was just a zip file that they exchanged over email. It wasn't even some specific version published on a gem server, because I'm not even sure there were any gem servers at that point.

10:43

So from that day when he started on Rails, that app still exists. It was never rewritten. It's a monolith that has been around for more than a decade. We tend to put a lot of love into it to make sure that the developer experience stays great. It often happens that a monolith is just too slow and too hard to work with, so developers get so much friction that they decide to go splitting it up or calling the monolith legacy. It never

11:18

happened for us. I've got to interject and just ask a question on your monolith. I know Shopify is a very large company; how many developers have their hands in the monolithic codebase? My rough guess would be from one hundred to two hundred people, given that R&D in total is a lot more, because there would always be people working on other parts

11:43

of the stack, also mobile developers and so on, as you can imagine. So back to your point about how the stack has changed: in terms of tools that are familiar to listeners of the podcast, it's still pretty much a classical Rails app with all the things that come with it. In terms of the infrastructure, I think the biggest shift that I have observed at the company was the move

12:07

from physical data centers to the cloud, to Kubernetes. And that's another interesting story, because we were able to move to Kubernetes in the cloud one shop at a time. Given that we have millions of them, we wanted to make this process as continuous and finely controlled as possible, so we just took one shop, moved it to the cloud, and progressed from there, and we were able to control that. It's fascinating to me that you have upwards of two hundred developers working on a

12:41

monolithic Rails codebase. Some conventional wisdom that I've heard in other circles, and certainly bumped into in my career, has been that if you're going to scale your organization, you apply Conway's law and break out into microservices. And the conventional wisdom seems to be that that's really the only way to do it, and you guys are a terrific counterpoint to that. What are some techniques

13:05

you've used to facilitate it? I think one of the biggest has been adopting domain-driven development and splitting that monolith into, I would not call them namespaces, but kind of components, at least that's what we call them. There is nothing very secret or special about it. It's basically just a way to structure your app directory so that each team, each component, gets their part.
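
To make the component idea concrete, here is a minimal Ruby sketch of how ownership could be derived from a file path. It is only an illustration: the `components/<name>/...` layout, the component names, and the Slack channels are assumptions made up for this example, not Shopify's actual structure or tooling.

```ruby
# Hypothetical ownership table: component name => contact metadata.
# The names and channels below are invented for illustration.
COMPONENT_OWNERS = {
  "support"  => { slack: "#support-dev" },
  "checkout" => { slack: "#checkout-dev" }
}.freeze

# Returns the component name for a path laid out as components/<name>/...,
# or nil when the path does not live inside a component.
def component_for(path)
  match = path.match(%r{\Acomponents/([^/]+)/})
  match && match[1]
end

# Looks up who owns the code at the given path (nil if unknown).
def owner_for(path)
  COMPONENT_OWNERS[component_for(path)]
end
```

With a table like this, anything that knows a file path (a stack trace, a failing test) can be pointed at the team that owns the code.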

13:35

Therefore, it helps a lot to establish ownership because, for instance, as soon as you see an exception in production in whatever exception tracking service you use, you see that the exception is coming from components/support/app/models/something. You immediately know that it's the support component, and you have all the metadata to find people who can help with that, even an on-call escalation or a Slack channel where you can

14:05

chat and point it out. And we started leveraging that to automate some other things. For instance, if an exception within the app happened in that component, we'll send a notification to their Slack channel, not to some generic Slack channel with tons of exceptions from all over the company. Establishing

14:26

that ownership is, I would say, the main technique. Okay, so kind of a domain-driven design, and then you give a team full-stack responsibility, or at least all the areas of the stack that that particular domain piece may touch, right? So that could slice all the way through the front

14:43

end, all the way down into the model layer. Yeah, it's not as strict as you might imagine, and there would always be cases of reaching out directly from one Active Record model to another across components, across different domains, and that's not great. We try to build tools to discourage people from doing that and to help them know what the right patterns are. For us, it's mostly entry points that are typed and declared and

15:16

documented. So this is kind of shifting gears a little bit. I'm really curious about the database infrastructure, because I know at Shopify you've essentially sharded the database, or maybe not sharded, but there are multiple instances of the database, right, that back all of this. How is that structured? And how do you manage that from an ops perspective? Oh yeah, that's also a great discussion point. So also to give some context to the listeners.

15:45

For all the well-known Rails companies like Shopify, GitHub, and Basecamp, to name a few, that were founded around ten years ago: at that time, MySQL was the best-known database that everyone knew how to run and operate. People were the most familiar with it, and some others, like Postgres, were maybe not as good or as established at that point. So that's one huge reason why this subset

16:18

of companies, including us, are all based on MySQL. And yeah, I think it was around 2014 or 2015 when we realized we could no longer fit everything into one DB. We figured out we had to find a way to scale horizontally, and for a multi-tenant SaaS application, there is a great way to do that. Since your tenants are always isolated,

16:48

you don't have any joins between multiple tenants, so you can put tenants on different shards, on different partitions, and manage those independently, which also reduces the blast radius. If you have a hundred shards and one is down for whatever reason, only one percent of your customers are getting some negative experience, and you go and fix that as soon as possible. But it's not all of the platform. So we invested a lot into

17:22

sharding. In terms of application logic, it's mostly done in the Rails layer. We have a Rails team at Shopify that helps to steer that in the best direction possible, at least from the Rails point of view. And from the ops point of view, it's just a lot of shards that can be located even in different regions, which also allows us to isolate some tenants geographically. So let me just recap to see if I've got the

18:00

picture in my mind correct. So we've got a Rails monolith that's structured with these domain areas of responsibility. That's how you structure your teams, and the way you've scaled this, at least up to this point in the conversation, is you're just dealing with, like, mountains and mountains of data,

18:18

so you've sharded your multi-tenancy across different database nodes. For the developer, it can just look like a typical Rails application, correct? And something to add is that our goal is to keep all that sharding complexity hidden away from developers who write product features. For them, it may feel like there is just a database with a lot of tables that represent the business model, but underneath there would be some smart shard selection that would happen at the beginning

18:53

of the request, for instance, that would select the right database. And I mentioned this just for MySQL, for the relational database, but we've realized that it makes no sense to have sharded MySQL but just one global Redis, because regardless of how well you shard MySQL, that one global Redis or that one global Memcached would still be a single point of failure. And as you can

19:23

imagine, we learned that lesson by experiencing those single points of failure. So our philosophy is that every resource would be sharded, so there would be a smaller instance of Shopify that has its own MySQL, its own Redis, and its own Memcached, which helps with this isolation. So with each web server, essentially, or maybe each partition of web servers that scales horizontally, all of those would not necessarily have a local copy of Memcached and Redis, but

19:57

maybe just a shared one for that cluster of web servers? One thing I should note is that stuff like web servers is still all shared capacity, and it's only resources that are isolated. So any web server can talk to any partition, or any smaller instance of Shopify; it's mostly a matter of selecting the right path depending on who the customer is. So now I'm a little curious, because there's obviously a pretty significant coordination piece

20:37

there. You know, when the request initially comes in and then you assign the correct Memcached server, the correct Redis server, and the correct MySQL server. How much of that infrastructure did you guys have to build at Shopify, and how much are you leaning on the database providers for those things? Honestly,

20:56

I think it's mostly all built in house. And to give a bit of context about that, it's mainly a component called Sorting Hat. I like the name. The Sorting Hat uses a global lookup table to

21:15

find which domain, which shop, is on which partition. It gets the partition and then it goes to the location of that partition, which can be US West, US Central, US East, or somewhere else, and then it just hits the right database located in that region, and the request is routed all the way through to Rails, mostly through HTTP headers. And what I find very interesting is that we were able to build all of that on top of nginx, since nginx

21:49

allows you to write scriptable Lua modules where you can implement any kind of logic. In nginx, you can query your database to look up where that tenant lives, and then you just proxy the request through nginx, manipulate the headers, and make this work. So it's quite a lot of infrastructure that we had to write. But at the same time, as I talk to different companies, it's all custom tailored and

22:22

there is rarely the same stack or the same use case. So that would be maybe a bit hard to share and abstract. So yeah, how much of that infrastructure tooling is open source? Is that all secret-sauce internal stuff, or have you open sourced some of it? We try to open source quite a few things. There are also a lot of conference talks that we'll link in the show notes that give a way better overview of

22:55

the architecture than I just explained. The routing layer itself, I wouldn't say it's open sourced, but there is lots of information out there for someone who would want to build it and use the same techniques. So that's probably a good segue into additional scaling aspects. You've addressed pretty much the entire persistence layer's horizontal scalability, but you still have response times to deal with, right? And so one way to

23:26

make response times fast is through background jobs. And I know you've got quite a bit of expertise there. What is the approach and architecture of Shopify's background

23:38

job system? Well, and just to pile on here real quick, it seems like when people start talking about scaling Ruby or Rails apps or Sinatra apps or whatever, this is one of the first things people reach for, right? Because any long-running task, they just, you know, shunt it off to a background job and report errors back to the user if they have to, and it shortens the response time because then it's, hey, go do this job, instead of, I'm going to grind through the work of doing this job.
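
As a rough illustration of "go do this job" versus doing the work inline, here is a minimal in-process job runner built only on Ruby's standard `Queue` and `Thread`. It is a sketch under stated assumptions: `TinyJobRunner` is a hypothetical name invented here, jobs live only in memory (they are lost if the process exits), and this is not how Resque, Sidekiq, or Active Job are implemented.

```ruby
# A tiny in-process background job runner (illustrative only).
class TinyJobRunner
  def initialize
    @queue = Queue.new
    @worker = Thread.new do
      # Queue#pop blocks until a job arrives; a nil job is the shutdown signal.
      while (job = @queue.pop)
        begin
          job.call
        rescue StandardError => e
          # A failing job is reported and skipped so the worker keeps running.
          warn "job failed: #{e.message}"
        end
      end
    end
  end

  # Hands the block to the worker thread and returns immediately.
  def enqueue(&job)
    @queue << job
    nil
  end

  # Lets already-queued jobs finish, then stops the worker thread.
  def shutdown
    @queue << nil
    @worker.join
  end
end

runner = TinyJobRunner.new
runner.enqueue { puts "sending welcome email..." } # slow work happens off the request path
puts "response returned to the user"
runner.shutdown
```

The caller only pays the cost of pushing onto the queue, which is the response-time win being described; the trade-off is that errors can no longer be returned in the same response and have to be reported some other way.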

24:10

Yeah, and before you jump in with an answer too, one thing to bear in mind is that some of this stuff is just baked into Rails with Active Job. You don't even have to set up Redis or anything like that to support it, right? It'll run it on a background thread out of the box. So what is the path for a developer, to kind of follow Chuck's lead-in question? You start on a small project that's maybe a little hobby thing, and it starts to get some traction and then maybe

24:36

it turns into a business. What does the evolution of that background job handling look like over time? Oh yeah. And to note, myself, for some of my side projects, I run background jobs exactly like that, in a background thread in those Puma processes. Yeah, just because it makes no sense to pay extra for, for instance, Sidekiq dynos on Heroku for those pet

25:02

projects. And exactly as you pointed out, it makes sense to start with something as brutal as a background thread. And then I'm really happy that the Ruby community has a project like Sidekiq, and Mike Perham, who is behind that project, who has pushed the community to adopt some best practices around background jobs, and who also offers ninety-nine percent of what the community needs as an open source project. And for the remaining one percent, when you get to that point,

25:41

you can buy a Pro or an Enterprise edition, and I'm pretty sure that when anyone is at that point, that's actually quite affordable software to buy as a company. And just like most of the community who is using Sidekiq, Shopify is very similar in terms of setup. Because we've been around for such a long time, we started with Resque, if anyone remembers; that was a pre-Sidekiq-era library to basically achieve the same.

26:21

So we still run Resque, we run Redis. We got to rewrite most of the Resque internals because we're multi-tenant and we want to share some of the capacity and reuse it between tenants, which we can dive into later if you'd like. I guess the first question from you and from some of the listeners could be why we're not on Sidekiq, and the answer, I would say, is mostly the legacy part and also how well we know the stack and how much we've customized it for us at this point. But we're also

27:00

starting some smaller apps at the company, some smaller Rails apps. In fact, in addition to the monolith, we probably have a couple hundred other smaller Rails services for something very specific or maybe something just employee facing, and all of those would use the recommended set of libraries, which includes Sidekiq. Yeah, that makes sense. I'm also working on a software as a service. I'm sponsoring one of the bigger conferences that serves that niche, podcasting, in August, and

27:32

so I anticipate that things are going to grow. And yeah, I have a lot of things that I am pushing into background jobs right now, just because I want to get the response times down. But one thing that I'm wondering about, and I'm kind of tempted to go with Heroku, but part of me, I don't know, I have this mental block about paying for something that I could probably figure out the scaling on myself, or at least do a couple of minor things to help with

27:57

the performance and scaling that way. So what should I be looking at next? It seems like you all have kind of gone toward the cloud, and I'm wondering if that's the right answer, or, you know, beyond background jobs, what's the next step? A step to reduce response time? No, it's more a step to just get it to scale, you know, to be able to handle more traffic without having the site

28:19

slow down. Right, there would always be some kind of bottleneck, which, if you have a good setup of tools, should be possible to find. And for us, that bottleneck has changed over time. I would guess there is no single answer, because maybe there is something in a web server, in a controller, still spending quite a lot of time,

28:48

which slows down the response time. Or maybe it's the database that's the bottleneck, or maybe it's Redis, or maybe Rails reaches out to some external service that is not located close to it, which increases latency and also impacts response time. Yeah, that makes sense. I'm curious what criteria you use to determine what should move into a background job. Obviously you may hit some latency on a particular request and see something that is kind of low-hanging

29:23

fruit to move to a background job. But just because you moved it to a background job doesn't mean you've actually addressed the root of the problem. You've

29:30

just moved it out of the request flow, right? Oh yeah. And a very common pattern that I see people use with jobs is, for instance, you want to iterate over all users in your app and do something for each of them, maybe remind them that they need to add a credit card, or maybe something expired, or you want to send them an engagement email. When you start, you have just one hundred users, so that job finishes pretty quickly, under a minute maybe, depending on what kind of work

30:07

that is. You grow to thousands, hundreds of thousands, to millions, and a job to iterate over a million users and to check the balance of each of them, that job starts taking days or weeks. And how do you solve that? It's just so easy to introduce that problem. You just do User.find_each in a job and it works, but only until the point when it stops. So the way we solved it, and that's

30:40

actually all open source, and we'll also link it in the show notes. We've solved that by making every job interruptible and preserving a cursor, so that a job would progress for a bit and then maybe it would get restarted for some reason. Basically, this allows us to iterate over really long collections and do some work with them and never lose the work that has been done. Nice. Yeah, that's really cool. I'm gonna check out the Shopify job-iteration. That sounds really, really

31:15

interesting. One of the things that we've done at CodeFund when we're iterating is, of course, we'll do like a find_in_batches, and then we will just enqueue the smaller work, so when the large job fails, it's essentially idempotent and can just be rerun again without impacting things that may have been half processed or halfway chunked through. Yeah, that's the

31:37

approach that I take as well. An interesting side effect of that could be that this leads to a fan-out of a million jobs, because if you have ten million users and each batch is a size of ten, for instance, like, the numbers don't really matter, but the point is that with a fan-out of so many jobs, we need to remember that something like Redis is always limited in memory, and there's been so many times,

32:08

I would say across every organization where I've worked, that people would push Redis into an out-of-memory state, and unfortunately, I would love to have a great solution for that. But every time we want to do something like you describe, iterate in batches and enqueue something, we have to be mindful about what's behind that. And yeah, I've seen that as well. You start dropping jobs because there's no memory left. It certainly happens

32:44

at times when jobs might be failing. Right, Sidekiq gives you some pretty nice failsafe capability where it will reattempt those jobs. But if you've got a bug and not a lot of memory dedicated to your Redis instance, then of course you may start losing work that may be critical to the business. Yeah, I could see that. I haven't run into that myself,

33:06

but I could definitely see that happening. This is a great reminder about all the sorts of databases that exist out there, and maybe it will push someone to learn about that, because in the end, Redis is an in-memory

33:19

database, which is bound by whatever RAM you give it. It can be a gigabyte, can be four, can be sixteen, and that backlog of jobs would not be backed by storage that's bigger than RAM, which would be disk if it's, for instance, MySQL or Postgres. So something that we would really like to find is a store that could persist those things on disk with performance not too far and features not

33:54

too far from Redis. Redis does have the capability to write to disk, right, to flush itself out to disk? Yeah, but that only helps to have a snapshot in case the computer where Redis is running reboots; it still doesn't allow you to store more than the RAM that you

34:16

have. Yeah. I mean, that's probably a great argument to move to the cloud, right? Because on Heroku, it's just one button click, when you see the memory filling up, to scale out or scale up your Redis storage capacity. Yeah. And a lot of cloud databases or cloud instances have methods for compensating for that, and so they will just migrate you to a bigger instance or, you know, basically allocate new memory without you even having

34:45

to click. As long as that workflow is validated and people are certain that it will work, that's a great feature of cloud providers. One of the thoughts that I've had architecturally, which would be kind of neat for background processing, is that some jobs obviously are a bit more ephemeral and less critical, and they could be handled in a little bit more localized fashion. So it'd be neat to build a routing layer that was intelligent, where you maybe had

35:15

three stages of Redis or just background job storage, right? One could be: this is very ephemeral and not very important, so we'll just let it be handled in process on a separate thread, so we'll route that job over there.

35:31

Or it may be that the job is still kind of ephemeral, but a little bit more important, so we could have a dedicated Redis instance sitting on the web server that has just a small set of dedicated memory for that, and you could push those jobs there to handle some of that back pressure. And then for the really important stuff, you could hand those off to, like, your appliance tier of Redis storage that gives you the

35:52

full capacity across the entire application. Oh yeah, we haven't done something like this for jobs, though I think it could help a lot. But in general, in terms of building systems, I think this is a common case of defining priority for different workloads, which also allows you to shed some of the load. So, for instance, you would have, it doesn't have

36:17

to be jobs. It could be something as basic as web requests. There are requests that go to something that's very important to the business, maybe checkouts, which have the highest priority. Then you have something medium priority, which may be just browsing the admin, and then you have something low priority, like fetching robots.txt or the sitemap, or hitting an API. And by declaring priorities for those requests, when you're under load, you can shed

36:55

some of those that you don't need. And this idea comes mostly from the largest companies in the industry. Google has lots of papers and books on how they do it, and as you can imagine, every request to a Google service would have some kind of priority, and they actually shed those. Like, I'm pretty sure

37:17

that mail is higher priority than watching videos on YouTube. It's really interesting. And one of the neat things about Sidekiq is, if you couch that in terms of background jobs, Sidekiq provides some of that facility just out of the box, even for a simple deploy, right? Because you can prioritize. You can say this is in the critical queue, this is in the default queue, this is in the low priority queue, and

37:43

Sidekiq will drain the higher priority queues first. Now, you could start there and then eventually expand out and say, well, I'm going to give a set of dedicated worker virtual machines or dynos or whatever to process a particular queue. And I may even set up a dedicated Redis instance or tier for that particular queue. But you can start with just a simple Redis instance and the default Sidekiq configuration. I say that just for anyone listening, because when we're talking

38:14

about, like, scaling large systems, right, like Shopify. But if you're starting a Rails app, for me the go-to is pretty much: I always reach for Redis, Postgres, and Sidekiq, along with everything else that comes out of the box with Rails. That's pretty much what I always go for when I start a new project. Yeah, I mean, I've used Resque in the past for a lot of projects, and then, yeah, I've moved to Sidekiq for my newer stuff. But yeah, when is it too much

38:42

to background something? Right? So I wrote a gem that allows me to essentially background any method that hangs off of an Active Record model, which is really convenient, but what I've found is it makes it almost too convenient, where if something seems to be slowing down a request, you can just do .defer and then the method name and it would stick it into the background, which is great, but it got abused and we ended up

39:07

with far too much running in the background, hitting those problems you're talking about, like exhausting memory and stuff. So how do you determine what should be backgrounded? That's a good question, and frankly, as someone who's spent quite a lot of time on that part of the stack, I'm not sure there is a single answer. I think it's somewhat related to how, for instance, if it's Active Record and SQL queries, how heavy are those

39:37

queries? If your request timeout is thirty seconds and just one SQL query that's for some reason heavy, some kind of aggregation, takes ten, and you need

39:50

maybe to run a few of those, there is no way to fit that into a web request. And of course it might not make a lot of sense to do the premature optimization, and it can be fine to just start with everything in a web request, in a controller, and then you find out that's the thing where your app spends most of the time in a web request, and you just move that to a job. Because for simple apps, maybe that part will never be a job, and it will

40:20

scale fine for the next few years. Yeah, I wonder if a good approach would be... This probably very much depends on whether you've got paying customers that are being impacted, right? So, if paying customers are being impacted and you've got just some inefficiency in a query or some aspect of a web request, maybe you background that, but you also put it in some type of planning process where you revisit that job and try to actually

40:47

optimize the real root of the problem. Yeah. I tend to use background jobs when I have a performance issue in the request pipeline, like we've talked about before, and then if there's a problem with running it in a background job, you know, it's timing out or something's breaking or something like that, then I revisit it from there. I don't know

41:07

if there's a silver bullet. I think a lot of times it's context specific and you just have to go: okay, I'm moving this out of the request pipeline. Okay, now it's having a problem here, so now I've got to address the issue there. And yeah, you know, eventually it kind of bubbles itself up to the top of your tech debt queue and you address it.
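
One rough way to find the step worth moving out of the request pipeline, as discussed above, is to time each step of a request and flag the ones that blow a budget. This is only a sketch under stated assumptions: RequestProfile, step, background_candidates, the step names and the 50 ms budget are all hypothetical names and values chosen for illustration, and a real app would more likely read these numbers from an APM tool.

```ruby
# Hypothetical helper: times named steps of a request and reports the ones
# that exceed a budget, as candidates for moving into a background job.
class RequestProfile
  def initialize(budget_seconds:)
    @budget_seconds = budget_seconds
    @timings = {}
  end

  # Runs the block, records how long it took, and returns the block's value.
  def step(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @timings[name] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    result
  end

  # Names of the steps that were slower than the budget, slowest first.
  def background_candidates
    @timings.select { |_name, seconds| seconds > @budget_seconds }
            .sort_by { |_name, seconds| -seconds }
            .map(&:first)
  end
end

profile = RequestProfile.new(budget_seconds: 0.05)
profile.step(:load_order) { :order }
# sleep stands in for a heavy aggregation query.
profile.step(:sales_report) { sleep 0.1 }
puts profile.background_candidates.inspect
```

Steps that never show up in the candidate list can stay in the controller, which matches the point made here about not optimizing prematurely.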

41:27

So one thing before we wrap up: do you have some favorite tips or tricks or approaches that you use at Shopify, or have used at other employers, that make this easier, or, you know, something that you did that you're proud of? Yes. For someone who is curious about performance and fixing those kinds of bottlenecks, my best advice would be to study the whole variety of tools that you can use.

42:00

These tools can range from ones as high level and web based and simple as New Relic, and some of the similar services that you can connect to your app to see insights, to more system-level tools like, for instance, strace. The number of times strace saved me and some of my colleagues in the middle of a service disruption, it's so hard to count those. And my advice is not necessarily about strace, but about knowing the wide variety of

42:39

tools that you can use. Some of those tools are very Linux specific and system level. Some of them are Ruby level, like rbspy, a great tool by Julia Evans, or rbtrace, and then there are some services that offer those kinds of things. So if you know that range of tools and you know which one is the best for what you're looking for, you pick it up and fix the thing. Anyway, we've got to wrap up soon. I've got a couple, just a couple of questions to put

43:15

you on the spot here. One is, do you know what request volume Shopify does per second? The public number that I can say is about eighty thousand requests per minute. And what about background jobs? How many background jobs are being processed per minute? That's a great question, and to be honest, I don't remember those numbers off the top of my head. Yeah, yeah, probably. It's a lot, right? Yeah, it's a lot, and it can be very spiky.

43:50

And there is a huge difference between steady state and spiky state, because Shopify is also hosting some of the world's largest sales, sometimes for celebrities, sometimes it's worldwide cups and some special sales where millions of people try to crash Superfest stores. Yeah, I can imagine. CodeFund is tiny in comparison. Since January we've done over three hundred million. Wow, that still feels like

44:23

a lot to me. Yeah. We keep changing what's in the background and what's not in the background, so we've had that number kind of artificially inflated at times. But still, yeah, that's a lot of background work. Yeah, makes sense. All right, well, I'm going to push us to picks. Nate, do you want to start us off with the picks? Sure. So I guess one pick for me today is open source, and how fantastic open source is. I've got a thing on the side that I'm

44:52

doing for my brother-in-law, and it's basically a CRM. So I went kind of diving around for open source tools that I might be able to set up for him, and I found Fat Free CRM, which is a Rails-based CRM. It's a bit antiquated in, you know, the way it looks in terms of the UI and UX, but it's pretty fantastic. The data model's solid and it meets all of his needs, which is

45:21

terrific. The other pick I've got is cats. So we've got a Maine Coon and a Russian Blue, and they just provide so much joy for my girls and for the family in general. So I highly recommend getting a pet, and especially a cat. Nice. I'm gonna step in here with a couple of picks. The first one that I have is a challenge that I've been doing.

45:50

This is a challenge that has been less fun with a broken arm, but, you know, I started it because I really want to prove to myself that I can do this, and yeah, doing it with a broken arm, I just wasn't gonna wait to heal, because it's several weeks to heal a broken arm. Anyway, the challenge is called 75 Hard. It comes from the MFCEO Project podcast with Andy Frisella, and I've picked his podcast on the show before, but anyway, it's

46:17

basically a challenge that he made up. But it essentially is a challenge to prove that you can, you know, do what you've got to do for seventy five days. So there are five rules and if you violate any of the rules, then you have to start over the seventy five days. And the first rule is you have to work out twice a day for at least forty five minutes each time, and one of the workouts has to be outside.

46:42

So if it's raining, if it's cold, if it's hot, if there's a hurricane, you know, whatever, you're going to work out outside. And basically he says that that's just to push you through the, you know what, sometimes you have to do stuff when the conditions aren't ideal. The other rules: you have to drink a gallon of water every day. You have to read ten pages of a book every day. You have to choose a diet and stick to it, no cheating, every day for seventy five

47:10

days. So with a lot of diets, you know, people are like, well, I take a cheat day every week. No cheat days, no cheat days on 75 Hard. And then the last one is you have to post a status photo to social media. And so yeah, I've restarted twice so far. The first time I forgot to read the ten pages, which was dumb. It was the one thing I kind of took for granted

47:31

that I do and I didn't do it. The other one, I got a salad from Costa Vida and I didn't realize that I hadn't told them to take the rice out of it. And I've been doing a keto diet, so yeah, I started over. I felt really dumb about that. I was like, I know they put rice in it. I don't know why I didn't ask them to take it out. So yeah, it's just kind of learning to adapt to some of this stuff. But I'm definitely

47:59

enjoying the process. And incidentally, just to throw it out there, I've been doing the challenge for about a week and a half and, you know, I'm currently on day two. Just to throw that in there, right, because I had to restart. The flip side is that I've lost ten pounds. That's a serious program, like you're gonna be committed. Yeah, but he says it's a mental toughness challenge. Right, you're going to go and some days you're just gonna have to push through

48:30

and do some stuff that you really don't feel like doing. Yeah, like the run that I have scheduled today, it'll probably be both of my forty five minute workouts together, because it's one of my longer training runs for the marathon I'm gonna run in October. And yeah, I'm really feeling it

48:47

today, especially with my arm and everything else. I do not want to go out there and do it, but you know, I've got to suck it up and go do it. But yeah, you know, I've got to go do two workouts tomorrow, and tomorrow's a holiday, so yeah. Anyway, that's my pick. If you want to go follow me on Instagram, I think my handle is Charles Max Wood, and I've been posting my social media posts there. I tend to try and post them to Twitter and Facebook as well, but I'm not always

49:15

great about that. I'm pretty consistent on Instagram. So anyway, Kir, do you have some picks for us? To be honest, I don't know the format very well. Just, are there one or two things that you think everybody in the world should know about? All right, this one I think would be interesting for the

49:42

main audience, like Ruby developers. A couple of weeks ago, I followed a hacking guide from MRI committers that shows you how to build Ruby, how to change some simple source, and how to rebuild it again and see how it works. It also allows you to try all the new features that are coming with Ruby 2.7, because you build it from the master branch, so you can go and try stuff like pattern matching, if it's something that you're

50:14

excited about. And the reason why it can be interesting for any Ruby developer to try is because you get to see all the magic behind it, just all the C code, and it's no longer just a thing built by some Ruby committers that I have no idea about. It becomes something that you can understand a little bit better, maybe. And I think that the hacking guide was also made to reduce the barrier to start doing open source.

50:53

So I think this point falls back to the pick that Nate brought up about open source being awesome. We'll link that too. Very cool. Yeah, cool. One more question. If people want to find you online and see what you're working on these days, how do they find you? Yeah, it's kirshatrov on Twitter or kirs on GitHub. Awesome. All right, well, thank you for coming. This was really interesting. I want to ask like a dozen more questions, but we just don't have time, so maybe we'll have

51:24

you come back. Thanks for inviting me. I'll be happy to come back. All right, well, let's go ahead and wrap this one up, folks, and we'll come back next week with another episode. Thanks a lot. Bye bye.

Transcript source: Provided by creator in RSS feed
