Early Twitter's fail-whale wars | Dmitriy Ryaboy | Software Misadventures podcast

00:00

There was one outage that I distinctly remember. Long story short, and without blaming anybody, a bunch of data in Hadoop got deleted. Like, 70%. No backups. It got deleted. It's gone. And the two emails I sent, one to engineering, one to the whole company, are, I believe, still linked at Twitter. Twitter has a go-link service. So if any current Twitter employees want to go to go slash best email ever and go slash second best email ever, those are both mine.

00:30

And it was like, you know, good news: we have lots of space in Hadoop. Bad news... It was an email I wrote with this sort of "this might be the last email I write at this company, because I might get fired after this." Because we lost all the freaking data, and my job was data.

00:54

It's sort of "pressure makes diamonds." Later I heard people say, you know, the joke was, "I love hiring ex-Twitter people, because no matter how much everything is exploding, they just go, eh, I've seen worse." Because there was stuff that was really, really bad, but also sometimes the worst times are the best times. One thing you mentioned was the acquirer side, like doing the technical due diligence, which is something you were involved in once you joined Ginkgo.

01:21

So what does technical due diligence look like from the acquirer's standpoint? At what point do you feel like, oh, I'm satisfied enough that this looks okay? You're looking for deal breakers, right? If you're doing technical due diligence, unless it's specifically, like, we're acquiring the magical technology and it's going to be magical.

01:39

And if it's not magical, the deal's not worth it, right? Usually you're acquiring for some other reason. You're probably acquiring for a combination of: there's some code and really good talent, and, you know, it positions us well for whatever strategic reasons, right? So if you're at the point where you do technical due diligence, you're looking for deal killers. Not for, like, "I wouldn't have done it quite that way." Right? That's not a deal killer. That just adds to your integration estimate.

02:06

Welcome to the Software Misadventures podcast. We are your hosts, Ronak and Guang. As engineers, we are interested in not just the technologies, but the people and the stories behind them. So on this show, we try to scratch our own itch by sitting down with engineers, founders and investors to chat about their path, lessons they've learned, and of course, the misadventures along the way.

02:37

Welcome back to the show, Dmitriy. Fun place. And I thought we could continue this conversation with your LinkedIn profile. You mention being a veteran of early Twitter's fail-whale wars. Maybe you can take us back a little bit and talk about that. Yeah. So I joined Twitter on whatever the first working day was, January 2nd or January 3rd of 2010. And Twitter was a bit north of 100 people, and it was sort of very popular, but not nearly as popular as it was going to get eventually.

03:12

And it was already in the throes of the infamous rewrite from Ruby to Scala. The fail whale was the error page. So whenever there was an internal server error, the user saw, like, a cute cartoon of a whale, which was then christened the fail whale and became for a while synonymous with Twitter.

03:30

And for the first couple of years, basically, the majority of the work was chasing down one problem after another that was causing these instabilities, while also rewriting it from the original Ruby on Rails app into kind of a microservices architecture on Scala, and trying to keep up with the user growth and all that.

03:54

And that was a really stressful and fun environment at the same time. A lot of very, very rapid learning, as, you know, the whole company had to grow very rapidly. It was kind of the hyperscaler situation. When I joined we were in a colo. Cloud services existed, but weren't as popular. We essentially were renting servers in a colocated facility. And it was kind of a virtualized situation, so we didn't actually have good visibility into where the servers were in relation to each other.

04:25

And by the time I left we were running in, like, several data centers that we were running ourselves. Right. So it was kind of a major migration. And that's just one aspect of the scale and the growth. I think for me those were hypergrowth years personally, as an engineer, because of the exposure to so many people working on these problems, and so many of these problems, and kind of how they were emerging.

04:50

It was really fun. It was really stressful. I really learned how to be on call well, and how to respond to incidents and, you know, firefight without losing your mind. This was the data... this was, like, the data platform team, right? Like, did you guys have, like, dedicated...

05:09

There was a data platform team, but, first, data platform ended up getting involved in surprising ways. And second, when I started, the company was small enough that I just sort of knew the other teams pretty well too. And I would get pulled in just to, like, figure tough stuff out. But to give you a sense of how the data platform would occasionally get roped into total site outages: in 2010, that was the first year I was there, we were still in the colo.

05:39

And we were extremely constrained by the network bandwidth within the data center between the hosts. And so we were getting into the situation where we were getting a lot of errors because the web servers couldn't access the memcache servers. So there was a big memcache cluster, and there were the web servers, which were very, very light servers meant to be stateless.

06:04

And their requests were timing out; they just couldn't get through to the cache servers, which is basically where all the timelines are on Twitter, or at least were at the time. The way the whole thing worked was: you post a tweet. There's a background daemon that sees that a tweet happened, and it materializes the timeline for everybody who follows you, updates it, and sticks that in the cache. So when you actually read stuff, it's not coming from a database. It's kind of prebuilt for you.
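[Editor's sketch] The write path described here is commonly called fan-out on write. A minimal Python illustration follows, assuming an in-memory dictionary stands in for the cache cluster. The names `followers`, `timeline_cache`, `fan_out`, and the `TIMELINE_LIMIT` cap are all hypothetical and are not taken from Twitter's code.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # hypothetical cap on cached entries per user

followers = defaultdict(set)  # author -> set of follower ids
# Stand-in for the cache cluster: user -> newest-first tweet ids.
timeline_cache = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))


def follow(follower, author):
    followers[author].add(follower)


def fan_out(author, tweet_id):
    """Work the background daemon would do after a tweet is posted:
    push the new tweet onto every follower's prebuilt timeline."""
    for user in followers[author]:
        # appendleft keeps newest first; maxlen drops the oldest entry.
        timeline_cache[user].appendleft(tweet_id)


def read_timeline(user, count=20):
    """Read path: no database query, just a lookup of the prebuilt list."""
    return list(timeline_cache[user])[:count]


follow("bob", "alice")
follow("carol", "alice")
fan_out("alice", 1)
fan_out("alice", 2)
print(read_timeline("bob"))
```

The trade-off this shows is that reads are a single cache lookup while each write touches one cache entry per follower, which is why a web server that cannot reach the cache cannot serve a timeline at all.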

06:30

And I'm eliding a bunch of technical detail, but that's the big picture. And so it's a very big problem when your web server can't hit the cache, right, because that means that you can't get anything. And long story short, we were like, we have to kill anything that uses any bandwidth whatsoever.

06:48

So for example, we turned off the Hadoop cluster in the particular moment when things got particularly bad, because everything was sitting on top of each other. We were like, okay, the web server logs are being logged to Hadoop, to HDFS. That's taking up too much bandwidth. And if, God forbid, HDFS decides to rebalance, you know, shifts a bunch of things between the data nodes, that will just flood the network. We can't have that. We just shut down the whole Hadoop cluster.

07:17

And the day we did that was the day that Jimmy Lin joined Twitter. He was a professor at the University of Maryland. He is now in Toronto, I believe. And he literally joined because he was analyzing, like, graph data, social networks. He's like, "I'm going to use their awesome Hadoop cluster." Sorry, there is literally no compute here.

07:40

It's just whiteboards and stuff. And the issue, it just didn't make sense at first, but the issue turned out to be that the virtualized network our colo provided to us hid from us how they actually laid out Twitter's growing footprint. So you start off with Twitter at a certain size, and it rents whatever N servers, right, of different configurations, and they're sort of in the same physical pod somewhere in the data center.

08:08

And so you've got your memcache servers, you've got your web servers, you've got your Hadoop servers, and they're different configurations, right? Hadoop servers: massive disk. Memcache: lots of RAM. Web servers: very thin, but you have lots of them because you want lots of parallel cores. And then you want to grow your web servers, or maybe you want to grow your Hadoop cluster, maybe you want to grow something else.

08:30

And there's no physical room in the pod anymore, so they get the next pod, right, and they start allocating things there, but to you it looks like a flat network. Right. So now what's going on is eventually you get into the situation where all of your cache, well, like two thirds of your cache servers are over here on the left side and your web servers are on the right side, because maybe you didn't get to scale them as fast.

08:53

And so they have to go through the interconnect between the pods, which is really thin, and your data nodes for Hadoop are spread across all your pods. And so whenever you run any MapReduce job and it does a shuffle, it just completely saturates the link for your web servers. Right. And any call between your web server and your cache server is trying to get through while, like, MapReduce is trying to shuffle 100 petabytes.

09:22

It was probably 100 terabytes back then. But we had no idea of that network topology. One of our first sort of data viz hires actually spent a while working with the network engineers and the ops folks to do, like, pcap captures of network packets and figure out which protocol is talking to which IP, and sort of create a map so that we could figure out that, like, OK, all of these servers are over here and all of those servers are over there.

09:51

And this is talking to that on the MySQL protocol, and that's talking to this on the TCP protocol. So now we know who's doing what, because we needed to re-engineer that. So when you ask, like, what the hell was data doing getting involved in the fail whale stuff: you'd think it's adjusting settings on your web server or, like, dealing with timeouts. It's that kind of stuff, right, because the world was bananas.
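[Editor's sketch] The mapping exercise described here boils down to aggregating captured flows by where each endpoint physically sits and which protocol the port implies. The Python sketch below works on assumed inputs: the flow tuples, the `POD_OF` lookup, and the pod names are hypothetical. The port numbers are the well-known defaults for MySQL (3306), memcached (11211), and the Hadoop 1.x/2.x DataNode transfer port (50010); whether Twitter ran on the defaults is not stated in the conversation.

```python
from collections import Counter

# Hypothetical lookup from IP to physical pod, the piece the flat
# virtualized network was hiding.
POD_OF = {
    "10.0.1.5": "pod-a",
    "10.0.1.6": "pod-a",
    "10.0.2.7": "pod-b",
}

# Default ports, used here only to label traffic.
PROTOCOL_OF_PORT = {3306: "mysql", 11211: "memcache", 50010: "hdfs-data"}


def summarize(flows):
    """Sum bytes per (source pod, destination pod, protocol).

    Each flow is a (src_ip, dst_ip, dst_port, nbytes) tuple, as might
    be extracted from a packet capture.
    """
    totals = Counter()
    for src, dst, dst_port, nbytes in flows:
        key = (
            POD_OF.get(src, "unknown"),
            POD_OF.get(dst, "unknown"),
            PROTOCOL_OF_PORT.get(dst_port, f"port-{dst_port}"),
        )
        totals[key] += nbytes
    return totals


def cross_pod(totals):
    """Keep only traffic that must cross the thin inter-pod link."""
    return {k: v for k, v in totals.items() if k[0] != k[1]}


flows = [
    ("10.0.1.5", "10.0.2.7", 11211, 100),
    ("10.0.1.5", "10.0.2.7", 11211, 50),
    ("10.0.1.5", "10.0.1.6", 3306, 10),
]
totals = summarize(flows)
print(cross_pod(totals))
```

Sorting the cross-pod totals by byte count would surface which protocol is competing for the interconnect, which is the question the team needed answered before re-engineering the layout.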

10:12

So that was one. There's also one where our first move to a data center was hilariously disastrous, and there was both... And sorry, sorry, before going into that: so basically, shutting down Hadoop, so, poor Jimmy, but did that solve the problem? Like, from that decision, you guys were like, OK, well, now the problem's gone, so obviously that's the reason? Oh, no, that was just to get through, like, the World Cup or something. Like, let's share that one more time.

10:49

So the one story leads to another. So first we were like, we'll run HDFS, we don't want to lose logs. Logs were being written, you know, via Scribe to the distributed file system. We'll just not run any MapReduce jobs. And then we discovered that the occasional rebalance, like the data nodes making extra copies on other nodes, and not knowing about the actual layout, like, even that...

11:17

That would cause enough chatter that it would create a spike of errors on the Twitter website, so we had to shut the whole thing completely down. We just ate not having logs, which is hilarious when you're trying to troubleshoot a problem.

11:32

I don't recommend this. This was definitely a kind of last-resort thing. That story just comes to mind because it is so particularly egregious. Most of the time it wasn't stuff like that. But yeah, there was a bunch of yelling at our hosting provider to get them to move servers around, to get them to add network. That's how you actually solve it. And also, you know, try to...

12:01

I think we did some stuff where we added some small caches on the web servers so that they could be... no, it was more like user pinning. So we started getting clever, but the first thing is just, you know, address the immediate pain, and then re-architect, and then eventually it will get fixed, right?

12:21

It was one of the lessons, probably: first restore service, second fix the problem, right? Not the other way around. Engineers really want to go, "But why is it happening? Hold on, let's really observe it." No, don't really observe it. The site is failing; get it up. I literally said that this morning to a colleague of mine: just mitigate first and then figure out why it's happening.

12:44

Yeah, like, capture. And that's where, as you get experience, you also learn how to capture the right telemetry, right? Get your core dumps, capture all the possible state that you can capture, and then if the solution is "reboot it and it mysteriously works," reboot it. Yeah, for now, that's right. I'm gonna start on a staging server, right.

13:05

Yes. With a decision that big, to be like, okay, let's shut down this Hadoop cluster, how do you even go about escalating that kind of decision? Because there are multiple teams, not just yours, that are impacted by this, right? Yeah, I mean, it was very small. Now people are like, well, how would you do it at Twitter right now? I guess I'll just say shout out.

13:32

But it was a very engineer-driven culture. And so, you know, when there was a big problem... I mean, we had a CTO, but I couldn't tell you who was between the CTO and, like, my boss. And I was an engineer at that point; I wasn't managing yet.

13:51

I don't know that we had, like, VPs or directors or anything. It was the people who are sort of, like, well, there's the guy who knows about the web services, and, you know, the woman who is, like, queen of cache, and an ops guy who knows all the ops things, and they're like, "Oh shit, everything's on fire."

14:11

"What do you think about shutting down the Hadoop cluster?" "We have to ask Dmitriy about that." I go, "Well, I want the site to work, so yeah, let's shut down the Hadoop cluster. I'll tell everybody." That's kind of it. It was very organic. A lot of war rooms. You know... what were you saying? I was just saying it sounds like a fun environment. There's a different kind of fun in being in this environment where you're just going through a bunch of fires and learning things together.

14:41

Yeah, yeah, absolutely. It's sort of "pressure makes diamonds." Later I heard people say, you know, the joke was, "I love hiring ex-Twitter people, because no matter how much everything is exploding, they just go, I've seen worse." There was stuff that was really, really bad, but also sometimes the worst times are the best times. Yeah, for sure. I think this is an experience that a lot of

15:06

engineers don't necessarily go through unless they worked on the ops side at some point in their careers. And this is something I see pretty regularly when it comes to incident management. Like, you see, for example, folks who've been SREs in their past lives or are SREs today.

15:21

No matter how big the outage is, you see the calmness in how they're dealing with the incident. You see how they are able to walk through the problem, able to mitigate it in time, instead of, like, "What the fuck's happening, the site's on fire." Yeah, well, not freaking out helps. And also, I think I put some of that in the book, or maybe I didn't and just meant to, but there's a set of skills you learn for dealing with it.

15:50

And for dealing with an incident, it sounds simple when I say it, but so many people don't do it: giving updates, right? You know the thing is broken, you're working on it. Even just saying "I'm still looking at it," or "I am looking at this particular area, is anybody else looking at another area?"

16:10

"How's it going with looking at that thing?" And dropping in observations, in Slack nowadays, right, in a Slack thread, just screenshots and "oh, this thing looks funny" or whatever. Just being vocal, being loud, you know, in a certain dedicated space for a serious incident, not for just some minor troubleshooting, right? It is extremely helpful, because then you can get multiple eyes on it and people know what's going on. There are so many even very experienced and smart engineers who,

16:39

when faced with something failing, kind of go dark, and then they might show up at some point, you have no idea, right? They might be working feverishly hard, just going all out, but because nobody knows anything, it's very hard to manage the incident. It's very hard for your support people to support you and support the users. It's hard for managers to answer, you know, CEO emails of, like, "What the hell?" And a lot of it is just about communication. You can be doing the same exact things, but with

17:08

good communication the incident will go so much smoother and probably be resolved so much faster. Yeah. I remember one piece of feedback that I got from, like, the second manager I had, pretty early, was "you need to bring the team along," which at the time, right, I was very young and foolish, I'm still foolish today, I was like, "What do you mean, man? I've tried to do all this, you know, fixing all these things, like I've got time for that." But now that, you know, I'm on the other side, like, holy.

17:37

Yeah, that makes such a huge difference in terms of actually creating a sense of, like, oh yeah, we have this under control, there's a process to it, rather than just having things be very chaotic. And when a lot of people are involved, because maybe there is a problem where it's just unclear what exactly is going on, right? Like, we're seeing these errors, but, you know, we're living in the microservices world, right? These things propagate in weird ways, back pressure builds up.

18:06

Who knows, right? What you're seeing is, well, the site's erroring and MySQL has errors. Are the MySQL errors at all related to the site erroring? We don't actually know. Hey, where else can we look? There's 5,000 Grafana-ish dashboards, right? There's a lot to do when it's not clear what's going on and the system's complex. And now I'm thinking more like Twitter circa 2015, right, when there's a lot, and there's a lot of instrumentation, and it becomes its own problem.

18:33

Having experienced people who just take on the role of incident manager, and not just, like, sometimes the incident manager is the person who does the reporting and says "we will update you in the next 15 minutes," but just sort of a coordinating traffic cop, right? "Okay, we're exploring these three hypotheses about what's going on, here's where we are on this, here's who's working on them."

18:59

"Anybody has any new ideas, right, funnel them through me so we can keep everything organized." Just kind of keeping everybody coordinated on that is also super, super useful. And there was a lot of stuff like that that we either invented or reinvented or learned, you know, in those first few years of Twitter going gangbusters. Yeah, it was really fun. There was a data center that caught fire. That's a great start. It never ends; we could just keep going for a very long time.

19:35

Very entertaining, is what I would say. Okay, wait, so the data center that caught fire, yes, please go ahead. You know, we moved in a bit early. The data folks were the first people to move into this data center that Twitter was building out, and it was sort of

19:52

that our colo host couldn't keep giving us space, so we needed to move somebody out, and it was like, well, the offline data doesn't need to be as physically close, and if the connection drops or whatever, we can survive that. Nobody wants anybody to be out, but offline data processing is both big and separable, right? So it made sense for us to sort of be the first through the breach.

20:16

But they were kind of still building the data center, it turned out, when our racks got installed in there. So at some point there were, like, guys welding something on the roof, and they didn't protect it properly, and I guess there were sparks. In this context you'd think it would be something else, but no, literal fire. Fire through the roof. And so there was a small conflagration, and then the sprinklers turned on, right? And so,

20:49

flooding. So first you have fires, and then you have flooding, all while the servers are supposed to be running. That was fun. Yeah, there was a bunch of things. There was one outage that I distinctly remember. Long story short, and without blaming anybody, a bunch of data in Hadoop got deleted. Like, 70%. No backups. It got deleted. Just gone. It was sort of a combination of a misconfiguration of a tool that allowed you to do that sort of thing in the first place.

21:27

Because normally you wouldn't do it, but we had to, and then in that case a slight misconfiguration was catastrophic, right? As usual, this happens over the weekend. You know, the person who ran the fated command reported to me. We figured out this was happening. We're in the office, the small office, literally sitting back to back. He's trying to fix the thing; I'm trying to recover what we can recover and update the whole company about what happened.

21:53

So there were several things out of that. One is I wrote several emails kind of updating the whole company, which was just, like, all@ or something, whatever. And the two emails I sent, one to engineering, one to the whole company, are, I believe, still linked at Twitter. Twitter has a go-link service. So if any current Twitter employees want to go to go slash best email ever and go slash second best email ever, those are both mine. And it was like, you know, good news: we have lots of space in Hadoop.

22:27

Bad news... I wrote it with the sort of "this might be the last email I write at this company, because I might get fired after this." Because we lost all the freaking data, and my job was data. And, you know, the engineer was mortified. I sort of... I don't even know how many people still know who that was, because I just kind of

22:54

wrote it all in first person. It was like, "This happened. If you have any questions, ask me," blah blah, right? Kind of trying to give him space and cover. And I remember my boss at the time basically was like: one, you know, this happens, don't worry about it, I've got you; and two, eventually this was going to happen. It doesn't feel like it right now, but, and it was like that first year, it was in, like, 2010 or 2011,

23:23

it is going to be such a relief that this happened now, and won't happen anymore at this company, than if this happened, like, three or four years from now. And sure enough, the amount of data we lost seemed so massive, it was, like, everything. We were generating that much data in a single day three or four years later. Losing that data would be like, yeah, you know, it's a day's worth. It's not, but...

23:53

You know, when, like, public reporting is based on data in Hadoop, when all the machine learning recommendations, right, just everything is so, so tied into the data platform that we built, that would be so much more impactful and so much more hurtful. That first year it was like, okay, well, some of the operations we used to do we won't be able to do, and some we can recover, but it'll take us a few days. It felt huge at the time, and in retrospect, it just...

24:28

It didn't matter that much. Like, it didn't affect the trajectory of the company at all. In the big picture it didn't matter. It's one of those things that happen. But also it's useful to have this sort of perspective of how bad is it now versus where are we going to be in a couple more years, right? And obviously doing the right thing when it happens, right when it happens.

24:49

Making sure it can't happen anymore. It's so wise, what your manager said, right, in terms of, like, oh, this is bound to happen. Okay, now that I say it, it sounds a bit cliché, but where had he worked before that gave him this sort of perspective? So this was Kayleigh to Orgerson, and he had his own sort of data consulting practice before Twitter. He was consulting with Twitter on, like, setting everything up, and then they convinced him to join full time.

25:20

I think before that he knew some of the folks who ended up being at Twitter through maybe his work at or with Yahoo, something like that. Remember when Yahoo was a super legit tech company?

25:37

Back in the day. It's interesting, I've heard this sort of argument before, right, where if you're having a lot of issues, you're really trying to fight for this idea that it's really about prevention and that's super important, and sometimes the answer is just, hey, just let it drop, right? Let it actually, you know, catch on fire, and then people actually look at what happens, instead of you wasting all this time trying to, you know, advocate for it. Is that, yeah,

26:06

something that you agree with, or you've seen it done well? I think it depends on the specific thing that we're talking about failing, right? There are some things that can't fail live, right? I don't know, missile control systems, right? Things involved with people's lives. Operations that are regulated, right? You can't have fraudulent transactions, things of that nature. But there's a lot more that can fail than people think. I'm trying to remember the expression.

26:46

It's something like, "We all test in prod. Some people also have a staging environment."

26:57

And I think there's less talk about it now. A few years ago people talked about it a lot, the notion of testing in prod. And if you know the book Accelerate, you know, they sort of lean on that stuff, where the idea isn't that you only test in prod; the idea is that you know that stuff is going to fail in prod, and if you trust your staging, your testing, your QA and acceptance and everything else too much, then you're going to fall really hard when things fail in prod.

27:28

I think nothing tests your system like real life and real users. So if you start from the assumption that things will fail, then you need to have the observability, the ability to debug, the ability to capture all the data you need to understand things, right, and the ability to recover from errors.

27:45

So save the proper state so you can do replay, expect that a transaction might happen several times and use idempotency, things of that nature, and you'll be in a much better place, right? So it's less, I think, about "just let it fail," and more like: expect that something will fail that you didn't expect, and think about how you're going to recover, right?
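[Editor's sketch] "Expect that a transaction might happen several times" is usually handled with an idempotency key: every request carries a unique key, and the receiver records which keys it has already applied. A minimal in-memory Python illustration follows. The `Ledger` class, its method names, and the log format are hypothetical, and a real system would keep the applied-key record in durable storage, updated in the same transaction as the change itself.

```python
class Ledger:
    """Toy account that applies each keyed transaction at most once."""

    def __init__(self):
        self.balance = 0
        self._applied = {}  # idempotency key -> balance after applying

    def apply(self, key, amount):
        # A retry or replay with a key we have seen returns the earlier
        # result instead of changing the balance a second time.
        if key in self._applied:
            return self._applied[key]
        self.balance += amount
        self._applied[key] = self.balance
        return self.balance


def replay(ledger, log):
    """Re-run a saved log of (key, amount) entries, e.g. during recovery."""
    for key, amount in log:
        ledger.apply(key, amount)


ledger = Ledger()
log = [("txn-1", 10), ("txn-2", 5)]
replay(ledger, log)
replay(ledger, log)  # replaying the same log is harmless
print(ledger.balance)
```

Because the keys make `apply` safe to repeat, recovery can replay the whole saved log without first working out which entries had already landed.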

28:07

Some of us are lucky enough to have been actually taught something about distributed systems. All of us who are working on the web are writing distributed systems; it's just that some folks haven't been told that. Anytime your web server makes a call to a database, congratulations, you're in the...

28:23

If you made a write to a database, the database might be able to understand transactionality, but you're outside the thing. If you have multiple web servers trying to make the same write, they might write it three times, and the database won't know it's the same thing unless you thought about it.

28:38

Right, it's a distributed system. And a lot of folks are just sort of, "I make the call and then the thing happens." Well, what if you make the call twice? What if you need to undo the call because it's actually part of, classic example, shopping carts, right? "I tested it, I tested it with my unit tests, how would this happen?" Yeah. Or you create crazy elaborate systems. In some cases they're a very good

29:04

investment, but not everybody can make that investment. And sort of being prepared for these kinds of errors, and being able to detect the errors, being able to recover from the errors, is hugely important. And I think, I mean, that's a data engineering thing, but it's also any kind of service engineering thing, right? If you're operating a 24/7 web service or web-based service, you need to think about those things and not just

29:32

rely on sort of testing. Testing is good; definitely test, test as much as you can. So some of the things you're mentioning are also in the book that you have, which is The Missing README. We'll definitely link it in the show notes, and highly encourage people to go check it out. What I'd like is, get a copy; you can also see where to buy the book, not just

29:56

check it out. What prompted you to write the book in the first place? I think it was running into situations multiple times where folks who were new grads, or kind of a couple of years in, would be good programmers and very capable of doing whatever we needed them to do, but wouldn't know these things that a lot of us take for granted, that are kind of unwritten. Once you're in the industry for a while

30:25

you can pick up how the industry does things, and because we sort of just picked it up from our peers, we forget that it's knowledge you have to acquire. And then the new person comes along and they're sort of struggling for a while, and then maybe they pick it up. It doesn't have to be that way, right? And a lot of problems can be avoided if you just explain why things are the way they are.

30:51

And so after yet another round of sort of explaining to somebody who is very bright, intelligent, capable, but is doing things kind of the wrong way just because nobody ever told them what observability is, or, like, you know, logs are not just print to stderr or something, right, or what goes into a log, right? It's just stuff like that. It's not rocket science; it's just stuff that you need to know, and you

31:20

can eventually learn, right? Just, why don't we write it down in one place? My vision for it was a tech lead has a stack of these things on their desk, and they get their new batch of new

31:32

grads, new grad hires or interns or whatever, and it's just like, "Read this over the next three weeks. If stuff is weird, if you don't understand why we're doing a standup or something, right, the answer's probably in there. I'm also happy to help you, but here's an answer." So you can just read that and you'll be, like, 80%

31:50

there. And so I shared that observation with somebody I knew from Twitter, actually, like, on Twitter, not from the company, Chris Riccomini, who was also tweeting about something along those lines, and we sort of decided that, yeah, it would be fun to actually write a book. We played around with ideas of courses and other things, but settled on a book, and it was really fun,

32:14

and it was surprisingly hard to sort of write down the basic stuff. Oh, how does the basic stuff actually work? In that process, even though, right, like you said, some of these concepts are fundamental, did you learn new things in the process of writing it?

32:30

I think so. There were definitely some concepts that I sort of knew but never examined, and trying to write them down and explain them caused me to re-examine them. I also read a lot about different approaches to how to build available systems, or what kind of log goes where,

32:50

how to handle exceptions, you know, dealing with null pointers, and at some point you sort of just have to decide which of the sets of ideas you go with. But that part was really interesting. Most of the time writing the book isn't the actual writing; it's reading and trying to synthesize, like, what am I trying to say here, and what do I actually believe in? I read the argument; do I actually believe that argument?

33:14

So there was a fair amount of that. We also have a chapter there about Scrum and agile, even though I'm not a particularly strong believer in capital-S Scrum and capital-A Agile. It is hilarious to me that the first thing the Agile Manifesto says is something like "people over process," or "getting things done over process," something like that, and then we codify these super elaborate processes, and, you know, there's, like, the

33:43

retros and the standups and the planning poker and this and that. There's just so much of it. And people talk shit about agile and Scrum because what they're exposed to is the process, and the process doesn't work, right? So I spent a bunch of time in the book kind of explaining what the process is trying to do, and trying to say: understand why it's there, and then you'll be able to use it or toss parts of it out. What you can't do is either just say

34:11

this process is stupid, I'm not going to do it, and just have no process, because you're probably going to fail; these things emerged for a reason. Or follow it in, you know, the sort of cargo-cult way, right? You sort of go, well, we're supposed to write stories as "as a blank, I want to blank", so I'm going to write a user story that says "as a sysadmin I need", you know, get it to you

34:37

to update it to 1.3.7. It's like, no, that's not a user story, that is just a waste of words. If that's all you want, just write down what you want. I'm writing enough hours exactly that I don't think like this seems really stupid. But I mean, you know what's funny? The number of hours I've seen spent on how story points should be, should they be based on number of days,

35:06

number of hours, some random thing you make up and say, well, it's logically this thing, or relative to how much time other things take. And so the reason I was laughing so hard at the process is because I had a colleague on my team who was a scrum master. I loved working with that colleague. Anytime we would have a standup, he would basically say, "I'm going to wear my scrum master hat," and I would always laugh out loud when he would say that.

35:30

I'm like, I care the least about your scrum master hat. Anyway, everything that you were saying was just reminding me of all those conversations and how many hours people burn on the process and on planning to get stuff done, and sometimes the amount of time it would take to get that done is less than the amount of time it takes to just put the process behind it.

35:51

Blind following of the process is very bad. And the guy who invented points is now like, "I would like to take that back, because people did not get what I was trying to do here."

36:03

Whatever it was meant to be, sort of the agile consultants have taken over. But there is merit to trying to say, like, these tasks are different sizes, how much can I actually take on? Having some sort of methodology to not overcommit. Without that stuff, every single team I've seen, right, you commit to a bunch of stuff and at best half of it is done at the end of your time box.

36:31

Because, like, we're all bad at estimating. Let's acknowledge the fact that we're bad at estimating and bake it in, right? How do we get better at estimating? Well, like if we had some several projection what could you have it our error is not a good job it not. But then, so like, the logic makes sense, it's there. Like, should a point always be a day, should it be a

36:50

Fibonacci sequence? Like, there are reasons for making it a day, there are reasons for making it a Fibonacci sequence, but if you take it as religion and dogma it doesn't do anything. It fails when you kind of take it as dogma; it works when you understand why it's there. And just like with any kind of, you know, like when you're in high school and they teach you the essay

37:13

structure and it's very stilted and they insist that you write it that way, and you're like, but when I read, you know, Joan Didion or somebody like that who's an excellent writer, they never do any of this. And that's because they know the rules. When you know the rules, you know which rules to break and which rules to stretch. When you don't know the rules, you just write nonsense that is impossible to follow.

37:33

So, like, learn the rules, understand why it's there. Once you understand why it's there, absolutely, toss it. But don't follow the thing without understanding why it's there. You know, I did not expect this to be about agile, I was hoping to hear about the engineering. I think every engineer, when you talk to them about agile, has some strong emotion associated with it: either they hate it or they love it,

38:01

all of that. So in terms of writing the book, like you mentioned, it was a very hard thing to get through, and some other folks have also mentioned that writing a book is not just a lot of hard work, but it might not be as lucrative as one might think. One, I want to know your opinion, whether that's true or not. And again, I know the measure of lucrative can be very different. Guang is giggling, I don't know what's going on,

38:34

but I'm trying to validate what I heard and whether that's true or not. The other part is, did you already know going in that this is what would happen? So first off, unless you have some sort of ridiculous hit on your hands, and there are a few, chances are you're going to make, like, pennies per hour for the amount of time you spend writing a book.

39:00

So from a financial perspective it doesn't make sense. For some people it makes sense if, like, they're using it to advance their consultancy, right, or create a brand, and part of the brand is being an author of technical books.

39:18

Or there are some people who write these runaway hits, and they probably do make decent money from it, although I suspect the ones that I'm thinking of were not setting out to make money. They were just like, this book needs to be written, so I'm going to write it, and then if I make some money from it, fine.

39:35

For example, you know, in the data space, Martin Kleppmann's book, right, Designing Data-Intensive Applications. It's like 10 years old now and it's an absolute classic. It's everybody's first recommendation: if you're involved in data or distributed services at all, you have to buy this book. By the way, you have to buy this book.

39:53

So he's probably made some decent money out of it. I highly doubt he wrote it because he was expecting to, like, cash in, right? He wrote it because he felt that this book needed to exist, and he's very good at explaining things and has a very encyclopedic and broad knowledge of the space.

40:13

So yeah, I knew that going in. I know a few folks who have written books before and they all said that. I think Josh Wills showed up on your podcast, so him and a bunch of others. I was pretty lucky to work with folks who were very knowledgeable, you know, published a bunch. It was the sort of thing that I felt is a service. The vision wasn't piles of money in, you know, my bank account; it was stacks of books on tech leads' desks, like the vision I just

40:44

wrote before, and I was like, that would be awesome and cool and I want to make that happen, so we can just generally lift up the level a couple inches. Yeah, so that was it. What is the process of writing like, especially when you're working with a co-author? So do you want to know the actual writing, or sort of the pitching and everything else? Actually both. So one aspect is, like, you both saw each other

41:10

on Twitter and you're like, hey, seems like there are some common topics we are talking about, so let's get together and write the book. But then when you come together, what does the process of writing a book look like? And the reason I'm asking this is we've spoken with folks about writing in general before, it's something that is

41:27

a topic that both of us are interested in, and we've also heard a lot of our listeners are interested in, like, how to get better at writing. And one aspect is, it's hard enough to write a blog post, let alone write a book. So what does that process look like, going from outlines to chapters to the final book? Yeah, and I would also love to know some aspects of, like, finding a publisher, for example: how do you go about doing that, and things like that. So I guess going backwards a little bit,

41:57

in terms of the actual writing, when you have sort of an outline or at least a general idea of what the different chapters are. I don't know how it works for other co-authors; for me and Chris it worked pretty well, because we're both pretty experienced open source contributors and we're very used to this general idea of

42:17

I'm going to do a thing, some people weigh in on the design and, like, what to watch out for. Maybe I have a first draft of it and they say, oh, don't do it that way, do it a different way, and you rewrite it, and maybe they, like, put a patch on your patch to sort of adjust some things, and then, you know, some third person says, hey, you should have a test here or there, right? Like, you just have this very iterative, heavily reviewed, collaborative development process.

42:42

And both of us sort of marinated in that kind of environment, like both of us are Apache committers and on Apache PMCs, so that part of the collaboration came very naturally, right? We just, I wrote a thing, I hate it, but there's some stuff here, please take a look. He does a review, edits it, you know, with the edits visible in a Google Doc.

43:08

Accept, accept, accept, oh, let's talk about this. Gigantic sidebar discussion in comments, you know, finally resolve, and so on. But it felt very natural, like it felt like writing code except in English, you know, so that part was great.

43:25

We had some hijinks in, like, going between Google Docs and trying to use Git, then trying to use Word, Microsoft Word Live, because of, like, publisher constraints. That part wasn't so fun. But the actual sort of exchanging comments and figuring out what we're trying to do, I think we had like maybe three calls slash video calls through the whole time; the rest of it was just back and forth in the docs.

43:50

It was like a month ago. Chris and I hadn't met. We've known each other since like 2011; we hadn't met in person until a month ago. So you don't have to meet in person. Wow. Yeah, yeah, and we live like an hour and a half away from each other.

44:07

But it was like pandemic, kind of. Yeah. But yeah, that part was great, and the first step of that was writing the pitch, and so making sure that we're on the same page and talking through, like, what do we actually want from the book, what should be in, what should

44:23

be included, why is this important, right? Just making sure we're on the same page. And I think for both of us there's enough mutual respect, you know, that we don't feel it's a loss of face to give on something, and we trust the other person, like, if they're insistent on something, that they know what they're talking about, right? So that was good. In terms of finding a publisher, most tech publishers have very clear sort of proposal guidelines. O'Reilly has a Google

44:56

Doc that you can copy that basically you fill out, and it asks, you know, what is this book about, why are you the right person to write this book, you know, give an outline, are there other books about this, how's yours different, those kind of questions, right? And you just, like, email workwithus@oreilly.com and they get back to you. And then for The Missing README we actually went with No

45:19

Starch, which is a different publisher, because we thought that they got what we're trying to do a little bit better. Because The Missing README is kind of a weird book in terms of defining its audience, because it's a technical book but it's not about code, but it's not a, it kind of straddles

45:37

a bunch of things. And for that one we also had a draft chapter. It was chapter six; the first thing we wrote was a chapter on testing, because that's what everybody loves, right? And we asked for a sample edit from publishers, like the publishers who were like, "we would like to publish the book." We said, okay, we have several publishers who said they would like to do that.

45:59

We want to go with somebody who gets what we're trying to do, can we see what it looks like? Because we hadn't ever worked with that. So this is before you said yes to the publisher, they edited the chapter that you wrote? Yeah. Okay, is that pretty common? I don't think that's common, but we were in a situation where we were choosing between publishers and

46:23

they were pretty involved, and both of them were sort of pushing the book in different directions, and, well, how do we decide? I think the real question was, like, what is the actual value of working with the publisher versus self-publishing online, right? It's a professional editor and the distribution network, right? And so we wanted to find out. We understand what the distribution network does; we don't understand what the professional editor does, because we hadn't worked with one, so we wanted to see what that feels like.

46:52

Were the notes you got back super different, like, from the different editors, like the tone? Yeah, yeah, it was. Now it's been years so I don't quite remember, but I do remember that both the kind of comments we got and the volume of comments we got were quite different, and I think we actually went with the ones who gave us more, you know, reading, because we wanted the tough love.

47:21

And they were very helpful. You know, financially, self-publishing probably would have gotten us more money but fewer readers, because you take a much bigger cut when you, you know, publish on Amazon or something. You don't get as much distribution, like not as much visibility. But I think the

47:41

editor contributed a lot and sort of pushed us to stick to some conventions and structure that we were deviating from, being sort of newbie writers. It really helped us also clarify things, be more succinct. I would sometimes go on, as I do in this podcast, and have, like, flowery language, and they were just like, cut that out ruthlessly, just make it very,

48:08

very easy to read, you know, appeal to a broad audience, not all of whom are native or very fluent English speakers, you know, things like that. So I think it was very helpful. So this was an interesting thing. Some of the blog posts that I like reading are people kind of describing things in a very organic way, as if, when you're reading their blog post, the person is talking to you.

48:32

One example that comes to mind is Tim Urban. He has a blog, Wait But Why. Amazing blog, I would highly encourage folks to check it out. When you read that, his posts are super long but they're extremely easy to read. And the other day

48:47

I was drafting a blog post and I was thinking about this, and I'm like, okay, I can make it super succinct and cut it and make it crisp; on the other hand, I can just write as if I was talking. And I like the latter, in terms of when I read stuff in general. I just tend to fully read those blog posts, as opposed to the ones which are very formulaic, because it's just easy to read that. So when you're trying to.

49:15

Yeah, yeah. Do you ever look up recipes online? Recipes online? Yeah, all the time. Do you read their, like, "I was going out for a walk with my dog"? No, no, no, just the bullet points, like step one, step two, step three. Yeah.

49:34

Okay, makes sense. What you're saying is it depends on the content of the blog, or the book in this case, and how much you vibe with it, right? There are actually some recipe writers where I enjoy the essay enough, I'm like, yeah, I can enjoy that, because it's a great piece of writing, right? Like, just an interesting essay about, like, whatever, what the smell of the soup evokes. But when it's not engaging and somebody's just, like, trying,

49:58

like, you're just padding the page so you can shove more ads in front of me, then yeah, just give me the ingredients, right? So I think it really depends on the writer and the audience. I came for the Portuguese egg tarts and stayed for the writing. Oh no no, where I'm trying to go with this is: is it you who defines the tone of what the content should be, or is it up to the editor to push you in a direction and you abide by the voice that they want you to have in the book?

50:33

I think the publisher reserves the right to not publish your book if they think it's bad. Okay. But you don't have to take the edits; ultimately you're the writer, the editor is making suggestions. Interesting. Yeah, so like my personal blog, and Chris on his blog, have much more of a sort of individual voice maybe, but yeah, the book was a little bit more sort of clear, succinct.

50:59

The more you insert asides and stories and other things of that nature, the more you sort of stand the chance of losing the reader. Yeah, that's fair. In this case, for folks who might noodle with the idea of writing a book, would you recommend it? I'd say know why you're doing it, because it'll take longer than you think, it'll be more work than you think. And it can be highly rewarding. You know, I hear from people who've read the book and I get such nice feedback, you know, it's,

51:34

it is very rewarding to know that it's out there and that it actually helps some people. Definitely don't do it for, like, money or fame. You know, but if you're driven to, sort of, like, I know this book needs to exist and I don't know who else will write it, so I might as well do it, right, and you think you can actually go through with it, in the end, like any large project that you finish and are proud of, right, it has its own reward. But that's how I view it, right? I got some money out of it.

52:09

It paid for that stupid Microsoft Office Live license. It was more than that, it was more than that, but it's definitely not, you know, I could earn way more money with that time doing, like, I don't know, expert interviews or something. In terms of roughly the number of hours, I know it's super hard to quantify this, how long do you think this took? I don't know, it was so spread out over the pandemic year and I didn't keep track, but easily in the hundreds, like low hundreds, maybe 200.

52:44

I think Chris, Chris Riccomini, my co-author, published a blog post about writing a technical book when it came out, and he might have an estimate for how much he spent. Oh true, true, cool, people could Google it. Yeah, and we'll find that and link it in the show notes. Yeah, go ahead, Guang. Do you think you would have done it if it was just you?

53:07

I'm asking because for me, for this podcast for example, Ronak has been huge in terms of kicking my ass, being like, "Yo, Guang, where's this thing you promised to do two weeks ago?" Yeah, you know, I'm just kidding, just kidding. Yeah, but did that kind of play into it as a factor in terms of, like, getting the thing actually out the door?

53:27

I think definitely having a commitment to somebody else and knowing that they depend on me to do my part definitely helped me actually finish it. I probably wouldn't have finished it if it was just me being like, I could sleep in extra or I could get up and write, right? Like, I'll go for sleep. I'm also very grateful for product thank you thank you. Same same one these kids gone has been thinking every day or things I need to do. Anyway, cool, yeah, so maybe doing a bit of a pivot in the last 15 minutes.

54:00

But yeah, let's talk about acquisitions. So this is something that you mentioned before, you've been on, like, sort of both sides of it. And the reason why I'm very curious is that recently I had a friend whose sister kind of went through this process: they worked at a startup for like a few years and then they got bought out by a big company.

54:20

And it seemed pretty interesting in terms of, like, how you think about how much equity is worth, right? Like in the interview process, you know, if the company is really good they maybe give you some projections, right, in terms of, like, oh yeah, if we exit for this much money, this is how much your stocks are worth.

54:37

Yeah, like, just very curious in terms of how to think about acquisitions, I guess from, like, an engineer's perspective, and also, like, how do you integrate with a company, and all those sort of things.

54:47

Yeah, I think I don't have a lot of insight about the financial part of it. You know, stock options are not real money until they are, and all those projections about, like, if we exit for this versus that, this is how much you get, don't factor in dilution, or if they do it gets very complicated. And there are calculators online, like, you can play with those, but it's monopoly money, you know.
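
Editor's sketch: the dilution point above is simple arithmetic, so here is a minimal illustration in Python. The function name `projected_payout` and every number in it are hypothetical, not from the conversation. It assumes each later funding round sells a fixed fraction of the post-round company to new investors, and it deliberately ignores strike price, liquidation preferences, vesting, and taxes, all of which make a real payout lower and harder to predict.

```python
def projected_payout(shares_owned, total_shares, exit_value, dilution_per_round=()):
    """Rough payout estimate for a stake after later funding rounds.

    dilution_per_round holds, for each later round, the fraction of the
    post-round company sold to new investors (0.2 means 20%).
    Simplification (assumption): no strike price, preferences, or taxes.
    """
    ownership = shares_owned / total_shares
    for fraction_sold in dilution_per_round:
        if not 0.0 <= fraction_sold < 1.0:
            raise ValueError("each dilution fraction must be in [0, 1)")
        # Existing holders keep (1 - fraction_sold) of what they had.
        ownership *= 1.0 - fraction_sold
    return ownership * exit_value


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 shares out of 10 million, $100M exit.
    naive = projected_payout(10_000, 10_000_000, 100_000_000)
    diluted = projected_payout(10_000, 10_000_000, 100_000_000, [0.2, 0.2])
    print(f"naive projection: {naive:,.0f}")
    print(f"after two 20% rounds: {diluted:,.0f}")
```

Even this toy version shows why a projection quoted at interview time overstates the outcome: every later round multiplies the stake by something less than one.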

55:14

But in terms of integration, yeah, so I was kind of on both sides of this. I've been at Twitter when we acquired companies and we integrated them into our teams, also at Zymergen and at Ginkgo Bioworks. At Ginkgo I led the acquisition of a couple of things, or the technical due diligence; actually at all three of them I did the technical due diligence. And

55:35

the biotech company that I worked for, Zymergen, where I was the CTO and the pn, got acquired by Ginkgo, and so my job was just to make sure that the software team gets properly integrated into a new environment. And I think there are several different ways that this can work, and it's important during the acquisition process for all parties to sort of agree on what they're doing, right? If it's an

56:02

acqui-hire, or sort of intended to be a growth of the acquirer's engineering team, it's one thing. If it's, you know, the Facebook acquiring Instagram or WhatsApp kind of situation, where you're acquiring a product and the team and it's going to be standalone, it's very different, right?

56:24

You definitely don't want, so, I think in the second case you want to keep the team intact. You want to have a conversation upfront about what does and doesn't need to change and on what schedule, in terms of, like, transitioning onto the parent company's infrastructure, in terms of adopting their standards versus not adopting their standards, and so on. Most of the acquisitions I've been on either side of have been more of the first nature, where you

56:55

integrate the teams, and the danger there is that the integration doesn't go, and, like, for years afterwards people are like, well, you know, these are company A engineers, or that's a company A practice, you know, that's why things look the way they do.

57:10

And you want to be very clear about sort of how and in what order you're going to integrate things, or, if there are competing solutions, which one's taking over and when, and kind of work aggressively to merge. One thing I did very intentionally in the Ginkgo-Zymergen acquisition was work with my counterpart on the acquirer's side to, as much as possible, have teams that are.

57:36

Because the teams, the software engineering teams, were actually roughly similar. I think we added like 30% of the engineering team, maybe even more. So, shuffling the teams so that there are acquired engineers on multiple teams within the acquirer's department. And that does two things. One, it sort of prevents that us-and-them kind of mentality, right? Because, like, your team is your team now, and that helps get over that very.

58:05

Mentally. But two, you also have these kind of tendrils in multiple teams, right? So it becomes that much easier to find out how things work when you're like, what's going on with, whatever, the core services team? Well, I know somebody there, like, it's baked in, right, because I worked with her, you know, in my previous company. And so whatever good stuff kind of comes with your culture can sort of be introduced to multiple teams at once, and you know what I'm going to do is.

58:34

And the communication flows a little bit better, because so much of communication in a larger company flows through not, like, organizational lines but who you know, and a team coming in as an acquired team doesn't know anybody, right, so they're fundamentally disadvantaged. And I mean, this isn't

58:54

just like a competition of us versus them, but it's like you're joining a new company, except you didn't even interview, right? You have no idea what they're doing or why. Having this kind of support network that's spread throughout lets the organic network work.

59:10

So I think that's pretty important. And yeah, remembering that the acquirer is the acquirer, and there's a reason they acquired you. If they were hoping to modernize their architecture and you come in and the teams are like, "but our stuff works," well, with respect,

59:29

the whole point was that we're going to change some things. So let's talk about which things make sense to change and which things don't make sense to change, and maybe not change all things at once, but we're going to modernize the architecture, because that's the whole point of us being acquired. And, like, you can't come in and be like, you all are idiots, you're doing everything wrong, we're going to do things our way, you know, like, how could you possibly have made these choices? It's a legacy system; of course there are many things that are broken.

59:58

If you looked at your system with fresh eyes you would also see a bunch of things that are broken, right? Like, chill out a little bit. Cut out the "well, at Ginkgo we," "at Google we," "at whatever we," right? Just try to find what to appreciate in the new environment, and remember that there's a reason they brought you in and there's value in what you're bringing too.

01:00:20

So in a way, as you're going through the acquisition process, it sounds like, and I might do a poor job of describing it, there are two goals you're trying to fulfill. One is, from a technology stack standpoint, a successful merging of teams means you don't have two independent stacks, but you have one stack where aspects of the new stack are kind of embedded into the existing one, slowly evolving the parts that you would still keep, but towards the goal of reaching one stack that looks the same.

01:00:48

And the second aspect is the team, where the teams don't end up being two independent teams but rather one team, and people are bringing some different ideas to the table from the acquisition side, but at the end of the day you have all of the new folks spread across the entire company to kind of spread that connectivity, connective tissue in a way.

01:01:08

Yeah, yeah, I think that's right. I think you captured it better than I did, actually. Yeah, you got it. And also, especially when it's a small startup being acquired by a much larger company, knowing that there are opportunities to do something else. Something you get at a larger company that you don't get at a small company is an opportunity to sort of get a different job without changing your job.

01:01:33

So if you've been doing, whatever, data infrastructure for a couple years and now you want to do, I don't know, front end, right, there's a way to transition. You can just make the connection and transition, and, like, they might have the support systems and all of those things. Or just being like, you know, I've been on, I don't know, support systems and I want to go into ads, right? When companies get acquired by the, like, the company.

01:02:02

Yeah, when companies get acquired by the likes of Google or Salesforce or Meta, right, there are just so many different things that engineers can do, right? It opens up opportunities internally.

01:02:12

And a lot of the time, I know from the time we were at Twitter, and Twitter was pretty acquisition-happy for a while and acquired small teams that were really good, some of the best acquisitions wound up being, you know, it's four people that a year or two later all work on different teams, but it's like, oh yeah, they came

01:02:31

from that startup, like, they're all really good, you know, and their influence is felt kind of throughout the organization. They're not necessarily like a small little team that you can deploy to different places to fix things, right, it's just that they wind up having influence all over the place. Yeah. And one thing you mentioned from the acquirer side, like doing the technical due diligence, which is something you were involved in once you joined Ginkgo.

01:02:54

So what does technical due diligence look like from the acquirer's standpoint? It really depends on the context, right? You might go through the tech stack, understand it, start developing a vision of how it would integrate with your tech stack. Depending on the nature of the acquisition, you might get to interview people, or you might just get to talk to sort of the

01:03:16

CTO or sort of like the engineering leads, but they don't want to tip their hand and they don't want rumors to start, so they don't expose you to the team. So it can get a little bit nuanced. So maybe you'll see some design docs and things like that. There's usually something called a clean room: once there's a fairly strong understanding that this might actually happen, the company that's being acquired starts sharing a bunch of information about their financials, about their

01:03:45

cap table, about their product, about their customers, you know, like all that, and it goes into a sort of third party where you can look at the documents, but then if the deal falls through, your access gets revoked, right? So as an engineering leader you might get access to that sort of thing and look through it and ask additional engineering questions and so on. As an IC you might be invited into sort of an

01:04:13

architecture discussion or presentation, if things are a little bit more open. Yeah. And so they're like, we really want to understand, whatever, how the streaming platform works, so let's get our streaming platform person. They're the only person who understands any of this; we need them in the room.

01:04:31

And in this case, like, many times the actual system that's running is rarely the same as what you have in the design docs, partly because in design docs you start somewhere, you start implementing, and the system evolves, it's slightly a living organism of sorts.

01:04:49

When you're doing the due diligence, you want to make sure you understand not just the good parts of the system but also the limitations. And it's not that the person on the other side is trying to hide something, but they might not see the limitations the same way you do, so they might not be as forthcoming with some information as you might expect them to be.

01:05:07

So at the end of the due diligence, like, what does the goal look like? At what point do you feel like, okay, I'm satisfied enough that this looks okay? I think you're going into it assuming that things don't work 100% of the time or 100% perfectly, and you're looking for deal-breakers, right? If you're doing technical due diligence, unless it's specifically like, we're acquiring the magical technology, that it's going to be magical, and if it's not magical the deal's not worth it, right?

01:05:36

So you're usually acquiring for some other reason. You're probably acquiring for a combination of: there's some code and really good talent, and, you know, it positions us well for whatever, like, strategic reasons, right? So if you're at the point where you do technical due diligence, you're looking for deal killers, not for, like, "I wouldn't have done it quite that way," right? Like, that's not a deal killer. That just adds to your integration estimate.

01:06:00

Makes sense I think that's yeah that kind of puts perspective I mean puts things in perspective for me at least that makes sense. Yeah an important thing there is not so much yeah to like look for problems but look for what will it actually look like to integrate or incorporate this into what we're doing right like you know how are they doing a

01:06:26

integration authentication, is that going to be an easy lift and shift, or are we going to, like, have to rethink the whole thing, right? Some subtle gotchas that are kind of deep in the weeds but can really change the timeline for, like, getting a thing to actually work, and maybe, like, adjust the strategy, right? Like, okay, we

01:06:45

should let this run separately for a while because, you know, they run on GCP, we're on AWS, that's a massive migration. Maybe we should just never do it, right? Like, it'd be easier to rewrite it, just write a new one on our system or whatever. I think those kinds of things. Like, you're working out a high-level initial plan of, like, what's likely to be, and hopefully talking to your counterpart there so that you're all on the same page about what technically

01:07:10

needs to happen. You're not getting down into, like, you know, how they deal with transactionality, or, "Yeah, you use Memcached, I would use Redis." That's not the conversation you're going to have. It's like, they're using Redis, we're using Memcached, maybe you talk about that, like, do we run it for a while? Like, a little bit, sure. Yeah, I see. You know, so in a way, technical due diligence is kind of a data point for, I would say, the CEO and others who are the decision makers in this process to figure out at what point

01:07:40

this thing would be operational as part of an integrated stack, and how much it would cost us to make that happen, and if there are any deal breakers in the process. Yeah, that sounds right. Well, Dmitriy, sorry, we're running late on time, I didn't pay attention. But this has been another awesome conversation with you, and we wanted to talk about what you're doing

01:08:02

next, but maybe we'll talk about that some other time. And thank you so much for doing this, and thank you so much for sharing all the stories. All right, my pleasure. Thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello@softwaremisadventures.com. We would love to hear from you. Until next time, take care.

This transcript was generated by Metacast using AI and may contain inaccuracies.