
#131 - Data Essentials in Software Architecture - Pramod Sadalage

May 01, 2023 · 1 hr · Ep. 131

Episode description

“The notion of transaction, consistency, and ACID compliance are many times tech imposed. It should be the business that makes the decision. We as technologists should not make that decision."

Pramod Sadalage is a Director at ThoughtWorks and the co-author of the Jolt Award winning “Refactoring Databases”. In this episode, we discussed data essentials in software architecture. Pramod started by explaining why dealing with data is hard in software architecture and some data-related concerns we should think about when making architecture decisions. He then shared the thought process of how we can choose the right database for our purpose and shared insights on data modeling differences between SQL and NoSQL. Pramod also touched on the important considerations in managing transactions and the trade-offs between ACID and eventual consistency. Towards the end, Pramod shared practical, step-by-step advice on how we can split a monolithic database through database refactoring.

Listen out for:

  • Career Journey - [00:04:23]
  • Data is Hard - [00:15:57]
  • Data Related Architecture Concerns - [00:18:36]
  • Choosing the Right Database - [00:24:19]
  • Data Modeling in SQL vs NoSQL - [00:30:28]
  • Managing Transactions - [00:37:31]
  • Tradeoff Between ACID & Eventual Consistency - [00:44:06]
  • Refactoring Database - [00:46:58]
  • 3 Tech Lead Wisdom - [00:54:58]

_____

Pramod Sadalage’s Bio
Pramod Sadalage is a Director at ThoughtWorks, where he enjoys the rare role of bridging the divide between database professionals and application developers. In the early 00’s he developed techniques to allow relational databases to be designed in an evolutionary manner based on version-controlled schema migrations. He is co-author of Software Architecture: The Hard Parts: Modern Trade-Off Analyses for Distributed Architectures, co-author of Building Evolutionary Architectures: Automated Software Governance, co-author of Refactoring Databases: Evolutionary Database Design, co-author of NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, and author of Recipes for Continuous Database Integration. He continues to speak and write about the insights he and his clients learn.


_____

Our Sponsors

Are you looking for a new cool swag? Tech Lead Journal now offers you some swags that you can purchase online. These swags are printed on-demand based on your preference, and will be delivered safely to you all over the world where shipping is available. Check out all the cool swags available by visiting techleadjournal.dev/shop. And don't forget to brag yourself once you receive any of those swags.


Like this episode?

Show notes & transcript: techleadjournal.dev/episodes/131 Follow @techleadjournal on LinkedIn, Twitter, and Instagram. Buy me a coffee or become a patron.

Transcript

One thing I've realized in the past, probably 15 years, is that this whole notion of transactions and consistency and ACID compliance and all that stuff is, many a time, tech imposed. All the business wants is "don't lose my transaction". They don't really care if it's ACID compliant or all that, right? And we as technologists should not make those choices. It's the business that should make that decision.

Hey everyone. My name is Henry Suryawan, and you're listening to the Tech Lead Journal podcast, the show where I'll be bringing you the greatest technical leaders, practitioners, and thought leaders in the industry to discuss their journeys, ideas, and practices that we can all learn and apply to build a high-performing technical team and to make an impact in our personal work. So let's dive into our journal.

Hey again everyone, welcome to the Tech Lead Journal podcast, the podcast where you can learn about technical leadership and excellence from my conversations with great thought leaders in the tech industry. If this is your first time listening to this show, subscribe and follow the show on your podcast app and on social media on LinkedIn, Twitter, and Instagram. And for those of you longtime listeners who want to appreciate and support my work, subscribe as a patron at techleadjournal.dev/patron or buy me a coffee at techleadjournal.dev/tip.

My guest for today's episode is Pramod Sadalage. Pramod is a Director at ThoughtWorks and the co-author of several data and architecture books, such as "Software Architecture: The Hard Parts" and the Jolt Award-winning "Refactoring Databases". In this episode, we discussed data essentials in software architecture. Pramod started by explaining why dealing with data is hard in software architecture, and some data-related concerns we should think about when making architecture decisions. He then shared the thought process of how we can choose the right database for our purpose and shared insights on data modeling differences between SQL and NoSQL.

Pramod also touched on the important considerations in managing transactions, and the trade-offs between ACID and eventual consistency. Towards the end, Pramod shared practical, step-by-step advice on how we can split a monolithic database through database refactoring.

This is such a great discussion with Pramod on all things data and databases, and I hope you enjoy it as much as I do. And if you do, I would appreciate it if you can help share this episode with your colleagues, friends, and communities, so that more people can also learn from this episode. Please also leave a five-star rating and review on Apple Podcasts and Spotify, which I will highly appreciate. Let's go to my conversation with Pramod after a few words from our sponsors.

Are you looking for a new cool swag? Tech Lead Journal now offers you some swags that you can purchase online. These swags are printed on demand based on your preference, and will be delivered safely to you all over the world where shipping is available. Check out all the cool swags available by visiting techleadjournal.dev/shop, and don't forget to brag yourself once you receive any of those swags.

Hey everyone. Welcome back to another new episode of the Tech Lead Journal podcast. Today I have with me a consultant at ThoughtWorks. He has been there for 25 years, which is pretty long. His name is Pramod Sadalage, and he is someone who is very experienced in dealing with data, databases, and the intersection between data and applications. He's also the well-known author of the book "Refactoring Databases". Even though it was written a long time ago, I think it's still applicable and relevant these days whenever you want to deal with an RDBMS and how to make changes to and evolve your schema. He is also the co-author of "NoSQL Distilled", "Recipes for Continuous Database Integration", "Software Architecture: The Hard Parts", and the recent book "Building Evolutionary Architectures", which is in its second edition. So Pramod, welcome to the show. Really looking forward to this conversation, to talk a lot about databases and data.

Yeah, thank you. Data and databases have been my passion for almost 30 years now, so I'm happy to talk about them.

Nice. So, Pramod, in the beginning, I always love to ask my guests to share a little bit more about themselves. First, maybe you can share your career journey or any highlights and turning points that you feel are worth sharing with the audience here.

My journey started back when I was in 7th grade. Our school had these two computers for the whole school, and you could go play with them. So I did a little bit of BASIC, games, and computers all through high school. I got into computer engineering in college and finished that. During that time there was an amazing professor who really inspired me in how to think about operating systems and distributed computing. So I did a project on this theme: what is the most efficient way to distribute work on a cluster of computers? It was really interesting. After that came job hunting and so on, and I ultimately ended up at ThoughtWorks around '99, after four years in the middle at different places.

And we were doing this amazing project called Atlas, where we had an existing database that was being worked on, a big-upfront-design kind of database that was already designed, with 600-plus tables. And then this whole notion of agile and XP was catching fire. Martin Fowler and others were giving talks and coaching sessions over a period of time that were so mind-changing that you would think, why was I doing upfront design in the first place? Why did I even do that, right?

Like why did I even do that? Right? So we had this first iteration planning and we said, oh, we'll do this much first, happy whatever. And I went back to my test and build it on the people's that's posting a day and that's a scary thought for A person, but I did and every iteration as planned, new stuff, you do new

functionality, you back things. You've got things to the table and stuff are more stable, snow bunch of that constant change and there are about 30 or so developers in that team plus plus terms plus b, a is plus a bunch of other people and I was the only divorce. So anything that came to the you do this, I literally walked into their space discuss with them the chain. With them. And then I came up with this scheme of pregnancy, Henri meets

a change. How do the rest of the 30 people that teach because by that time I had come up with this concept of every question, working on the code base should have their own schema they shouldn't be like shit excuse, right? Because early in the beginning and I noticed that if they're sharing schemas and I make a change for Henry John here, mix affected because the see my history not quite okay. So I said, okay, let's play everyone. Schemas I came up with like a bunch of scripts.

They could create their own schemas, they could populate that schema, and all kinds of things. I wouldn't call it DevOps, but it was a bunch of batch scripts on Windows machines that you could run, like "create my schema", and it would create a schema and run.

Once that happened, we came up with this concept: whenever someone wanted to modify a table or split one, I would pair with them and come up with the change. At some point, this was all going well, but now we were trying to figure out, okay, how do I carry all these changes forward to other environments, what was last applied in a particular one, and what are the diffs between the environments? That's where the whole concept of version-controlled migrations came about.

We didn't really have a name for the thing. All these changes, I just kept track of each one, one, two, three, myself, and I kept applying them. I knew what had been applied because I was tracking metadata in those schemas: which schema was at what level, and a bunch of that kind of stuff. Ultimately, we figured out a progression of changes. And when you are walking a new path, you encounter a bunch of problems as you go, right?

And then you're just trying to find solutions to those problems. One problem was: how do I know which change goes with which build? So we started publishing what the last change available was when the build was made. We took that and put it as a tag in the build itself, in the published package. So whoever unpacked the build knew what change was needed for it to work.
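The version-controlled migration idea described above can be illustrated with a small sketch. This is not the tooling used on the project; it is a minimal Python example using the standard `sqlite3` module, and the table name `schema_changes`, the `MIGRATIONS` list, and the `build_requires` tag are all hypothetical names chosen for illustration. It shows numbered changes, metadata in the schema recording which changes have been applied, and a build that declares the last change it needs.

```python
import sqlite3

# Hypothetical, ordered list of schema changes. In practice these would
# live as numbered files under version control next to the application code.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
    (3, "CREATE TABLE account (id INTEGER PRIMARY KEY, customer_id INTEGER)"),
]


def applied_versions(conn):
    """Return the change numbers already recorded in this schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_changes (version INTEGER PRIMARY KEY)"
    )
    return {row[0] for row in conn.execute("SELECT version FROM schema_changes")}


def migrate(conn, migrations, target=None):
    """Apply, in order, every change not yet recorded, up to `target`."""
    done = applied_versions(conn)
    for version, sql in sorted(migrations):
        if target is not None and version > target:
            break
        if version in done:
            continue  # already applied to this schema
        conn.execute(sql)
        conn.execute("INSERT INTO schema_changes (version) VALUES (?)", (version,))
        conn.commit()
    return max(applied_versions(conn), default=0)


# Each developer (or environment) gets its own database; here, in memory.
conn = sqlite3.connect(":memory:")

# The build is tagged with the last change it needs (an assumed value).
build_requires = 2
current = migrate(conn, MIGRATIONS, target=build_requires)
```

Running `migrate` again against the same database applies only the changes that are still missing, which is what lets every schema converge on the level a given build expects.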

So, small stuff like that. If you really think about it, each of them alone was inconsequential, but the collection of them was really useful. At some point, Martin looked at the whole scheme I had set up and said, this is really interesting, you should speak about this, talk about this, let other people know about it. So he took me to the Madison Java User Group, and there were like 600 people in the big auditorium there. Everyone had come to see Martin, not me. But Martin put me in front of that many people for the first time, and I was shaking, you know. But I did give that talk, and I sensed that it was a good way to share knowledge: when you learn something, it's better to share it instead of keeping it to yourself. So I started sharing, and from there on gave lots of other talks at different places. And I refined the ideas based on questions. In your project, there's a certain context, and once you've solved everything in that context, there are no more new problems to solve. When you go out, give conference talks, and generally speak to people, you hear different contexts, right? And then you want to learn how to solve those, like branching and merging. We didn't do branching, because at that time we always did trunk-based development, but some people did.

And, like, how do you manage schemas when you're branching? Really interesting stuff like that. Ultimately, I think around 2004 or so, Martin came up with this idea: whatever I was doing, similar stuff was also being done by Scott Ambler. So he basically put us both together, and we co-wrote the book "Refactoring Databases: Evolutionary Database Design". We took all the changes that I was doing on projects, like moving columns from table to table, codified them, and came up with a structure just like the one in Martin's "Refactoring" book. Say you tell anyone "Extract Method". If someone says that, you know exactly what it means: one name stands for a bunch of things that would happen. So, something similar to that.

By that time, IntelliJ had shortcut keys for Extract Method and a bunch of that kind of stuff, and we were inspired by that. So we came up with names for over 70 or so patterns that we put in the book. We talked a lot about the book, and later on I wrote about continuous database integration. Around 2008, this whole NoSQL movement was really firing up, with lots of stuff going on. Papers had come out, and lots of people were doing different things. MongoDB, I think 0.5 or 0.8, had come out, so I did a project with MongoDB. We also did another project, and then I started thinking about the way you design objects in object-oriented languages and how you store them in relational databases: there's a translation that needs to happen. It's not the same with NoSQL databases, right? How does object-oriented stuff translate to document stores? How does it translate to key-value stores? How does it translate to column stores, or to graph stores?

So that gave me the idea that we should probably talk about how this modeling works and why you choose one over the other. I like to talk about this with the analogy of an automatic car: you put it in drive and you just drive. You don't worry about what's underneath, right? You don't worry about first gear, second gear, this and that. That's what relational databases were. They picked a bunch of that kind of stuff for you, and you didn't have to worry about anything else. By default, you got those things, and developers didn't have to worry about other choices because there was no choice. But once you go to NoSQL databases, there's a lot of choice.

So you have to decide which choice you want, because you can't just go in blindly saying, I don't care what the choice is. The availability of choices creates lots of options, and at the same time it also creates lots of pitfalls: what if you make the wrong choice? That's why we thought about writing the book "NoSQL Distilled". It is a broad survey of what types of NoSQL databases there are, how to pick them, what consistency is, the CAP theorem, distribution, MapReduce, a bunch of that kind of stuff, all together. And how we should disabuse people of ideas like, oh, NoSQL is schemaless, so I don't have to worry about schema. That's not true. The schema is there. It's in the code, not in the database, but it is still there in your code, and you have to keep track of it. And how it shouldn't have to be, like: is it SQL, or is it NoSQL?

It's not that. Within the same enterprise, or within the same system, you could have more than one. That's where the word "polyglot" came about: you could have more than one type of storage, and we give a couple of examples. So that was pretty good. And then later on, I wrote something with Neal, "Software Architecture: The Hard Parts". It's about how you pick.
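The earlier point that a "schemaless" store still has a schema, only in code rather than in the database, can be made concrete with a short sketch. This is an illustrative Python example, not taken from the book; the `Customer` class, its attributes, and `parse_customer` are hypothetical names, and the JSON string stands in for a document read from any document store.

```python
import json
from dataclasses import dataclass


@dataclass
class Customer:
    """The implicit schema of a stored document, expressed in application code."""
    id: str
    name: str
    email: str = ""  # optional attribute; older documents may not have it


def parse_customer(raw: str) -> Customer:
    """Turn a stored JSON document into a Customer, rejecting unexpected shapes."""
    doc = json.loads(raw)
    missing = [key for key in ("id", "name") if key not in doc]
    if missing:
        raise ValueError(f"document is missing required attributes: {missing}")
    return Customer(id=str(doc["id"]), name=doc["name"], email=doc.get("email", ""))


# A document written before the `email` attribute existed still parses.
ok = parse_customer('{"id": "c-1", "name": "Ada"}')
```

The database will accept any document shape, so the required attributes, defaults, and handling of older documents all have to be tracked in code like this.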

Because I personally think many times the data people are excluded from architecture discussions. It may be by choice, or it may be an attitude like, once we decide the architecture, we can just store the data somewhere. I think that's a mistake. You should be including data architects, people who deal with data, in the architectural decisions, because in the long term, where data is stored and how it's stored affects a lot of different things. Ultimately, what the business really wants is not your code but data, right? Because the business runs on data. Data is what is stored and persists; applications come and go. So how do you make sure your data is available, findable, interoperable? How can principles like these be applied? And all that kind of stuff.

So the inclusion of data architects in this whole enterprise architecture and architectural decision-making should be encouraged, and even data architects should be saying, I want to be there in that discussion. Yeah, so that's my journey. I've been at ThoughtWorks for almost 25 years and am still enjoying it. And right now, I'm doing a bunch of stuff around data mesh type architectures and implementations of that idea.

Well, thanks for sharing. Your story is really interesting: how it all started, right, with refactoring databases from project experience, and then having to share it in front of such a large audience. I think that would be pretty scary for me as well in the beginning. And since then, you have dealt with a lot of data challenges and written a few books along the way. I had Neal on a couple of episodes ago, talking about "Software Architecture: The Hard Parts".

And with you here in this episode as well, I want to continue our conversation about architecture, but dealing with data. As we all know, dealing with data is very difficult. In fact, many people think it's the most difficult thing, because your applications come and go, like you said, but the data always stays. So first, can you share with the audience how much that has changed these days, with all the advancement of technology and the different types of databases? Is data still the hard part of architecture?

Data is still hard, because a bunch of data is locked inside commercial systems that are bought off the shelf. A bunch of stuff just sits there. How we get it out so that it's useful to other parts of the organization is very important. Many a time, data is also stuck behind some kind of walls, I would say: oh, it's available in that database only and you can't get to it, or it's available there but you don't have the right keys, you don't have the right way to get it out, and so on. So more than the desire, I think the path to get to the data is difficult, and people are working on that. I think that's why data mesh has become so popular: it gives you this domain-oriented way of getting at data, and you can deal with data piecemeal, instead of, oh, I need to build this whole data warehouse before I get access to it. Now people are saying, okay, can we talk about it in smaller chunks? And time to market has reduced because of that, right? That, I think, is making life easier.

There's also this notion that nowadays all cloud providers give you some type of storage mechanism, all the way from file storage and block storage to data analytics based storage. I mean, there's a bunch of choice there. As people use cloud technologies, interoperability of that data has become a little bit easier, and so has having access to it. But we are still dealing with the problem that people put stuff all over the place, and then knowing what did I store, where was it stored, what's the definition of it, what's the meaning of that column or table or whatever, is still hard, right? So we need to be a little more disciplined and structured about how we store stuff. I mean, the basics of computer science are things like naming: name your tables right, name your columns right, all that kind of stuff. Even if you have JSON, name your JSON attributes right. That kind of stuff still needs to be followed religiously, so that someone who has not worked with you can understand what that structure is. Because again, we are so remote and so distributed that someone may be looking at the JSON that you publish, and they may not even know that you exist, right? So how do they get what the data means without spending a lot of cycles? That is the challenge, right?

And specifically, like just now when you were sharing in the beginning, right?

You invited people to include the data architects from the beginning, whenever people are talking or deciding about the architecture, so that they're involved. Specifically, in your experience, what would be some of the areas of concern for software architects and data architects to discuss whenever they want to make data-related decisions?

Yeah. So I think the first, probably higher-level, thing that you should be thinking about is: what are the forums to which data people should be invited? With the clients that I go to, I tell them that anywhere there is a software architect involved, you should have a data architect. Having said that, on the areas of concern: generally you want to give people enough freedom to do things and that kind of stuff. But at the same time, it shouldn't be such a blank canvas that they do whatever they want. A good example of this is a company that has standardized on a specific cloud, and someone says, hey, that other cloud gives me this feature, so I will use that. So you should have some guardrails saying, okay, you can use certain things within these parameters and we won't ask anything. But if you have to go outside of this, then you have to come and justify why. We are not saying no, but at least come and justify why. At some places, we have come up with a flow chart for picking the right database. You go through this decision-making tree, and then you arrive at something.
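A flow chart with guardrails of this kind could be sketched in code as follows. The episode does not spell out the actual questions in the flow chart, so the questions, their order, and the resulting database styles below are purely hypothetical; the only part taken from the discussion is the governance rule that a choice matching the tree needs no justification, while a choice outside it calls for a conversation.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Answers to the (hypothetical) questions asked by the decision tree."""
    needs_graph_traversal: bool = False
    lookup_by_single_key: bool = False
    stores_json_documents: bool = False
    analytical_wide_scans: bool = False


def recommend(w: Workload) -> str:
    """Walk the decision tree top to bottom and return a database style."""
    if w.needs_graph_traversal:
        return "graph"
    if w.lookup_by_single_key:
        return "key-value"
    if w.stores_json_documents:
        return "document"
    if w.analytical_wide_scans:
        return "wide-column"
    return "relational"  # assumed default when nothing else applies


def needs_review(w: Workload, chosen: str) -> bool:
    """Guardrail: only a choice that differs from the tree needs justifying."""
    return chosen != recommend(w)


team_workload = Workload(stores_json_documents=True)
```

Here `needs_review(team_workload, "document")` is false, so the team can simply start, whereas `needs_review(team_workload, "graph")` is true and would trigger the architecture conversation.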

And if you want to use that, fine, we won't ask anything; you just start using it. But if you arrived at that decision and you say, oh, that other thing over there is better than this for my use case, then you should have a conversation with someone before you pick it up. Architecture governance matters a lot in these kinds of scenarios, because for the solution's sake that may be great, but for maintenance and operations' sake that other thing over there is a hindrance from the operational side. So you have to have very good reasons for it. So that's one.

The other is that we sometimes tend to pick technologies or architecture styles based on our own biases and understanding. If I am a data type of person, my bias is to make sure all data is correct, that I have taken care of read loads versus write loads and all that kind of stuff, and then do the right kind of design. But somebody else coming from the other side may not be thinking about those things. They may be thinking about modularity; they may be thinking about coupling, or orchestration versus choreography, and a bunch of other things, all valid concerns. But when you don't pair that with the data side of the shop, you end up with, hey, this is all modular on the software side, it's all modular by design, but I am taking all of that and storing it as a blob somewhere that nobody else can read. And then you have defeated the purpose of modularity, because on the data side you're not modular. So stuff like that

matters. So that's why I say, whenever any kind of new stuff comes up, data people should be there. One other place where I think this pairing really helps is when product managers or business people are coming up with new ideas, and they say, let's at least try it out and see: okay, how much will this cost, is this feasible, and so on. They generally tend to talk to the architects and ask, oh, is this workable? Data architects should also be there, because they can tell you: where can I get the data from? Where is the data available? Do we have this kind of data available? Stuff like that. They also give you a way to answer the feasibility question in a complete sense. The other place to talk about, especially on the analytics side or the reporting side of things, is: how is data being given to you? What is the lineage of that data?

What kind of transformations are applied? What standards are being followed? What kind of quality rules? In application development, I think naming standards, quality standards, a bunch of that kind of stuff is where data people can really help. Because if the software has gone through development for ten iterations and then you say, oh, this quality rule, this standard has to be applied, nobody's going to listen to you. There's no time now, because the product manager or the product owner is going to say, I don't have time, I need to take this to market. And now you're carrying that tech debt forever. Instead, if early on you had talked to them, you would have gotten those naming standards, and you would have put all those things in place.

So you would not have that debt. Stuff like that, at the developer level, matters a lot over a period of time. The other thing is probably coaching and mentoring: there should be a constant flow of ideas from the software side to the data side and from the data side to the software side. Those could be like, what can I use for certain things, right? There may be some techniques I can use on the data side that scale this much more easily than scaling the app side like crazy, right? There may be some things where I can say, hey, if I package this data up and give it to you, then you can distribute it much more easily than I can on the data side. With that give-and-take, I think as a team you can come up with a better solution.

Thank you for sharing all these areas of concern. I have to say, for me personally as well, dealing with data-related tech debt is not fun. First, it's risky. And second is how we deal with the existing data, right? Sometimes we have to patch it, migrate it, and do all this stuff in order to comply with the new decisions, right? So I think it's never fun, and as you said, if you can deal with it earlier, that is always the preferred option. You mentioned earlier as well about picking the right database type. I remember when I read your book "NoSQL Distilled" a long time ago, there were probably only four types of databases, like document, graph, and key-value store. But these days there are so many. In fact, there are plenty of new database companies being formed, each with different characteristics. So maybe you can guide us, also for people here who are new to the many different types of databases: what would be some of the attributes or characteristics that we should think about whenever we want to pick the right database for our workload or our use case? Some guidance here would be great.

Sure. I think there are three or four things at a very high level that you need to think about, and as you go down from there, you will have more choices to worry about. The first is productivity. Right now, human time costs more than anything else. If you are working longer hours, that cost is much more than the cost of any technology.

So if something makes you more productive, for example, if you're writing JavaScript and Node.js and you say, I have JSON, I just need to store the JSON and retrieve the JSON, then a document type of database is a simple choice. So I think productivity is one thing that drives a lot of these decisions: oh, this thing makes me productive, I don't need to worry about other stuff, and the other decisions I'll optimize later. That's the primary decision.

The second is that sometimes there are use cases that need a lot of thought in picking the right database. What I mean by that is, let's say there is some use case where you have, oh, I need to store trillions of rows, and that won't fit on a single machine. So I need to find something that can distribute itself easily, shard data, or whatever. So let me find the right type for that. Or there may be some instances where you say, oh, I have this external identifier that I always have, and I can always get data based on that, and based on that you pick the style for that type of thing. So: what are my read patterns, and what are my write patterns? You need to analyze that a little bit and figure out what style of database works for that. The reason I'm saying this is that twenty years ago, we used to think a lot about disk space: how much disk space we were using, the cost of the disk space, and all that stuff. Today, the amount of time you spend thinking about that would pay for the disk, right? So the cost of disk space is no longer a question. It's more about how much can I write, how fast can I write, how fast can I read, how often do I need to read, and a bunch of those kinds of questions come into play, and then you pick the right kind of database for that. Some data you write only once but read many times. So then maybe you need something that is read performant, and I may want to write the same data three or four times in different styles, like different aggregations. That's fine, because I'm reading so many times and I want to make reads efficient. In other places, write efficiency may matter a lot, because data is coming at you so fast and you want to make sure you can write efficiently. So then you pick something that caters to writes.

And the third is: what is it that you want to get out of that database, right? For analytical purposes, are you doing lots of row aggregations, that kind of stuff? Then let me pick something that helps with that. Or are you doing graph traversal, or graph analytics, or relational analytics, or that kind of stuff, right? Then let me pick something that works for that.

Nowadays there are many databases coming up, but if you really look deeper, classically there are five types at a very broad level: relational databases, key-value stores, document databases, wide-column stores, and graph stores. Most of them fit one of these styles. Even blob storage is nothing but a key-value store: the file name is the key, and everything inside is the value, right? If you think about it that way, then there are five big types, and you can decide which type of database you want to look at. Within that type, there may be multiple choices available, and then you pick based on that, right? If you take an example of, oh, I need a graph database because I'm doing graph traversal, fine. Now you have the options of Neo4j, Neptune, Dgraph, TigerGraph, and a bunch of other choices. Now, how do you pick among them?

That choice is a question that always comes up, right? To do that, what I generally tend to do is set up a scoring matrix of what it is that I would pick. In the graph example, I can pick Neo4j, TigerGraph, Dgraph, and Neptune, and put them across like a header row. And I have metrics on which I want to measure them, because every company can have different measurements, right?

Some people may say familiarity is very important, and I have people who know Neo4j, so no discussion for me, I just pick that. Some people will say, oh, I don't want to leave the AWS landscape that we are in, so I'll just pick Neptune, and you're done. Some people may say open source matters to me, and distributed databases matter to me, that kind of stuff; my nodes are going to be really large, so I need a distributed graph database. So those people will say, okay, Dgraph.

That, I think, is what you need to set up for your own company or your own situation: come up with what you are going to measure the products on. So set up a rubric for yourself and then score these products. It's very easy. In fact, most of them you can try; you can run a PoC, you can do some research. Then set up that rubric and score them yourself.
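The scoring matrix described above can be sketched in a few lines of Python. This is only an illustration: the criteria, the weights, and the 1-to-5 scores are placeholder numbers made up for the example (they are not an evaluation of these products), and the function names are hypothetical.

```python
# Placeholder weights: how much each criterion matters to this (imaginary) team.
WEIGHTS = {
    "team_familiarity": 3,
    "fits_current_cloud": 2,
    "open_source": 1,
    "distributed": 2,
}

# Placeholder 1-5 scores per product; fill these in from your own trials and research.
SCORES = {
    "Neo4j": {"team_familiarity": 5, "fits_current_cloud": 3, "open_source": 3, "distributed": 3},
    "Neptune": {"team_familiarity": 2, "fits_current_cloud": 5, "open_source": 1, "distributed": 4},
    "Dgraph": {"team_familiarity": 2, "fits_current_cloud": 3, "open_source": 5, "distributed": 5},
    "TigerGraph": {"team_familiarity": 3, "fits_current_cloud": 3, "open_source": 2, "distributed": 4},
}


def weighted_total(scores, weights):
    """Sum of weight * score over every criterion in the rubric."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank(candidates, weights):
    """Return (product, total) pairs ordered from highest to lowest total."""
    totals = {name: weighted_total(scores, weights) for name, scores in candidates.items()}
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, total in rank(SCORES, WEIGHTS):
        print(f"{name}: {total}")
```

Changing the weights to reflect what your own company cares about (for example, raising `fits_current_cloud`) can change the ranking, which is the point of making the rubric explicit.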

And the answer will just pop out, and then you pick one based on that.

Thank you for the guideline. I think it's really interesting the way you first mentioned ease of use or productivity, based on the developers or the language stack that you are using. One question related to that: in my experience working on multiple projects and different types of products, one thing I've heard multiple times is that developers choose a certain database.

It works well in the beginning, but after a while it doesn't perform or it doesn't scale. A lot of learnings also point out that eventually many people actually revert back to RDBMS, so things seem to be coming back around again, right? With RDBMS, maybe either MySQL or Postgres, or even cloud databases, which these days also have RDBMS-compliant interfaces, although maybe the way they store or query is different.

So I'm interested to hear your thoughts about this coming back to SQL, coming back to RDBMS. Is it still the preferred default choice that people should choose? And since you talked about read/write patterns as well, maybe you can give rough ideas of when RDBMS actually does not scale, so that people need to start thinking about other types of databases.

So many a time, when people pick up new types of databases, they have picked a new database but their design thinking is still in the relational world. I have seen a lot of document databases where the collections have been designed as if they were tables, and then the complaint comes in: I can't join these collections. A document database is not supposed to join collections. You have to have an aggregate version of the collections, right?

That's where the domain-driven design way of thinking really helps. In the NoSQL Distilled book, we talk about what the boundary of your aggregate is. A good example of this would be, let's say, an e-commerce system. You have a customer that has multiple orders, and each order has order items, payment information, shipping information, and probably packaging information on it, like is it a gift or not a gift, that kind of stuff.

In a relational way of thinking, you have a customer table, you have an address table, you have an order table, you have an order item table, and you probably have payment information and a bunch of these tables. When you go to document databases, you can say, oh, I'll take that idea and just have multiple collections of these things: I have a customer collection, I have an address collection, and so on. And that's a mess, because when you put this back together, you have split the aggregate into really small parts.

And now you have to put it all back together, and the database was not built for putting all those things together. So where do you pick your aggregate boundary? One example would be: I have a customer aggregate. Inside the customer aggregate I have all the customer information, and I have a collection or a list of orders inside it. Each order in turn, in the JSON object if you think about it, will have the order header, the order items, the payment, everything all inside.

So when you go to the database and pull the customer out, everything comes with it. That's one option. Somebody may say, oh, what if a customer has thousands of orders? Then my objects will become too big; you're transporting all of that over the wire and sending it all back. Okay, now it's time to think about how to split this aggregate into two. So then you can say, oh, I want to have a customer aggregate and an orders aggregate.

So you have one object for customers that has all the customer information, probably their preferred shipping address, preferred packaging, whatever, and then you have one object for each order, and the order has a reference back to the customer. When you want to show all the orders for the customer, you go to the customer, pick up the customer, and then go to the orders list and say, give me all the orders that have this customer.
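The two aggregate boundaries described above can be sketched with plain Python dictionaries standing in for JSON documents. The field names, the sample data, and the `orders_for_customer` helper are all hypothetical; a real document database would answer the second lookup with a query on the reference field rather than an in-memory scan.

```python
# Option 1: a single customer aggregate with every order embedded inside it.
customer_embedded = {
    "_id": "c1",
    "name": "Ann",
    "orders": [
        {
            "order_id": "o1",
            "items": [{"sku": "book", "qty": 1}],
            "payment": {"method": "card"},
            "shipping": {"gift": False},
        },
    ],
}

# Option 2: the aggregate split in two. Customers hold customer information only;
# each order is its own document carrying a reference back to the customer.
customers = {
    "c1": {"_id": "c1", "name": "Ann", "preferred_shipping": "home"},
    "c2": {"_id": "c2", "name": "Bob", "preferred_shipping": "office"},
}
orders = [
    {"_id": "o1", "customer_id": "c1", "items": [{"sku": "book", "qty": 1}]},
    {"_id": "o2", "customer_id": "c1", "items": [{"sku": "pen", "qty": 3}]},
    {"_id": "o3", "customer_id": "c2", "items": [{"sku": "lamp", "qty": 1}]},
]


def orders_for_customer(all_orders, customer_id):
    """Return every order whose reference points at the given customer."""
    return [order for order in all_orders if order["customer_id"] == customer_id]


if __name__ == "__main__":
    # Embedded: one read brings the customer and all orders with it.
    print(len(customer_embedded["orders"]))
    # Split: read the customer, then ask for the orders that reference it.
    print([o["_id"] for o in orders_for_customer(orders, "c1")])
```

The embedded form costs one read but grows with every order; the split form keeps documents small at the price of a second lookup.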

So now you have split the aggregate, but not too much, just a little bit, so that you can manage this. You have to think about these kinds of things, because if you go back to the standard relational way of thinking, you split too much for these databases. So the first thing I generally recommend when people pick a new database style is to do a pet project.

Don't start with your actual project; do a pet project, play with it, and try to understand: how do I need to think about it? What do I need to worry about? If you need to get training, get training. If you need to talk to someone who's experienced in that database, talk to them and understand the design patterns that are different. Again, the classic example would be graph traversal. Someone will probably say, oh, graph traversal, I can do this in relational databases; I just put pointers and stuff.

After two or three levels deep, they figure out relational databases don't work, and then they go to graph databases. And then in graph databases, they start treating each node as if it were a table. It is not. It is a single instance of a row that is connected to something else, right?

So you have to think about these things a little bit differently, and then you realize the gains of using that technology. That is what I would recommend to people: figure out how the tool works, get some training, talk to someone who's experienced, play with it a little bit, solve a simple problem or something, and understand how this thing works.

One other thing I have also done applies when you know that, within this database type, you need to pick one technology. I was talking about the rubric before, right? Performance can also be a criterion in that rubric: does it handle a billion nodes, for example, or does it handle a billion documents? So put in a billion documents. Nowadays you can create fake data, put it in the database, and see how it works.

Setting up that experiment will take a little bit of time, but it will save you time in the future. So create a framework or a harness that puts all this data in there, and then run your query load. This is where knowing your query pattern makes a lot of sense. If you don't know how many writes and how many reads you have, you cannot set up a performance harness.

So once you've set up, say, a billion documents, and you know that 60% of your load is reads and 40% is writes, what does that mean? In one hour of production time, I'm going to do this many writes and this many reads. Okay, set that up, run it, and see what it does. Then you will figure out: okay, should I change my design pattern?
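As a rough sketch of the arithmetic behind such a harness, the Python below turns a read/write mix into hourly counts and a shuffled list of operations to replay. The 60/40 split comes from the conversation; the 500 operations per second, the function names, and the fixed random seed are assumptions made up for the example, and the actual database calls are left out.

```python
import random


def operations_per_hour(ops_per_second, read_fraction):
    """Translate a throughput figure and a read share into hourly read/write counts."""
    total = ops_per_second * 3600
    reads = round(total * read_fraction)
    return {"reads": reads, "writes": total - reads}


def build_schedule(total_ops, read_fraction, seed=0):
    """Build a shuffled list of 'read'/'write' operations matching the given mix."""
    reads = round(total_ops * read_fraction)
    schedule = ["read"] * reads + ["write"] * (total_ops - reads)
    random.Random(seed).shuffle(schedule)  # seeded so repeated runs replay the same order
    return schedule


if __name__ == "__main__":
    # Assumed figure: 500 operations per second with a 60% read share.
    print(operations_per_hour(500, 0.6))
    schedule = build_schedule(1000, 0.6)
    print(schedule.count("read"), schedule.count("write"))
```

A harness would walk the schedule and issue the matching query against the loaded data set while recording latencies.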

Should I change my product? Should I change my technology? Whatever it is, it tells you where you need to focus.

I think your sharing about the aggregate boundary is gold. I feel this will save a lot of developers' time and effort as well. I think it's very good advice that we have to look, maybe from the use case, the business use case and product requirements, at what kind of aggregate boundary we are dealing with, and not apply the old paradigms, the RDBMS style of data modeling, to different types of databases. Probably a majority of performance issues come from there, right? From the impedance mismatch between the data model and how you query it. So thanks for sharing that.

The other thing I want to ask about is that, whenever we have all these database types now, people tend to think about storing special types of data in different databases, or what we also call polyglot persistence, right? Especially now with microservices, maybe one service could have multiple databases. But it also comes with a hard challenge, which is about transactions. It's always very important to think about transactions, especially with polyglot persistence and multiple data types. How do you handle this in a distributed manner? So from your view, how should we think about managing all these transactions, maybe across multiple databases if they are involved? Is there any special thing that we have to think about?

Yeah. I think that is always a use-case-specific discussion that needs to be had with the business. One thing I've realized in the past probably 15

years is that this whole notion of transactions, consistency, ACID compliance, and all that stuff is many a time tech imposed. All the business wants is: don't lose my transaction. They don't really care if it's ACID compliant or this or that, right? So many times I tend to go back to this discussion of how much money you want to put in. Money is a proxy for time here.

Basically, how much money do you want to invest to make this ACID compliant, or 100% consistent, or whatever? The business should be driving that. They should be saying it's okay if it's eventually consistent. Okay, then I have a different solution for that. Or should someone say, oh, I need to make sure all stores are consistent all the time, that costs a lot and is a totally different solution.

And we as technologists cannot make that decision, or should not make that decision. It's the business that should make that decision, because they are the ones deciding how they want their data, how they want their systems, and how much they want to invest in that desire. So we should expose this to the business. That's the first thing I want to say.

The second thing is, let's say someone says, I need a highly consistent database or highly consistent data. Then, okay, what are the patterns with which we can implement that consistency? That comes into play. And within a microservice, if you have, let's say, more than one database, how do I ensure the consistency of the writes that are going on?

And in most distributed systems, even if you're writing to the same database, you will have consistency issues. Let's say you are writing to Cassandra, which is a distributed database, and the data may reach one node. You may say, oh, it's done, but then that node fails and you have lost it. So even when it's the same database, you still have to think about what consistency requirements you need to consider. Every write could have different consistency requirements.

If I'm writing just some kind of audit log, that will have somewhat different consistency requirements versus writing an order. So for every write you can decide what your consistency level is, and many of these databases, especially if you're writing to a single database, provide you different types of consistency guarantees. I think quorum is one, which says a majority of the nodes got it.

Or you can say, oh, just ack as long as one node got it, I'm fine; or as long as more than two nodes have it, I'm fine. In MongoDB, I think, for example, you can just say, oh, as long as the node received it, I'm fine; I don't even care if it has been committed to disk. So a bunch of these options are available, and you pick which option you want for each different type of write, even within the same application.
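The idea of choosing a consistency level per type of write can be sketched as a small lookup table plus the usual majority rule for a quorum. This is a standalone illustration, not any driver's API: the `Level` enum, the `WRITE_POLICY` table, and the function names are hypothetical, and the mapping of audit logs to `ONE` and orders to `QUORUM` is just one possible convention.

```python
from enum import Enum


class Level(Enum):
    ONE = "one"        # a single replica acknowledging is enough
    QUORUM = "quorum"  # a majority of replicas must acknowledge
    ALL = "all"        # every replica must acknowledge


def required_acks(level, replication_factor):
    """Number of replica acknowledgements needed before a write counts as done."""
    if level is Level.ONE:
        return 1
    if level is Level.QUORUM:
        return replication_factor // 2 + 1  # strict majority
    return replication_factor


# Assumed team convention: decided once, so nobody chooses per call site.
WRITE_POLICY = {
    "audit_log": Level.ONE,
    "order": Level.QUORUM,
}


def is_write_accepted(write_type, acks_received, replication_factor):
    """Check a write against the level the convention assigns to its type."""
    level = WRITE_POLICY[write_type]
    return acks_received >= required_acks(level, replication_factor)


if __name__ == "__main__":
    print(required_acks(Level.QUORUM, 3))
    print(is_write_accepted("audit_log", 1, 3), is_write_accepted("order", 1, 3))
```

With three replicas, one acknowledgement is enough for the audit log but not for the order, which needs two.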

So I think you make these kinds of decisions probably at a tech lead level or at an architecture level. You can say, oh, this type of write will always go with this type of consistency, that type of write will go with that type of consistency, and probably encode it, and don't make the developer think about it for every write. Just have some kind of convention about it and go.

The other part is when you're writing to two different databases; let's say one write goes to a relational database and the other goes to a graph database or something. Now you are in this situation: if I want to maintain consistency, like a hundred percent consistency, what are the patterns that are available? And implementing those patterns that you can use, ZooKeeper, a transaction coordinator, or something like two-phase commit, is a really hard problem, and you really need to think about whether you want it and what types of challenges you are willing to take on.

Many a time, what happens in a polyglot setup, taking relational and graph as an example, is that you are writing to relational for operational purposes and you're writing to graph for graph traversal, analytics, probably a recommendation engine, that kind

of purpose. Those things can be eventually consistent, and relational can be 100% consistent. So you can even make a decision there: what is the use case in which I am writing to multiple places? Someone places an order, you process the order, and the data eventually, within the next five or ten seconds, goes to the graph database, and some other person is on a web page.

And you want to show them a recommendation based on the graph. How wrong is it for the recommendation to not be based on this last order? That's a trade-off, right? If you say, no, no, it has to be a hundred percent, all the orders that are committed here, the recommendation has to be based on them, then you pick a transaction coordinator, you put all this in, and the architecture complexity increases because of that, right?

And that's a trade-off you have to take on: oh, I cannot even be five seconds behind, so I'm increasing my architecture complexity. That's the trade-off you're making. Is it okay? You should ask the business, is it okay? And if they say, yes, that's how I want it, then you increase the architecture complexity to meet that business need. You introduce a transaction coordinator so that whatever happens here also goes to that side. And of course, those come with their own trade-offs, introducing a transaction coordinator like ZooKeeper, but then you have made an informed choice about using it.

Thank you for sharing these guidelines. I think it's really key for people to hear: always involve the business whenever you make this type of decision. Maybe sometimes the business actually doesn't require real-time, 100% consistency within milliseconds, right? Sometimes it can be delayed and

they accept the delay. Knowing this is actually key whenever you create your solution, because then you can have different types of options. So I have one interesting thought. You said in the beginning that it's tech imposed, and talked about how much money or effort you associate with choosing this. But actually, many times this decision, like you said, being tech imposed, is made by developers, and from their point of view, choosing an ACID-compliant database is actually much, much faster and saves a lot of effort, because when you deal with eventual consistency, you need to write a lot more code and do a lot more testing.

So maybe a little bit of advice for developers here, since I'm sure we have plenty of developer listeners: how should we think about the effort here? Because storing into an ACID-compliant database is pretty well known. These days there are frameworks that make it very easy to do, versus having to do it in an eventually consistent manner and the effort that requires. So maybe some advice here for developers as well.

Yeah, certainly. Again, I like to think about everything as a trade-off. When you say ACID compliance makes my life easy, great, you traded off scaling for that. So when scale comes in, you start having these replicated databases.

With relational databases, setting up replication and making sure replication works properly, all of that is extra stuff you took on because you traded off scaling for ACID compliance. When that happens, you have unknowingly traded something for something else, and I am saying make that visible for yourself: okay, I picked ACID compliance. What did I give up? Once you start thinking in those terms, it makes a lot more sense.

Someone will say, oh, my database doesn't have to go beyond 50 gigabytes. Fine, that trade-off is perfect for you. But if someone is saying, oh, my business is used across the world, I will have millions and billions of transactions, rows, reads, writes, whatever, at that point we cannot just blindly say, oh, I pick ACID compliance, because that comes with some very tough choices. Make those visible to yourself.

That's why I was saying that pairing with someone else at this time makes sense, because you have your own biases. Everyone has them, even I have, so I generally say, oh, let's sit and work together, because they will expose my biases, which is good.

Well, I think that's a very interesting insight. Sometimes we developers tend to have our own biases, and we tend to think of the problem maybe from a limited perspective. So having other people go through the same thought process, maybe from different perspectives, maybe through a design review process or even pairing whenever you come up with this kind of decision, is always important. And like you said, in the end there is always a trade-off. Neal also said that, right? Architecture is all about trade-offs.

So know that when we pick something, basically we are also compromising on other aspects, which we may not realize at first but which sometimes show up later on, normally as performance and scalability issues. Another thing that is commonly discussed in the data world: we all have microservices, right? But most of the time, startups start with a monolith, and hence we need to break up our databases, doing more evolutionary schema changes.

You wrote this book, Refactoring Databases, a long time ago, and I think it's very applicable when we split a monolith into microservices. What would be some of your advice for people who are going to embark, or are embarking, on this journey to split their so-called giant monolithic databases into multiple databases? Is there any key advice that you want to give here?

Yeah, definitely. There are a lot of stepwise things you should probably do. I think in Software Architecture: The Hard Parts we put up a five-step process for this. The first step is to identify what those boundaries are. If you are going to take a monolithic application and split it into, let's say, three, then there are probably going to be three domains or three services, and you should decide within your database which service owns which tables.

Let's assume, for simplicity's sake, there are 18 tables. So each service is going to get six tables, or maybe some service gets four and some service gets eight, whatever. What are the tables that belong to each one of those?

So create a boundary kind of situation. Once you have that kind of defined, there are edge cases here, because one service may be writing to a table and another service may be reading from it. So you need to decide who the owner of this table is, so that you can clearly see it. That's the first step. The second step is, in the same database, to start moving the tables owned by service A into service A's schema.

Many database products, especially relational database products, give you the facility to move tables between schemas. You can say alter table, move schema; you can just move it to a different schema and it gets allocated to a different schema. So you are logically segregating, not physically segregating.
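The first two steps above (decide which service owns which table, then move each table into its owner's schema inside the same database) can be sketched as a small Python script that checks ownership and prints the statements to run. The service and table names are hypothetical, the source schema `public` is an assumption, and the generated SQL uses PostgreSQL's `ALTER TABLE ... SET SCHEMA` form; other relational products spell this differently.

```python
# Hypothetical ownership map: service (used as the schema name) -> tables it owns.
OWNERSHIP = {
    "customer_service": ["customer", "address"],
    "order_service": ["orders", "order_item"],
    "payment_service": ["payment"],
}


def assert_single_owner(ownership):
    """Step 1 check: every table must belong to exactly one service."""
    owner_of = {}
    for service, tables in ownership.items():
        for table in tables:
            if table in owner_of:
                raise ValueError(f"{table} is claimed by {owner_of[table]} and {service}")
            owner_of[table] = service
    return owner_of


def schema_move_statements(ownership, source_schema="public"):
    """Step 2: emit statements that move each table into its owner's schema."""
    statements = []
    for schema, tables in ownership.items():
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {schema};")
        for table in tables:
            statements.append(f"ALTER TABLE {source_schema}.{table} SET SCHEMA {schema};")
    return statements


if __name__ == "__main__":
    assert_single_owner(OWNERSHIP)
    print("\n".join(schema_move_statements(OWNERSHIP)))
```

Nothing moves physically at this point; the tables stay in the same database and only their schema changes, which keeps each step small enough to deploy and roll back on its own.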

Yeah. And the monolith is still talking to all of them; probably you will have to set up some kind of synonym or something so that the underlying movement is transparent to the application. Once you have done that, you have logically separated the tables, but the application is still talking to one database and its schemas via synonyms or something like that.

This is where you now start to work on the application side and say, okay, for all of these tables, this part of the code is talking to them. So maybe we should create a separate connection pool for that, or create a separate component that talks directly to that schema. You do that on the application side and you deploy, making sure everything's working. Throughout this journey, our goal is always to iteratively keep working, right? It shouldn't be that you do all these five steps and deploy once. It's every step you deploy, every step you deploy, because if something goes wrong, rollback is easy.

So you do the second step, then the third step. What you have now is three schemas in your sample application, three schemas in the same database that are physically in the same place but logically separate, and the application is

independently talking to them. Service A is talking to schema A, service B is talking to schema B, and service C is talking to schema C. For all practical purposes they are independent. The only dependency is on the one machine, because they are all running on one machine.

Okay, so this is where you start bringing in replication. You install two other, or maybe three other, databases of the same type and replicate schema A to machine A, schema B to machine B, and schema C to machine C. You just set up replication and don't do anything else, and then whatever is happening here with writes gets replicated to the other side. And one fine day, you can just switch connections to those other databases. So now service A is talking to machine A, service B is talking to machine B, and service C is talking to machine C, and you can get rid of this database: break the replication and get rid of this database. Now those become the main databases, and they are all split into separate ones.

So it's a five-step process. There are some complications in this, of course, especially, like I said, when one service is writing to a table and another service is reading from that table. That service now needs to figure out, where do I get the data from? Because it's in a different database.

So you probably have to create some API changes for the reads. Don't make the service do a cross-database connection, because that defeats the purpose of microservices. You have to expose an API to do that, and a bunch of those kinds of dependencies will come in. But it at least gives you a starting path to think about this: how do I split, what are the plans, and how do I get to the end goal?
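The point about exposing an API instead of a cross-database connection can be sketched as follows. Everything here is hypothetical: the two class names, the in-memory dictionary standing in for the owning service's database, and the direct method call standing in for what would normally be an HTTP or RPC request.

```python
class OrderService:
    """Owns the order data; no other service touches its storage directly."""

    def __init__(self):
        self._orders = {}  # stand-in for this service's own database

    def place_order(self, order_id, customer_id, total):
        self._orders[order_id] = {"order_id": order_id, "customer_id": customer_id, "total": total}

    def orders_for_customer(self, customer_id):
        """Read API exposed to other services; returns copies, not internal rows."""
        return [dict(o) for o in self._orders.values() if o["customer_id"] == customer_id]


class ReportingService:
    """Needs order data but only reaches it through the owner's API."""

    def __init__(self, order_api):
        self._order_api = order_api

    def total_spent(self, customer_id):
        return sum(o["total"] for o in self._order_api.orders_for_customer(customer_id))


if __name__ == "__main__":
    order_service = OrderService()
    order_service.place_order("o1", "c1", 40)
    order_service.place_order("o2", "c1", 60)
    order_service.place_order("o3", "c2", 15)
    print(ReportingService(order_service).total_spent("c1"))
```

Because the reader depends only on `orders_for_customer`, the owning service can later move its tables to a different machine without the reader noticing.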

And many a time, what we do is not split everything in one go. You just take one piece of functionality, take it out, and then do whatever needs to be done.

Just take that one thing out. So if you think about this monolith, 10 percent has gone out, or put differently, 90% is still in the monolith, and then you think about what's the next thing I can work on. Again, business value plays a big role here, because the reason to split is generally architectural concerns like scalability, auditability, deployability; a bunch of those kinds of things are being hindered because of the monolith. That's why you want to split.

If scaling is the only problem you're trying to solve, you take one or two of these out. The rest of the application is fine because it doesn't have that much load. The two that you took out have a lot of load, and you can scale them out. You probably don't need to split the rest of it at all. So you have to think about it this way: what is it that I want to split first? Split it out, and then ask, do I still need to split more?

Okay, then what is it that I need to split next? Does the remainder need to be split? If the answer is yes, keep going down that path. If the answer is no, just stop right there, because you don't need to split it. So I generally tend to think about it like: did I achieve what I wanted to achieve, the scalability requirement, the auditability, whatever it is that you are trying to fix?

Once that is fixed, you can stop splitting.

I think in the application software world, it is also known as the Strangler pattern, right?

Yes, exactly. I think that was written maybe around 2008 or something.

Yeah. And I think your advice here is: don't do it in one go. Sometimes we want to do it in one go because we feel it's risky, and we don't want to deal with this kind of work because it tends to take quite long whenever we want to evolve our database schema. So don't do it in one go; do it iteratively. Know what the target is as well, like you said. Maybe we don't need to split everything into different databases. Some could still exist in the monolith, as long as you have the proper boundaries and you define the ownership really well, while for some, maybe different quality attributes require you to actually split them, and hence you can do this migration.

And I think all your books here have the recipes. In Software Architecture: The Hard Parts, like you mentioned, there are five steps you can use to think about how to deal with splitting databases, and dealing with eventual consistency and transactions is also covered there.

And also, if you deal with RDBMS and you need to evolve your schema, you can always go back to your classic book, Refactoring Databases, where you have splitting columns, renaming columns, and things like that. There are the 70 patterns that you mentioned in the beginning. So for people here who are dealing with this data journey, please do check out Pramod's resources and books. And also, lately, about data mesh, I also had an episode with Zhamak. I think it's also becoming trendy, especially mixing the operational and analytical concerns, and also how to get data out of it easily. So I think all these are definitely some of the trends in the data space.

So Pramod, thank you so much for this conversation. I really loved it and learned a lot about how to deal with data better. We are reaching the end of our conversation, but I have one last question that I normally ask all my guests, which I call the three technical leadership wisdom. You can think of it as advice for people here to listen to and learn from you. What would be your three technical leadership wisdom?

The first one, I would say, is keep learning new things, whatever it is, in your own work area or outside of work. Keep learning new things. The second is, make sure you're always mentoring someone. It could be other developers, it could be the data people, it could be someone outside.

Make sure you're always mentoring others. I have found that mentoring others gives me new perspectives on things, because I may be telling them things, but I'm learning a lot while I do. I find that really interesting; I do enjoy mentoring. The third is, and I have said this probably three times already: whatever decisions you're making, always involve the business, because they have cost implications. As technology people, I don't think we can decide on cost.

It is the business's decision to make, so give them the choices, because sometimes they don't know the choices and they just assume. If you use big words, like do you want ACID versus eventual consistency, they have no idea what you're talking about. So give them the choices and show them a rubric, or show them a comparative analysis: if I choose this, these are the implications; if I choose that, those are the implications; and here's the cost for A versus the cost for B. Now pick one.

That gives them lots of information on which they can make a decision, and it will make you a better technologist, because you are thinking as a neutral person, giving them both choices. That forces you to do deep analysis and research when you're putting those two choices together. If one choice has lots of pros and the other choice has lots of cons, then they know that you basically want this one and not that one. And that makes you focus more on how do I analyze this? How do I make sure there are proper pros on both sides and proper cons on both sides, so that people can make a decision? I think it makes you a better technologist when you let someone else make the decision and you give them all the information.

Wow, I really love that. A better technologist is not the one who makes the decision, but the one who gives people the chance to make the decision with well-informed information from us, from the technology side of things. Very lovely. And also, as you mentioned, sometimes the business doesn't know what options exist. All they know is jargon here, jargon there, and they will just leave it to developers to decide. So I think as developers we have to give them the information in a way that is easily understandable. And decisions should be taken collectively, I would say, not just by the business but also with the technologists.

What are the implications of the decision? I think that's where we need to focus. Making decisions is easy; you could probably just toss a coin and make a decision. But what the implications of the decision are downstream, its effects, is the key part.

Yeah, correct. Implications are always something that we regret whenever we reach that stage. So I think the implications should be well thought out and discussed in the beginning. So Pramod, if people would love to continue the conversation, or would like to reach out to you to ask about data-related stuff, is there a place where they can contact you online?

Yeah, I do have a Twitter handle. It's Pramod Sadalage, @pramodsadalage. I also have a website, sadalage.com, so you can reach out to me from there. It's very easy to spot me on Twitter, LinkedIn, and my website. Any of those forums are fine, and I generally tend to reply, because I am super interested in data and its effects on people.

All right, thank you so much for your time. I really loved our conversation. And again, I hope people here are now much better educated about dealing with data and don't think it's a hard problem anymore.

So thanks again, Pramod.

Thank you. Thank you for this time, and thank you very much.

Thank you for listening to this episode and for staying right until the end. If you highly enjoyed it, I would appreciate it if you share it with your friends and colleagues who you think would also benefit from listening to this episode. And if you are new to the podcast, make sure to subscribe and leave me your valuable review and feedback.

It helps me a lot in growing this podcast. You can also find the full show notes of this conversation on the episode page at the techleadjournal.dev website, including the full transcript, interesting quotes, and links to the resources mentioned in the conversation. And lastly, make sure to subscribe to the show's mailing list on techleadjournal.dev to get notified of any future episodes. Stay tuned for the next Tech Lead Journal episode. And until then, goodbye.

Transcript source: Provided by creator in RSS feed: download file