¶ Intro / Opening
So it's no longer the case that you just need one technology that will do all of it. In fact, in the cloud, you will have completely dedicated technologies which are meant to do exactly that one thing.
I would say it's actually par for the course to re-engineer. That happens all the time. The software does need to change. I think that's, in my view, inevitable, but I think the design principles hold the test of time.
For instance, if you're basing business decisions on top of it, you have to absolutely give the guarantee that the data is correct. If the data is wrong, then you're making wrong decisions and potentially affecting people's lives in some form.
Hello, and welcome to Data Chatter, the podcast on all things data. This podcast is a series of conversations with experts and industry leaders in data, and each week we aim to unpack a different compartment of the data suitcase. I am your host, Karthik Shashidhar. I'm a blogger, newspaper columnist, book author and a former data and strategy consultant, and I currently head analytics and business intelligence for Delhivery, one of India's largest logistics companies. You can follow me on Twitter at karthiks, that is k-a-r-t-h-i-k-s, and read my blog at noenthuda.com, that is n-o-e-n-t-h-u-d-a dot com. All opinions expressed in this podcast belong to me and my guests, and do not reflect the views of any organizations we might be associated with. Nothing discussed in this podcast is to be taken as financial advice.
I work for Delhivery, one of India's largest logistics companies. At the time of recording, we deliver around 1 billion packages each day, and each package gets scanned at least about 20 times as it makes its way from origin to destination. So you can imagine the amount of data that we are dealing with. It's firmly in big data territory, and as head of analytics, I have the job of making sense of all this data. The first step in making sense of data is to organize it effectively.
Today's guest is Rangarajan Vasudevan, founder and CEO of The Data Team. Ranga was my classmate at IIT Madras and then went on to do a master's in computer science at the University of Michigan, Ann Arbor. He was a founding engineer at Aster Data Systems, which was acquired by Teradata in 2011. In 2015, Ranga started The Data Team, which is building AI solutions for customer intelligence. He is also a guest professor at IIT Madras, where he teaches a course on big data. You can follow him on Twitter at rangavasudevan, that is r-a-n-g-a-v-a-s-u-d-e-v-a-n.
¶ What is Big Data?
What is big data?
Well, I think the original connotation used to be about very large-scale data occurring at very high velocities. There was also the added connotation that this was data beyond what classical databases used to store, data that did not fit the row-and-column format of an Excel sheet or a spreadsheet. But that was when the concept originated, more than a decade ago, a decade and a half ago. Nowadays, if you really think about it, most of the conversation is still around volume and velocity. That's definitely there. And then there is this added implication that so many types of data are being generated in the digital world, some of which are not very conducive to analysis by regular methods, the kind you would do with your favorite visualization or BI tool. So it's a broad term that encompasses all of this. For the most part, I think volume and velocity have stuck on as the key properties.
Okay. So let's just get to the problem statement. Nowadays you have a lot of data flowing around. It's easy to collect the data, and it's easy to store it because of so-called big data, the cloud and all that. Now, the reason we collect data and store it is the hope that someday we'll be able to make good use of it. That someday could be tomorrow. There are different kinds of analysis that you can do on the data. So even with the cloud and big data, how do you organize the data? What are the different mechanisms by which data is organized?
And how has that evolved over the last 10 to 15 years?
Yeah, I guess there are a couple of things there. Firstly, one of the key motivations for why big data makes sense the way it does right now is that storage has become cheaper and cheaper as the years have progressed. So the question of whether to store the data has become almost a no-brainer for most companies, and even individuals. You get a 1-terabyte hard drive for what, three hundred bucks? That's really cheap. So then the question is: is it enough that you're able to store the data? Obviously not. But at least a decade ago, it used to be the case that people were throwing away data.
And that was because the traditional methods of storing data, which were databases, or even storage area networks and network-attached storage, were all fairly expensive when it comes to the form factor versus the amount of money that you were actually paying for it. As technology evolved, storage became more commoditized and the storage equation changed. So it's no longer considered bad practice to just store however much you want.
But then, I think the crux of why big data makes sense in conjunction with something like the cloud is that, along with being able to store, you also need to be able to plug and play extremely different types of access. Not all forms of access are going to be applicable for every consumer or every user of the data. Some are going to be really proficient with hands-on programming, while some, like the business users, will just need snapshots of what they need. And there will be people in between who know some amount of programming, let's say declarative stuff like SQL querying. And there will always be some folks who are data scientists, machine-learning engineers or statisticians, who don't really understand the weeds of distributed programming or anything like that, but who still need to be able to access the data. So what emerged was this need to unlock agility. And the unlocking of agility first started with Hadoop and technologies like that, which were all predominantly on-prem, data-center kinds of technologies.
So Hadoop emerged as an on-prem thing? It's not cloud native?
Hadoop, when it first came in, was done by Yahoo to basically do a better job of indexing and retrieving their internal search data. And this was all done on-prem. In fact, at that time, Hadoop became popular because the old-school ways of doing things, the classical data warehouse and database ways, were just extremely restrictive. The technology was getting in the way. Hadoop was one of the first big data technologies, so to speak, which didn't get in the way, in the sense that it allowed anybody to put all kinds of things on top of it, to get answers, to process and do things with it. At a fundamental level, that was a very, very attractive proposition. So as a result it caught on like wildfire. Obviously the commodity pricing model helped as well. It became extremely popular, but then it imposed other kinds of costs, which maybe we can discuss separately. Examples are the costs related to data processing, and what it means to reason about data correctness. All of that became problem statements.
Now, with cloud: the journey of cloud started off as, you know, let me just rent you a virtual machine. But as things progressed, as big data evolved on its parallel track, cloud started to pick up and say: look, I have all these years of experience of having done big data well, and of understanding all the ways in which big data is failing. And I know, of course, how the classical technologies are really black boxes to people. So let me learn from the best of both worlds and bring in those features and benefits. That is, I guess, the single biggest value proposition that cloud has brought to the data table: you get all the agility of a variety of different types of access patterns, and at the same time, if you have the need for speed, you can organize the data in a certain way that is extremely optimized for it. So cloud has now gotten the best of both of these old-school worlds and added agility. At the same time, it has gotten its own complexity. With cloud, things are never as easy as they seem on the surface, so that complexity is also something you have to plan for. So I would say some things have become easier, but there are newer challenges that have come in because of cloud.
Got it. Okay. Now, again going back to fundamentals.
¶ Why do we need to store data?
I think we'll keep going back to fundamentals a bit here. So we are collecting all the data that we can get, and we want to store it in a way that is easy to access and analyze later on. From what you have seen, what are the different use cases for which people want to access data? Very broadly speaking, across domains, what are the different kinds of things that people want to do with data that has just been collected and dumped in one place?
Yeah, I think the first thing that everybody across the enterprise wants is to reason about what exists. That's the first use case, and it's almost a prerequisite step, even before you start to plan other kinds of things. But in the classical big data world, even reasoning about what exists is actually a very hard problem. The reason being, firstly, that the volume of the data is just so large, and secondly, that it's spread out. And at the end of the day, it's just a file system, so whatever you put in is what you get. So garbage in, garbage out: those kinds of issues are also there.
So that's the first use case. The second use case is that you now have to start to know what has happened recently, and how that compares with what has happened in the past. That's a very common question that most decision makers at the business level ask. That's what we would call BI, reporting and things like that. Then, in the third set of use cases, you start to have the analytics types who say: I have this hypothesis, let me go test it out. So they want to compare what happened, let's say, last Diwali versus this Diwali, and see what the differences are in fraud, as an example. That's very exploratory, simulation-led. Then the fourth set of use cases is people who want to learn from data and predict things in the future. That's the data science, model-building kind of type.
Then you also have the special functions. I would say these special functions are things like monitoring overall risk to the company, overall financials, or even answering questions posed by regulatory bodies. That's a very common task. So you start to get into those kinds of special requests, and those are also critical use cases. Could they be put as a subset of analytics? In some sense, not quite. It's actually more of a data dump, as opposed to anything else. That's why it doesn't fall neatly into one of those other categories. For example, as a telecom provider you could get an FIR asking you to share the data of 15 listed numbers, mobile numbers. That's regulatory. You have to give it in time to the police authorities.
So there's no analytics. You just have to dump it.
But that could be from any time in the last seven years, as an example. And the last use, one of the very important ones, is: how can I keep improving my internal systems by analyzing the exhaust that's coming from them? And that is also being stored in the same place.
So now, the way I see it, the way you access data for each of these different things is very different. For example, for a BI use case, you probably only need the last few days of data for some things, and for other things you'll need a longer time period. So you take a little from here, a little from there, and so on. It's a different sort of access. And for the special functions that you mentioned, the access is different again.
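As an illustrative aside, and not something from the conversation itself, the contrast between a BI-style read and a regulatory-style read can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Scan` record, the sample data, and the function names `recent_counts` and `archive_lookup` are hypothetical, and a real system would push these filters down into the storage layer rather than scan a list in memory.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Scan:
    """Hypothetical record: one scan of one package at one hub."""
    package_id: str
    day: date
    hub: str


def recent_counts(scans, today, days):
    """BI-style access: aggregate only the most recent `days` days."""
    cutoff = today - timedelta(days=days)
    counts = {}
    for s in scans:
        if cutoff < s.day <= today:
            counts[s.hub] = counts.get(s.hub, 0) + 1
    return counts


def archive_lookup(scans, package_ids, start, end):
    """Special-function access: raw records for a few keys over a long period."""
    wanted = set(package_ids)
    return [s for s in scans if s.package_id in wanted and start <= s.day <= end]


TODAY = date(2021, 6, 30)
SCANS = [
    Scan("P1", date(2015, 3, 1), "Chennai"),
    Scan("P2", date(2021, 6, 28), "Bengaluru"),
    Scan("P3", date(2021, 6, 29), "Bengaluru"),
    Scan("P4", date(2021, 6, 30), "Delhi"),
]

# Dashboard question: what happened in the last three days, per hub?
bi_view = recent_counts(SCANS, TODAY, days=3)

# Regulatory question: hand over everything on one package from the last seven years.
old_records = archive_lookup(SCANS, ["P1"], date(2014, 7, 1), TODAY)
```

The first call touches only a narrow, recent slice and aggregates it. The second returns unaggregated records for specific keys from anywhere in a long history, which is why one storage layout rarely suits both.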
¶ Principles of data architecture
There, it is like going to some particular place in the past and getting stuff out of there, like going to some unused basement and finding some file, like they show in the movies. So I assume that the way you need to store and structure your data is very different based on which of these needs are dominant, and I assume there's a trade-off. So what are the real trade-offs here?
What are the technologies, or rather, I wouldn't call it technologies, what are the principles according to which you organize the data to optimize for some of these different combinations and use cases?
I really like the word you used, fundamentals, because I think this is exactly a fundamental design principle for data architecture. You have to think about all the different use cases that typically exist, just from a consumption point of view, and cater to them with a very important design principle: what is the least common denominator that actually serves all of these needs? If you take that LCD kind of approach, you will see that it leads naturally to the insight, which is: let's just collect all of it together, albeit with some curation. You can't just dump it all and figure it out later on. You have to put some curation, put some boundary around it. And that is not just a property of the use case, but also just good hygiene. You need to know how you're getting something, otherwise you do not appreciate the value of getting it. That's almost a life principle. But if you operate with the LCD thought process, like Occam's razor or something like that, the first thing that happens is that you build the big data layer. Now, once that layer is done, each of these different use cases branches out into its own storage options and optimizations. And here, the nature of the technologies has completely evolved over the last 10 years. It's no longer the case that you just need one technology that will do all of it.
In fact, in the cloud, you will have completely dedicated technologies which are meant to do exactly that one thing, and they also have a very good price point to make it justifiable for you to invest in that kind of technology from the first day. So if we just go use case by use case: for the BI side, you would absolutely need a way to curate data even more than what you've done in the big data layer. And here, the curation that you're looking to do is not just to reason about where the data came from, but also to reason about what different representations of the same data look like, and therefore how you triangulate and promote the one representation that you want for your business. That's a crucial task. In the conventional sense, it's called data integration. That's a crucial topic that you have to address, and there are technologies which are optimized for, and very good at, doing data integration. Here again, there are two angles to it. Many businesses are okay looking at data as of yesterday, looking back in time. But then there are always businesses, for instance, I guess, your own employer, who want real-time data.
Yep, I guess data on what is happening now. Not necessarily analytics and machine learning, but just: give me today's data. We are a logistics company. If you ask me where your package is, we need to know where each package is in real time. So obviously we need the data.
Exactly. And you want to know how many packages are being missed because of the storm happening in Gujarat. So those are extremely important real-time data. I would say that real-time data also used to be a problem with classical technologies, but nowadays there are options available even for that, in conjunction with the decision-making that needs to happen on the BI side. So on the BI side, this is what one set of technologies would be doing.
For machine learning, obviously there are many technologies, each having its own pluses and minuses, but at the heart of it, they all work off the LCD, which is the big data storage layer. Very few machine learning technologies really require you to curate an additional storage layer.
Maybe there are optimizations which are useful, like, for instance, being able to stand up data in memory so that you can iterate much faster on it. That's a pretty useful technology. Some technologies have it.
The other angle here is if you want to do really large-scale, complex data science, for instance, apply a deep learning network. Then you need other mechanisms of exchanging data, so that the volume does not flood you and the task actually has a reasonable chance of completing. There are technologies available for that as well. But those are, again, niche technologies, not really the kind that you would use on a daily basis. And in some of those technologies, obviously, the compute plays a role as well: you would use GPUs whenever you can, so that things get done faster.
Among other things, on the regulatory side, the special function side, what you're looking for is effectively a way to go back to archives, like I said, and look for that specific needle in the haystack, or a very specific type of query which is not something that the storage is optimized for. In those kinds of setups, it's more important that you give an answer, and less important that you give the answer immediately. So the trade-offs are different there.
Okay. So, going a little more into this: obviously,
¶ How does data evolve as companies evolve?
different companies will have different use cases. One company may not be doing any machine learning, for example, and so might have organized its data in one way. But companies also evolve over time, right? So for example, we might just start by doing some tactical dashboards and stuff.
Then later on, we decide to add machine learning, and then find that the cost of the data scientists' queries is very high, either in terms of time or in terms of dollars or whatever. And then, let's say, another day we start an analytics function. It's a typical evolution that a lot of companies go through. Maybe not mature companies, but at least a lot of smaller companies go through this evolution. So what do you do?
How do you engineer this? You can't keep re-engineering every few years, and it's difficult to predict as well. So how do companies deal with this?
Actually, I would say it's par for the course to re-engineer. That happens all the time. Like I said, most companies evolve organically, and while that happens, the needs do change, and as the needs change, the software does need to change. I think that's, in my view, inevitable. But I think the design principles hold the test of time. They hold true. It takes a really prescient leader to think from day one: let me collect everything, and I'll figure it out later. If you take that kind of an approach, then the LCD, which is the big data storage layer, stands the test of time, because that's going to be there, come what may. Then at various points, you branch off, obviously. And one of the key evolutionary decisions that most companies go through is: I've always collected data, but suddenly the business has evolved, so I need to get my answers in real time. So what do I need to go back and change? And the answer is that you are going to change everything from the source application onwards.
Because real time is not just a property of your pipeline running faster. It's also whether the source can actually give you the data in the first place, in a way that's much faster than whatever it was originally meant for. Those are all critical decisions you have to go through, and that re-engineering is inevitable there. But the part around agility is a very key thing, because what you don't want to do is get stuck with a technology, especially if it's proprietary or has black boxes that prevent you from doing additional flexible things. More importantly, that has a cost of compliance. There is a cost of technical debt that you incur because you have not thought through the consumption pattern, and you live with that cost until the point that you're ready to throw it away, or until the point you're ready to bring in a brand-new technology or a brand-new use case. So it becomes more a case of risk aversion and cost optimization than a case of whether you can choose the right technology right now. You have to think about that angle as well when you decide something right now.
Yeah, so you're saying that re-engineering is fairly common.
Okay. Because what I find is that sometimes some new guy will come in and want to look at the data in a very different way, for example, and that will be very different from what has ever been done in the company, and the current structures are just completely unsuited for that, or impose a high cost. So in case this new way is going to be sustainable, would you recommend that a complete re-engineering, a retooling kind of thing, happen?
Karthik, I think that's a very important question to consider from multiple angles. The first angle is technology, naturally. There's no doubt about that. But the second angle is also the sustainability of that kind of an approach.
So the question arises, and it's a very common question that most enterprises go through: are you stuck with a technology which only one person understands? That's a pretty common problem in the world of data processing and software. It used to be the case that R was extremely popular, and many people still use it for analysis and statistics. And now you have these graduates who are just weaned on Python for everything, and they tend to think that statistics, for instance, is all possible just within the world of Python itself. Now, suppose you're a company which has built up an entire data science or analytics team that's just based on R, and for whatever reason they have moved on to their next greener pastures, and then you hire all these Python graduates. What do you do with all that R investment that you have already made? You have to think about those pipelines all over again. Especially with the cloud coming in, newer releases of software, software as a general term and not just data science, are happening almost on a monthly basis. Each one has a very good feature, and these are being tested at massive scale at enterprises much bigger than yours. So it becomes almost a no-brainer to adopt: Netflix is doing it, so let me do it. And the moment you do that, you start to realize that some of the pipelines that you have built are obsolete. You have to re-engineer them. So re-engineering is actually the mantra in the cloud world, I would say.
¶ Data warehouse and data lake and data marts and other jargons
I just want to touch on some of the technical terms around this, which I've never really understood. The common term is data warehouse. Then there's the data lake. I remember some guy telling me a few years back that data warehouses are now obsolete and everything is the data lake. But the way I understand it, both of them are coexisting things. So what is the difference between them?
In layman's language, what is the difference in terms of how the data sits, depending on how you choose the architecture?
I would say that the data lake is probably the least common denominator we were talking about. As the name suggests, it's the place where data naturally gravitates towards. The analogy is of a water body: you have all these rivers coming in, or rather all these streams and tributaries. They gravitate towards the point where there is an equilibrium, where the water can accumulate, and that becomes the central place from where the water further disseminates into multiple streams downstream. That's the analogy, and it is quite valid for the data lake as well. Now, obviously, by its very nature, because data is gravitating towards it, you don't typically control the quality of the data that's gravitating to the data lake. So it's not something where you say, I'm going to allow only pure data. That doesn't make sense, because with the lake, the concept of purity is a latter-day construct. You don't put it up right at the beginning when you're trying to collect. And the concept of purity can change over time: what is pure now may not be pure tomorrow. In fact, there is merit in storing your raw data as well. Of course, there is the other angle: you need to know what you are improving. So that's the data lake. Now, the data lake by itself cannot guarantee things like purity and quality. You need to have mechanisms for doing that. The moment you put in those mechanisms for guaranteeing purity and quality, you start to construct more curated data sets. Those curated data sets serve more specialized needs, like one of those use cases that we talked about. At that point, you start to think about how specialized this should be, and what property of curation is being critically relied upon by the business consumer.
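To make the idea of a quality mechanism sitting between the lake and a curated data set concrete, here is a minimal Python sketch. It is an editorial illustration under stated assumptions, not a description of any particular product: the record fields, the two checks in `CHECKS`, and the function name `curate` are all hypothetical.

```python
def curate(raw_records, checks):
    """Split lake records into a curated set and a rejected set.

    Each rejected record is paired with the names of the checks it failed,
    so the reason for exclusion stays visible. The input is not modified.
    """
    curated, rejected = [], []
    for rec in raw_records:
        failures = [name for name, check in checks.items() if not check(rec)]
        if failures:
            rejected.append((rec, failures))
        else:
            curated.append(rec)
    return curated, rejected


# Hypothetical quality rules; a real rule set would come from the business consumer.
CHECKS = {
    "has_package_id": lambda r: bool(r.get("package_id")),
    "weight_positive": lambda r: isinstance(r.get("weight_kg"), (int, float))
    and r["weight_kg"] > 0,
}

# Hypothetical raw records, stored in the lake exactly as they arrived.
LAKE = [
    {"package_id": "P1", "weight_kg": 1.2},
    {"package_id": "", "weight_kg": 0.5},
    {"package_id": "P3", "weight_kg": -4},
    {"package_id": "P4"},
]

curated, rejected = curate(LAKE, CHECKS)
```

The raw records stay in the lake untouched, which matches the point that the definition of purity can change later: the same lake can be run through a different set of checks without having lost anything.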
So for instance, if you're basing business decisions on top of it, you have to absolutely give the guarantee that the data is correct. If the data is wrong, then you're making wrong decisions and potentially affecting people's lives in some form. So that process of curation, and the creation of that curated data set, is what in the classical world would happen in a data warehouse.
The data warehouse becomes this enterprise-wide, single curated zone of truth. That's the classical definition of a data warehouse. The reason it's enterprise-wide is that it is meant to serve all business functions: HR, supply chain, payroll, regulatory compliance, and of course the top-line function, sales, and things like that. Because it was enterprise-wide, it had this very grand appeal for the classical CTOs, CIOs and so on, who would think of it as a multi-month, multi-year kind of project: let me just build it out once, and if I figure it out once, from that point on I can wash my hands of it and incrementally keep building things, and it will always survive the test of time and keep giving me the right information.
Now, while that worked wonders for some of the large, mature companies which have a very good process of curation, what ended up happening was that the smaller, I would say nimbler, companies, companies born in the digital era, didn't need the enterprise-wide view. For them, HR was a completely different function from the top-line function. So it became more important for them to build this for the top-line function, like the business and the marketing function. That was a lot more prioritized, because HR and the others could wait. They could still operate out of Excel sheets. Not a problem. So then you start to think: can I create curated zones like this, specific to a business, specific to a function? And that's where this concept of a data mart came into play. In the older days, data marts always existed, but in the newer age it became much easier to create data marts, because the needs were all completely disparate. It wasn't as though there was one person responsible for all those needs being served together. Each function started to do its own thing, because they had to build their own agility and their own speed of decision-making. So data marts became a very popular activity.
Right. So the data lake is where all the data goes, and from there you put in some quality checks, you make sure that the data is of a reasonable standard, and then it goes into the data warehouse. And the data warehouse, from my memory, predates big data, correct me if I'm wrong. It's a very old concept, I guess. And the data warehouse used to be, traditionally, enterprise-wide. But now data marts serve particular businesses. So how does this connection happen? Is it data lake straight to data mart, or does it flow through the data warehouse? And also, what happens when the data evolves?
Obviously like, I mean, I think the assumption that you build it, once it serves you forever. It's pretty much doesn't work in most places. In a few places it right away in most places. It doesn't work. So, does that mean you keep updating the data warehouse, or after point? You just give up on it and just let it be where it is. Send, like construct new data, Marts in things like that. What do you what is, what is the way to go about it? I mean, I guess it gets a bit
controversial, right? See, my view is I don't think data warehouses make sense anymore. Okay. Except for very specialized circumstances, right? Where it's a very heavily regulated industry, it's extremely matured, and therefore there is more merit in centralizing the view of the
data and then decentralizing. Yeah. But if you look at the modern data-driven enterprise, the mantra is that each function is unto its own, right? They make their own decisions and their own schedules, and that makes agility work. Yep. And for that to happen, the dissemination of the data is a lot more important than introducing latencies and blockers in order to centralize data, right? So the centralization has already happened in the data lake.
So why create yet another centralized zone? Better you just disseminate as soon as possible and let the functions themselves build their own views of what they want to do, right? Yeah. They live by it and they die by it, and you allow that to happen. So that becomes a lot more helpful as a paradigm. So, directly answering your question: I don't think the warehouse makes sense anymore, in the classical way of looking at it.
It makes a lot more sense to just take the data from the data lake and directly create these purpose-built data marts as soon as possible. But isn't the downside of that that they could create silos? For example, at a company I consulted for some 10 years back, there were two data sets: one was maintained by the
finance team, one by the HR team. An external person got access to both, merged them, and got some insights, which they couldn't validate because the finance team couldn't see the HR team's data and vice versa. So a data warehouse, I guess, is one place where
¶ How to avoid silos, and whether to centralise data engineering, analytics, etc.
everything is there. But with data marts, how do you solve this problem of silo-based thinking? Yeah, and I think there are two or three simple things you can do, right? One is that the data marts could be, for instance, pattern-determined, so to speak, right? Wherever you see that there are multiple functions requiring access to the same view of the business, it makes sense to just create one business data mart which serves that view.
And then each of these different functions takes its own cut of it, and that makes sense, obviously, right? Then the other angle here is that the data warehouse could still be a very relevant technology in that kind of a setup, when there are many functions which want the same view. But I think the crucial difference from the classical way of looking at it is this: you don't need to boil the ocean to build that enterprise-wide
centralization, right? Yeah. You could just start with something very simple, driven by the requirements of, say, three functions that want to look at the same data. And as the functions get onboarded, as more and more people want the data, you start to see that there are patterns, and then you start to weave them back into this formal layer. So logically, you could eventually still end up building an
enterprise-wide data warehouse, but I guess the gist of what I'm trying to say is that you don't need to start by building it, right? Yeah. Okay. So that completely makes sense: start with a couple of basic use cases and then incrementally build it out, rather than spend a year just building a data warehouse, which today I don't think any company has the time for
that, and things like that. When you have to keep creating data marts and keep maintaining data marts, how do you see this organizationally? Do you see there being a sort of central data engineering team which should do this, or how should this be structured? Yeah, I think the key is, again, what is being guaranteed to whom, right? That's the crux of the requirements, right?
If you want to just boil all that down to one question. Yeah. Which is that, if you're guaranteeing that I, as a creator of a data lake, am giving you access to any and all data produced in the company, that's a very strong guarantee. And if you're able to give that guarantee, then that completely decouples anybody who wants any data from having to go talk to the source system directly.
Because the last thing you want is every function talking to every source and saying, give me a copy also, give me a copy also, right? Because it becomes a criss-cross: it's an order n-squared problem. And it's extremely bad, right, as a principle in whatever we do in technology, if you say it's order n-squared. Imagine if now people are talking to each other like that, right? A nightmare, I guess. Yeah.
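To make the order n-squared point concrete, here is a minimal sketch in Python. It only counts connections; the function names and the sample sizes are illustrative assumptions, not anything stated in the conversation. It compares every consuming function pulling its own copy from every source system against every source feeding a data lake once and every consumer reading from the lake.

```python
def point_to_point_links(num_sources: int, num_consumers: int) -> int:
    """Each consumer asks each source system for its own copy of the data."""
    return num_sources * num_consumers


def hub_links(num_sources: int, num_consumers: int) -> int:
    """Each source feeds the lake once; each consumer reads only from the lake."""
    return num_sources + num_consumers


# Hypothetical sizes, chosen only to show how the two counts diverge.
for n in (5, 20, 100):
    print(f"n={n}: point-to-point={point_to_point_links(n, n)}, via lake={hub_links(n, n)}")
```

With equal numbers of sources and consumers, the first count grows as n times n and the second as 2n, which is the decoupling the data lake owner's guarantee is meant to buy.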
So from that standpoint, the creator of the data lake, or the owner of the data lake, is actually playing an extremely crucial role, because he gives a guarantee: now you don't need to talk to the source anymore, right? Just talk to me. I'll give you whatever data you want, at whatever frequency, whatever velocity, et cetera, right? Yeah. So that's one guarantee. Then the next guarantee is that the moment you're
putting in curation, what you're doing is actually also interpreting the data in a certain way, and therefore it's not going to be applicable for all functions, because some of the functions might want to interpret it in their own way, right? Yep. So some of the best companies I've seen tend to take a decentralized approach when it comes to consumption. Okay. The engineering team,
they just say: look, my job is to guarantee that the data is available, and I've done that. From now on, if you want to build things on top of it, just go. Okay. I've made these interfaces democratic, right? Anybody can consume, and whenever you consume, I'm going to charge you back, right? It's like a chargeback model. Yeah. So I'm going to charge it back, saying, okay, you consumed this much data. But then whatever you want to do on top of it,
that's up to you. You spin up your own engineering team. And I think that works better because it decouples the agility of one function from another. Yep. Yep. So again, in terms of organization, I'm coming to my specific need, which is analytics. Some companies have one centralized analytics team, which takes care of all the analytics needs of the company.
Other companies, sort of like mine, have multiple analytics teams: there might be some centralized teams, but there will also be small analytics teams, a marketing analytics team, an HR analytics team, and so on. So it's a good debate in analytics, but also in terms of engineering: do you think data engineering, again, gets split across teams like this, especially in terms of creating
these data marts? Let's assume that the data lake is owned centrally and gives you the guarantees, so the order n-squared problem is solved, and the data lake is a good source of truth for everyone. But for people to build their own views, not everyone is adept at writing SQL queries. So how do you sort that out? What are the trade-offs there? Yeah, I think the moment you start to think of more than one
function, right, and more than one function wanting the same curated view of the data, then I think that's the right time to ask this question: where should that curation happen? Okay. Who decides which team is doing the curation? Should the team report to one function or the other function, or should it be a separate team which has its own reporting lines? And that's an
organizational question. I don't think there is any right answer to it. But in my view, the data exists to serve the business, right? Yeah. That's the fundamental premise. So if that's the case, then the team that works on the data should also be very closely aligned with the business. And by that extension, analytics, reporting, data science and data engineering should all ideally be aligned to the
business. Yeah. Not be a centralized thing that's reporting into IT. That's my view. To be more specific, the engineering that's required on data ultimately comes in the form of requirements from the analytics team, which in turn gets its requirements from the business team. That's the logical flow of requirements. In which case the data engineering team is actually not
operating in silos, right? It's actually operating very closely aligned with what the business eventually wants to see. Now, is it the enterprise-wide business that we're talking about, or is it just a functional business? That's the call that you have to make, and accordingly you have to take a call on whether that team, or in fact that set of skill sets, needs to be centralized or needs to be aligned to the function.
And I guess the other variable here is cost, as in dollar cost, right? Because if you have a lot of data marts, then you might end up having the same data duplicated in multiple places, which in effect can push up your costs, rather than getting everybody to subscribe to the same data warehouse. Absolutely. Absolutely. And I think there is a very easy way to deal with it.
It's a classical concept that has existed from way back in the days of warehousing, which is of course data governance. So you need to have a governance team, and the governance team plays a critical role in ensuring that there isn't too much abuse of the data going on, in the form of things like people using it for nefarious purposes, or purposes beyond what it's intended for, especially with things like GDPR and privacy laws coming into play.
But it also plays this very simple role, which is tracking what data sets exist across the company. Okay. And for every way in which people process data, is there another copy of it which is resulting from that processing, from that interpretation? Those are also actually part of the remit of the data governance team. And that data governance team should report into what, in the classical world, would look like a risk function, right?
Yeah. Because ultimately you're basing business decisions on data, and if the data is wrong, it amounts to a risk, right? So group that data governance team as part of the risk function, or enterprise risk, and then they are responsible for ensuring that nothing untoward happens, or that people are not wasting time or resources or misinterpreting data. Okay, got it. Now, I think we will take another step back.
I think a while back you had mentioned offhand that we would discuss this later: that Hadoop imposed other kinds of costs.
¶ More on Hadoop
So I'm taking a very big leap from what we were discussing about governance and so on. What is this other set of costs that Hadoop imposed? Yes, perfect. So I think, like with all technologies, there are pros and cons, and Hadoop had its fair share of cons as well, right? I think the biggest one was the fact that technologically it was such a complex beast
to manage. It didn't have the slick form factor of, you know, let's just deploy something from the internet and then we're done, right? You had to get a variety of different things right, in sequence, oftentimes, and if you didn't do it right, then you had to go back and start again. Now, the good part is that because it's open source, there was more than enough support available, right, in the form of community brethren.
People were trying to do the same thing. But that didn't really solve the pain, right? The pain was the fact that it was just a technologically complex thing to manage. That's one thing. The second thing is that the problem with the, you know, open-to-anything access pattern is that you're not really optimized for anything either. Yep. So what this meant is that while it's great for discovery kinds of use cases, data science use cases,
things that just need to run without needing to run immediately, right, the moment you start to impose business criticality, like you have to have the answer now, the answer has to be correct all the time, that kind of thing, then you start to run into problems where the technology is not necessarily supporting it. Yeah. And then you have to layer your own technologies on top of it, right? So what ended up happening was that an already complicated beast
now started to have this whole plethora of umbrella technologies that were required in order to serve very specific access patterns. And guess what? Because the landscape was so wide, at any point in time you could have a failure because one thing was incompatible with another. Yes, of course. Yep. Right, so it became a problem of how do you keep this entire mass of software certified against each other as everything evolves in parallel, right? Yep.
As a result, there was a lot of fragmentation, and that fragmentation resulted in enormous confusion in the minds of end enterprises, right? What to think about now? These technologies are all fighting to do the same thing; what do I do? So architecture became a problem. How do you reason about something? What was very easy to reason about in the past, now you couldn't do anymore. You needed to have a separate skill set to reason about
things. And I think the last thing I wanted to point out in the Hadoop world, which came in much later, was that as newer technologies became a lot more popular, especially on the cloud, because of the very fact that it's community supported, the moment the community went away, yep, you were stuck with something that was not just obsolete from a technology standpoint; it truly didn't have any kind of support anymore.
Yep. So what if it's open source? It doesn't solve the problem, right? Just being able to peek inside the code doesn't mean your life is any easier. Yes. Oh, yeah. There was a fair amount of fallacious thinking there, right? Open source means white-box access, and I don't need to worry about being locked in. But actually open source is also a form of lock-in, right?
So then that became a huge challenge for many companies, and in fact a couple of the very popular libraries, which actually started off in a big way, were all just recently announced by the Apache Foundation as being end-of-life now. Okay. So Hadoop brought in all these problems that you mentioned, and I'm sure we have figured out a solution for that. So what has replaced it? How did that transition take place, and so on?
Yeah, and I was kind of mentioning this before: the cloud players, right, really were clever about it. They just took the best of both worlds, right? So even from Hadoop, they took all the pros of flexibility, agility, open access patterns and things like that, and then they papered over the cons side of it. See, one of the things about cloud, right, is it's really driving home the value proposition that administration of software, administration
of technology, is no longer something that any enterprise needs to spend money on, right? That's actually a very key value proposition of cloud. You do not need to spend admin time trying to manage your machines, right? Because it's just managed automatically, on its own, using software, using automation. Now, by extending that logic, they really addressed this first pain point of Hadoop, which is this technological beast. Now, you could argue that the cloud,
for some providers, is also extremely technologically complex: there are so many technologies, so many ways of putting things together. But the good thing is that cloud also provides the automation on top of it, right? So you can just fire up readily available automation scripts and then just spin up this mass of technologies, and spin it down as well, keeping control of cost and keeping control of the complexity.
So they extended that paradigm to Hadoop, and that solved that particular problem of, you know, how do I get this entire thing up and running? So now, in the cloud, right, it's as simple as just provisioning using a button click. Of course. Yeah, you get all of that. So that's a big thing. The second thing is the fact that Hadoop had this problem of not being optimized for any specific access pattern.
I mean, the way cloud solved it is by its fundamental design. The cloud's fundamental design is that everything is decoupled: I'm going to do something that's very specialized by myself, and for everything else I defer to somebody else who's better than me, right? It's also called microservices, or service-oriented architecture. Yep. Cloud is fundamentally built on that.
Right? So everything is a service, and if one service needs to do something which is not its core capability, it will outsource it, or delegate it, to some other service to get it done. And then once that service gets it done, it brings the result back in. So, as an example, if I were to now construct a big data pipeline in the cloud: storage is a dedicated service. Yes. Being able to run a data science model is a dedicated service, like SageMaker.
That's an example, right? Now, once the model is run, I need the output to be visualized as a pretty chart. That's a separate service altogether, right? And the way that all these things need to talk to each other, the orchestration, that's a separate service. And for all the things to work correctly, in terms of failures and things like that, the alerting and logging is a separate service, right? Yep. So in putting it all together,
there is automation. The automation will say, okay, all these things need to talk to each other, and if I need to do something very specialized, I just need to spin up a service and add it to my automation, and I'm done. So access to optimized ways of looking at data in the cloud is as simple as just configuration. Yep. And I don't need to deal with this plethora of open source software where I don't know who's going to support it or not going to support it; that
supportability is all now taken care of, right? So cloud solved for that in a way that is fundamental to the way the cloud is designed. And by cloud, I guess you mean companies like Amazon, which provide these big hosted clouds. So let's say AWS manages this thing: they provide SageMaker, which you mentioned, EC2, S3, all those things, and they provide you the connections
between them. You can spin up whatever you want at any point in time, and they manage the whole thing. So that's the value there. Okay. I think we're now tying back all our discussion, right, in terms of how you organize the data, what big data is, what are the pros and
¶ How should a startup architect its data team (no pun intended)?
cons of Hadoop, and how the technology has evolved. Now suppose, let's say, there's this new startup, okay, which currently doesn't have too much data, but you know that you're going to be collecting tons of data and
things. How do you go about architecting your entire data team, for lack of a better phrase, so that it's geared up for growth, which includes things like how you store your data, how you organize your databases and all those things? How should a company that's starting off now look at it? Yeah, I think the first choice would be just to decide: the cloud or
not, right? Yep. I think that's the first question to answer. In my view, the cloud is a no-brainer. I don't think any startup would be wise to not consider the cloud, given the innovations that are happening there. Of course. Well, let's assume the answer to that is yes, right? Yes, cloud. Now, if cloud is a must, the good part is the architectural patterns that are applicable for constructing a
data estate on the cloud, right, whether you call it a data lake, a data platform, whatever it is. Yeah. The architectural patterns are designed in such a way that it doesn't matter whether you're small or large. Okay. You start with the same architectural pattern, and as you grow as a company, the execution of the architecture can just seamlessly grow along with you, without having to change anything significantly. Yeah, and that's the
biggest benefit of the cloud, because each of these components is individually, elastically scalable, right? So as an example, as a new company, you're starting to collect data. In fact, I would say data is one of the critical moats, right, for most startups. Yep. Because you need to build that moat over a period of time, and the best way to build a moat is actually around the data lake, right? Yes. Okay. Create a data lake, just collect everything.
It's dirt cheap. It doesn't matter what you put in. Obviously, the more curation you can do, the better it is for you in the long run, and the simplest form of curation that you can do is just to track where the data is coming from, right? You get what is called traceability: you want to be able to trace back, five years from now, that this data was actually collected by that version of the software that I deployed on that particular
machine. Yep. If you're just able to reason about that, that's more than enough, right? So the data lake is a very good foundational pattern. Start with that, and then don't spin up technologies until and unless you need them, right? Yeah. I think that's a good mantra, and that mantra holds true for technology and for data as well.
Right? You don't need to over-engineer with a warehouse or a data mart and things like that. Maybe to start off with, you just have a very simple open source database like Postgres or MySQL, and that will get you off the ground and keep you running for two or three years without incurring much cost at all.
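The traceability idea mentioned a moment ago, knowing which version of the software on which machine collected a record, can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`source_app`, `app_version`, `host`, `collected_at`), the envelope layout and the helper `wrap_with_provenance` are hypothetical and were not described in the conversation.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    source_app: str    # hypothetical: name of the application that produced the data
    app_version: str   # version of the software that collected it
    host: str          # machine the software was deployed on
    collected_at: str  # ISO-8601 timestamp in UTC


def wrap_with_provenance(payload, source_app, app_version, host, now=None):
    """Return one JSON line holding the raw payload plus where it came from."""
    now = now or datetime.now(timezone.utc)
    prov = Provenance(source_app, app_version, host, now.isoformat())
    return json.dumps({"provenance": asdict(prov), "payload": payload}, sort_keys=True)


# Example: one event as it might be appended to a file in the lake.
print(wrap_with_provenance({"order_id": 1}, "checkout", "2.3.1", "host-17"))
```

Writing the envelope at collection time keeps this cheap: nothing about the payload is interpreted, so no function's view of the data is imposed, yet each record can still be traced back years later.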
Yep. And then the more you play with it, you realize that the same database serving the needs of both your source application as well as your analytics is kind of creating a bottleneck. So at that point you start to say, okay, maybe I'll spin up something else, using my data lake as the source, not my Postgres anymore: a brand-new service, which on Amazon could be Redshift, as an example. Right. Now let me plug in my BI on top of it.
I start to do analytics. The moment I do this, I have decoupled the source application from my analytics. And then you start to build up on top of it. Now see, the other thing to keep in mind is that most data-driven businesses ultimately want to take the output of analytics and tie it back to the source application.
As an example, if I want to influence a customer who's actually using my application right now, obviously I'm collecting all the data, but then the analytics output has to feed back, as a recommendation, as an example,
right? So that ability to close the loop back to the source application is again a critical capability, and that's something that you want to do from day one. At first it might not actually have any intelligence, it could just be rule-based, right? Yes. Again, as you mature, as you change these data pipelines, that becomes a lot more sophisticated, but the looping mechanism allows you the
feedback from day one, so that, you know, that feedback is also something you can keep learning on. Thank you for listening to Data Chatter. If you liked this show, please leave a comment, share and subscribe to the podcast. You can find this podcast on Apple Podcasts, Spotify or wherever else you go to get your podcasts. Once again, thank you.
