Scaling Data Operations With Platform Engineering

May 29, 2025 • 42 min • Ep. 466

Episode description

Summary
In this episode of the Data Engineering Podcast, Chakravarthy Kotaru talks about scaling data operations through standardized platform offerings. From his roots as an Oracle developer to leading the data platform at a major online travel company, Chakravarthy shares insights on managing diverse database technologies and providing databases as a service to streamline operations. He explains how his team has transitioned from DevOps to a platform engineering approach, centralizing expertise and automating repetitive tasks with AWS Service Catalog. Join them as they discuss the challenges of migrating legacy systems, integrating AI and ML for automation, and the importance of organizational buy-in in driving data platform success.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
  • Your host is Tobias Macey and today I'm interviewing Chakri Kotaru about scaling successful data operations through standardized platform offerings
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the different ways that you have seen teams you work with fail due to lack of structure and opinionated design?
  • Why NoSQL?
  • Pairing different styles of NoSQL for different problems
  • Useful patterns for each NoSQL style (document, column family, graph, etc.)
  • Challenges in platform automation and scaling edge cases
  • What challenges do you anticipate from the new pressures introduced by AI applications?
  • What are the most interesting, innovative, or unexpected ways that you have seen platform engineering practices applied to data systems?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform engineering?
  • When is NoSQL the wrong choice?
  • What do you have planned for the future of platform principles for enabling data teams/data applications?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey, and today I'm interviewing Chakravarthy Kotaru about scaling successful data operations through standardized platform offerings and being able to provide databases as a service at scale. So, Chakravarthy, can you start by introducing yourself?

Sure. Thanks for having me. I'm Chakravarthy Kotaru. I work as a director of data platform for a leading online travel company, and I have nearly two decades of experience with scalable architectures, especially in data stores, data governance, and security. I have spent a big part of my career focusing on building data platforms in public and private clouds. And do you remember how you first got started working in data?

It was right out of the recession. I applied for, like, 1,200 internships. I got two. One was from Walt Disney World in Florida. That was a really great experience, and I started as an Oracle developer. After that, I joined a major insurance company as a database developer, but then the project got scrapped. And my manager at that time gave me an option: you learn operations,

database administration, or go find another job. So I was like, okay, I'll learn whatever it takes. So I started picking up different NoSQL databases. At that time, I specifically remember, I started with Riak, whose company went bankrupt. It's similar to Cassandra, but that's what gave me the opportunity to explore different NoSQL databases in addition to relational databases. And that's where we started building a data platform in the private cloud.

In terms of the idea of databases as a service, data platforms, obviously, a lot of the cloud providers have that as one of their offerings with generally the focus being, I want a database to be able to use for whatever application I'm building.

I'm wondering if you can just start by giving some overview about some of the ways that those database as a service offerings that are part of that core offering from the cloud providers are maybe insufficient, and some of the challenges that you ran into as the person responsible for that data layer and how that led to different failure modes for the teams that you were supporting.

Yes. So our philosophy is mostly: we don't want to select a few databases and have use cases designed around them. We want to take a specific use case that solves a problem and figure out what is the right database for it. We don't want to put everything in relational, or put everything in DynamoDB, or put everything in a select set of databases that you are comfortable with.

But if you are doing a search, if you are doing full-text search, we want to go and get the best search engine for that, which is Elasticsearch. So, specifically targeted databases for specific applications: that will give you the best overall return on your investment and the best experience for the use case. So that's one reason why you use multiple databases. Another reason is, obviously, big companies and

mergers and acquisitions. Different companies use different things, and when they come together, it's not straightforward to change everything overnight. In terms of the different options for database types and database use cases, you mentioned things like Elasticsearch and DynamoDB. Obviously, there are various relational databases that are on offer.

And in terms of the particular data use cases that you were supporting, I'm wondering if you can give a bit of an overview and some of the reasons why you think these various NoSQL offerings were the most relevant and effective options for the problems that you were looking to support. Right. So the major ones that we support are relational. We have a lot of SQL Server that we are trying to migrate to a cloud-native database,

but that is traditional. You know, ten years back, mostly everything was relational, so we have a lot of tech debt on it. Mostly transactional databases for payments and bookings, those kinds of things. And as we break those monoliths and create more microservices and transition to the cloud, we are exploring what can be moved to MongoDB so that we have better decoupling of the schema and more flexibility for developers.

At the same time, for cache use cases we lean heavily on Redis, which we are actively working to migrate to more cloud-native alternatives, especially with all the license changes going on across the open source community. And for low-latency writes, anytime we are working on use cases that really require low-latency writes, we prefer Cassandra or ScyllaDB, a key-value

database like that. So there are different things to keep in mind when we are selecting these databases. Obviously, the CAP theorem: what is important to you? Is it consistency, availability, or partition tolerance? Based on that, we pick a database that best fits the use case.
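To make that selection rubric concrete, here is a minimal, purely illustrative sketch of the "use case first, database second" approach described above. The engine picks paraphrase the conversation; the trait names and priority order are assumptions for this sketch, not the team's actual tooling.

```python
# Illustrative only: a toy rubric for "pick the use case first, then the
# database". Engine choices mirror the conversation; trait names and
# priority order are assumptions.

def suggest_engine(workload: dict) -> str:
    """Map coarse workload traits to a database family."""
    if workload.get("full_text_search"):
        return "Elasticsearch"                      # best-of-breed search
    if workload.get("needs_acid"):
        return "Relational (e.g. Aurora/Postgres)"  # payments, bookings
    if workload.get("cache"):
        return "Redis or a cloud-native cache"
    if workload.get("low_latency_writes"):
        return "Cassandra / ScyllaDB"               # availability-leaning
    if workload.get("flexible_schema"):
        return "MongoDB"                            # decoupled schema
    return "consult the platform team"

print(suggest_engine({"low_latency_writes": True}))  # Cassandra / ScyllaDB
```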

And in terms of the scale of challenge that you're supporting in your current role, I'm wondering if you can talk to some of the ways that the existing platform offerings were starting to run into challenges or some of the edge cases that you dealt with or some of the ways that the teams were maybe not using those systems to their best effect and the ways that you've thought about how to make that more of a standardized offering so that it was easy.

Yes. So it's multiple things. The platform offering is partly for scale and quicker adoption, and partly for governance, making sure all the best practices are followed. When we initially started, we were supporting one org, one section of the company. When we started looking into databases, a few things were happening. With the DevOps culture, everybody started building their own MongoDB. But when developers

want to set it up, they set it up with the intention to quickly develop their code and go to market. They don't pay too much attention to whether the database has the right parameters. Does it have authentication? Does it have SSL? All of the very deep nitty-gritty details of the databases. So when we started looking into it, we found things like databases that don't have basic authentication, or don't use the right parameters

for their specific access patterns. So what ends up happening is either, six months later, they come and say MongoDB is not the right choice because they're not getting the performance they need, or they throw infrastructure at it and 10x the box to mask the performance issues, which increases the cost. So to solve that problem, because it was one organization, we started using infrastructure as code, Terraform,

and EC2 and all of that, and came up with a basic platform which creates all of these databases. It was a simple implementation of a data platform, because it was only four different AWS accounts. We had a Terraform template for MongoDB, a Terraform template for Cassandra.

And once the developers use that, it creates that specific cluster. For Cassandra, for example, it creates a six-node cluster with all the right DC configurations and everything. So it shields the developers from all the nitty-gritty details of the databases. That itself is a value add right away, which saves their time. And at the same time, it was able to give a lot of governance

benefits for the company, because now I have authentication as the default, and multi-DC as the default for disaster recovery and things like that. At that scale, it was good. It worked. But then at some point the company started consolidating all of the platforms. So we went from one org, which was four AWS accounts, to almost 400 AWS accounts, which just blew out of proportion. And we realized that the current solution wouldn't work, so we started looking into a data platform that could work at scale. That's when we started looking into Service Catalog and different cloud providers' offerings.

And with Service Catalog, we used that hub-and-spoke model, and we were able to solve that scale issue. Now it doesn't matter if it's 400 accounts or 4,000 accounts, or whether it's a hundred different database clusters or 8,000 clusters. We were able to centrally deploy, manage, and monitor all of them at scale. In terms of the teams that you were working with, what were some of the characteristics

of the other engineers and their level of familiarity with these data layers, and the level of attention that they wanted to pay to that aspect? Because I know from working with teams of various compositions,

developers, they just wanna say, just give me something that I could throw data at. I don't wanna have to care about the rest of it. Whereas if you're dealing with data engineers, they're going to be much more hands on about selecting the actual storage layer technologies and the ways that they're interoperating.

We prefer the second one, because as part of this platform building, there is also a consultancy service. Right? We like to be involved from day one, when you have a use case. The data side folks get involved so that we can figure out what is the best database technology. Because there were instances where a POC started, and

before you realized it, the POC was productionized. And by the time you figure out that this is not the right technology, maybe you could move from Redis to some other database, it's already too late, because you are committed to releasing a product and you go to production with it. And just by changing the backend database, in some cases, we were able to save, like, $2,000,000.

So it's very important to get the right database technology. Right? As far as what kind of mix we have: considering we're a big

online travel company, we have almost all kinds of teams. Some teams are really good at what they do, and they have been managing some of these databases for a long time, so they know all the intricacies. They are like, hey, we don't need to onboard to the data platform; we are good managing it ourselves. We respect that. The whole point of the data platform is not to impose something on everybody; it's to help unblock them and move them quickly

to reach their goal. But the majority of the company, like 90 to 95%, they were like, wow, this is great. I don't have to worry about infrastructure. I don't have to worry about 24x7 support or performance tuning. You take care of that; I am happy writing code. And that majority,

that's what I have seen across the industry as well. Whenever I go and present on these data platform topics, most of the developer community is happy offloading that to somebody else so they can focus on building the next shiny thing.

And so in terms of the options and offerings that you were building out to support those different styles of team, I'm wondering how you thought about the base case of I just wanna be able to throw up a database and be able to start working with it and then having levels of complexity that you're able to expose for the teams that wanted to have more control and fine tuning of the size or scale or throughput of the different data layers?

Yes. So, obviously, one size doesn't fit all. Whenever we provide these options, there are multiple parameters. There is an intake form where they can write down their latency expectations, their expected throughput, the data size, and how it will grow in the next two years, three years, five years. All of this information is collected.

Based on that, letting them configure everything doesn't make sense; it adds a lot of complexity on their end. So we have different sizes: a small database, a mid-size database, or a large database. Based on the information they have entered, we output that,

okay, based on your information, you can go with a medium-sized infrastructure, which might be a six-node Cassandra cluster. If it's a large size, it can be a 13-node Cassandra cluster. So abstracting all of the technical details also helps them. They say, hey, this is my requirement, tell me how big of a cluster I need; they go and select that, and it deploys that big of a cluster.
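As a rough sketch of the t-shirt sizing just described: the intake form answers map to a pre-approved cluster shape. The six-node medium and 13-node large Cassandra shapes come from the conversation; the field names, thresholds, and small tier are assumptions for illustration.

```python
# Hypothetical tier table; only the medium/large Cassandra node counts
# come from the conversation.
CASSANDRA_TIERS = {
    "small":  {"nodes": 3},
    "medium": {"nodes": 6},
    "large":  {"nodes": 13},
}

def recommend_tier(expected_qps: int, data_size_gb: int) -> str:
    """Map intake-form answers to a tier (thresholds are made up)."""
    if expected_qps > 50_000 or data_size_gb > 5_000:
        return "large"
    if expected_qps > 5_000 or data_size_gb > 500:
        return "medium"
    return "small"

tier = recommend_tier(expected_qps=12_000, data_size_gb=800)
print(tier, CASSANDRA_TIERS[tier])  # -> medium {'nodes': 6}
```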

And then another aspect of providing these systems as a service is integrating into the standard workflow and tooling that the teams you're supporting are already using. You mentioned that you were using Terraform, at least initially, for provisioning these setups. Not every team necessarily wants to get up to speed with Terraform. They just wanna be able to say, just click a button, the CI runs, and everything's great.

And I'm wondering how you thought about managing the interfaces that you're exposing to these different teams to be able to provision the resources that they need for their use cases. So they don't necessarily have to write Terraform. That's the beauty of it. Right? Earlier, they were writing Terraform to create a Cassandra cluster and all of that. Now we write the Terraform.

They have a JSON call with a set number of parameters, like the cluster size, the kind of database technology, and the specific parameters they want. So even writing Terraform is abstracted away. And when we moved to the bigger phase two of the platform, where we are using Service Catalog and CloudFormation,

they don't even have to write any code. They have two options. They can go to the Service Catalog UI and select: I need a Cassandra database, six nodes, and these are the parameters I want to tweak. Or, since we built an API on top of Service Catalog, they can just call the API and integrate it within their workflow. Anytime they are using

repetitive testing environments where they build and destroy clusters, they can put it as part of their code: build the cluster, test the code, and tear it down or productionize it. If it is a production cluster, they call it one time, and that infrastructure is available.
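For a sense of what that API-driven flow can look like, here is a minimal sketch of provisioning and tearing down a cluster through AWS Service Catalog with boto3. The product name, version label, and parameter keys are hypothetical; the Service Catalog operations themselves (provision_product, terminate_provisioned_product) are real boto3 calls.

```python
import uuid

import boto3

sc = boto3.client("servicecatalog")

# Provision an ephemeral test cluster from a catalog product.
# "cassandra-cluster", "v1", and the parameter keys are hypothetical.
resp = sc.provision_product(
    ProductName="cassandra-cluster",
    ProvisioningArtifactName="v1",
    ProvisionedProductName=f"ci-test-{uuid.uuid4().hex[:8]}",
    ProvisioningParameters=[
        {"Key": "ClusterSize", "Value": "6"},
        {"Key": "Environment", "Value": "test"},
    ],
)

# ... run the test suite against the cluster ...

# Tear the cluster down when the pipeline finishes.
sc.terminate_provisioned_product(
    ProvisionedProductName=resp["RecordDetail"]["ProvisionedProductName"]
)
```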

Obviously, in order to be able to provide these different database engines as a service, it requires a decent amount of familiarity with their operational characteristics, the scaling considerations, the ways that you need to manage orchestration of the nodes entering or leaving the cluster, and fine-tuning of the throughput. And that requires a lot of effort and time investment, and usually

a lot of errors that come up as a result. I'm wondering if you can talk through some of the ways that you helped to build that operational familiarity and comfort with the different engines, and the ways that you selected the engines that you wanted to support as part of that core offering. Yes. So, again, when we are building the platform, the advantage is: let's say the company has 20 different teams and they need at least 10 different Cassandra clusters. If they're running Cassandra, they need 10 different DBAs embedded in their own teams. That's the typical DevOps model. Now if we bring them

all to the platform team, we only need two or three who are experts. They set up everything, the configuration and all of that, and everybody else uses it. So we don't need an expert embedded in every team; the platform team abstracts all of that. And once you have that, best practices are defined: this is how you need to create secondary indexes, or this is how you need to build your access patterns. They're all defined. And,

as I said, when we are involved from day one, let's say there is a use case on day one, we can help make sure that they don't make costly mistakes that will hurt later, and adhere to the right access patterns and best practices. And that helps with successful use case deployment. And so, talking a bit more about the implementation of the platform management, you mentioned that you were using AWS Service Catalog.

How did that help enable the scaling of your platform offering, and what were some of the other technologies that you maybe tried and failed with before you got to that solution? Yes. So, initially, as I said, we were using Terraform to manage four different accounts. But with 400 accounts, that was not an option. So we tried a few things, like

entirely coding a different platform with the hub-and-spoke model and things like that. But when we started doing the research, we saw that this was a readily available option. Service Catalog is not just for data platforms; for any product, if you have infrastructure as code that you want to deploy, you can use Service Catalog. It can be an EC2 instance, a network configuration, or a database. So when we started researching,

we came to know about Service Catalog. So we did a quick POC, and it just took off. When we saw that we were able to manage 400 different accounts and create database clusters within those accounts from a central place, with one set of templates, that gave us the power to quickly onboard all of these brands to one central platform and manage them. So yeah. This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust?

Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of undiagnosed data quality syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds.

And with collaborative data contracts, engineers and business can finally agree on what done looks like, so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you.

Side effects of implementing Soda may include increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1,000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week, which starts on June 9th.

And then another element of offering databases as a service is the consistency that you're offering in terms of the setup, the scalability, but also there's the security model and the requirements that exist around different types of data that you're working with, the ways that data can be propagated and moved between different systems.

And I'm curious how you worked through some of those elements of governance, setting expectations and requirements with the numerous teams that you were supporting, and the organizational buy-in that you had to get as you started implementing those various controls and constraints? Yes. So we were focused on the data platform. For data governance aspects like GDPR or data scrubbing, we have a separate org just focusing on that.

But basic data security, like authentication, SSL in transit, and encryption at rest, we can control that now, because we are not leaving it to the developer to make sure the data is secured. Once it is in the platform, all of these are guaranteed. They are all free. Encryption at rest, encryption in transit, authentication, logging, audit: all of that is enabled, and it's all part of the best practices.
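Those defaults are enforced by the platform's templates, but a belt-and-braces audit is easy to sketch. The following is an illustration, not the team's tooling, of checking one of the guarantees (encryption at rest) across RDS instances with boto3:

```python
import boto3

rds = boto3.client("rds")

# Flag any RDS instance that slipped through without encryption at rest.
for page in rds.get_paginator("describe_db_instances").paginate():
    for db in page["DBInstances"]:
        if not db.get("StorageEncrypted", False):
            print(f"unencrypted at rest: {db['DBInstanceIdentifier']}")
```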

And in working with the different teams, beyond just the core security elements, what were some of the challenges that they ran into, maybe limitations or various security constraints that you were enforcing, and some of the ways that you worked to familiarize them with and document the standard practices and expectations that they would need to conform to in order to use the platform that you are offering? Yeah.

So, obviously, there is friction. When you are already running a database and we go and ask you to onboard to this platform, if it is a like-to-like onboarding, where you are using one particular database and you're just moving to the same database on the platform, it's pretty straightforward: you join the new nodes to that cluster, stream the data, and remove the old infrastructure. But at the same time, we took the opportunity to modernize and

update some of the database choices as well, like moving from SQL Server to Aurora or to SQL Server on Linux, or sometimes MongoDB to DocumentDB, for various reasons. That requires a lot of developer buy-in, because some form of rewriting the code is involved. But,

again, if these moves are aligned to your organizational goals, we can get enough traction, and we will be able to move. The biggest challenge we have faced, I would say, is developers worrying about losing control. Today, if I have a performance issue, I can double the box and I don't have to talk to anybody. But

if you are moving to the platform, there is an additional layer. That was a bigger culture change that we had to bring within the company: hey, we are not somebody stopping you from doing good stuff; we are trying to accelerate you in doing that good stuff. The way I tell everybody is: we are just an extension of your team, taking care of one specific aspect of it, so you don't have to worry about it.

That really helped us, along with a lot of communication, collaboration, and the team culture aspect of it. A lot of times, platforms create silos without even realizing it. But if you implement it right, with great collaboration, building it in the interest of accelerating developers, it really works.

And then from that migration perspective where maybe you have a team that's using an existing database and they need to move to a different engine or even if it's the same engine, obviously, there are various uptime considerations and various challenges that exist in any migration, particularly if you're moving from an unconstrained or an unopinionated approach to a more opinionated and constrained system.

And I'm curious how you worked through some of those challenges of doing that data migration, particularly with considerations around uptime. That has still been a big challenge. The reason we still run a lot of legacy relational SQL stuff is exactly that: how do we migrate to more modern databases without downtime?

And also, how do we refactor all the code? There is code that is 15 years old. Nobody wants to touch it, honestly. Whoever wrote that code is gone. The people who are here are happy as long as it runs, and they don't know what to do if it breaks. So that is still

a challenge that is holding us back on migrating the last portion of some of this stuff to the platform. But mostly, it's working with the NOC team and setting up schedules. We have monthly maintenance windows where we control how much downtime we take. We make sure that we do everything

in steps, and most of the time, it's just a flip. The cluster will run in an expanded mode for a long time, and when we are ready, we take the two minutes of downtime and flip it. But most of the time, the NoSQL databases like Cassandra and MongoDB give us the flexibility to do all of this migration without any downtime. In terms of the workloads, and you mentioned the relational engines,

there are definitely a number of legitimate uses for relational systems. They're not necessarily always adaptable to a NoSQL use case. I'm wondering what your overall approach is as far as the relevance and utility of NoSQL compared to relational engines, and maybe some of the challenges of scale that you're experiencing with those relational engines that might motivate you to do something that is more of a NoSQL flavor?

So, again, as I said, we are not coming in with a preconceived idea of: this is the database you need to use. There are still very valid use cases for relational, and a lot of our workloads still run on Aurora and Postgres and MySQL, all of that. What we are trying to do is migrate from SQL Server and Oracle kinds of on-prem setups to more cloud-native relational databases. So we will continue to use relational databases heavily, but in more cloud-native relational

setups. We don't want to move every use case to NoSQL. Transactions, bookings, they're all critical; they need strict ACID properties and things like that. So we'll continue to run those on relational databases. What we want to do is use the specific tool for the specific use case.

Also, how do you run it at the lowest cost? If there is an open source database out there which can solve my use case without paying any license cost, we want to use that instead of paying heavy license or overhead costs around a legacy database.

So one of the other things that you mentioned earlier was the shift from a DevOps style approach to more of a platform engineering approach where, to begin with, you had these embedded experts who worked with all the different teams, but maybe didn't have as distributed of a set of knowledge around the different database engines, and now you've consolidated a lot of that expertise

into your platform team with a focus on the data layer. And I'm wondering how some of the lessons learned in that have maybe translated to other operational elements of your overall engineering stack, to invest more in a platform versus DevOps style approach, and some of the ways that those DevOps style resources have either shifted focus or some of the other types of work that they're still engaged in within those teams in an embedded fashion?

Yeah. So one thing is, platform engineering is not a replacement for DevOps. It's an enhancement of DevOps. Instead of the DevOps team taking care of a hundred different things,

they now take care of ten different things and let the platform teams worry about the rest. So when we brought all of these data and database engineers to our central team and built the platform, that really worked. It freed up a lot of time for the DevOps people, because let's say you are a Java engineer or a CI/CD engineer: you may not know all aspects of networking or databases or how to set up ten different services in AWS.

That would fill them with the fatigue of learning a hundred things to make one thing work. Centralizing that frees up a lot of their time so they can focus on what they are good at. The lesson learned is, as I said, this is not one size fits all. We don't want to build a platform

and enforce it on every organization in the company. The way we want to do it is the reverse: go and understand the current challenges, then try to fix them using the platform. That's the biggest lesson learned. Anytime we are building other platforms, or even if I build a platform in the future,

the fundamental concept is: you don't start a platform team to build your own fort or your own org. You identify a challenge in the organization and help solve it, not try to get 20 more direct reports added to you.

Another aspect of having that platform team and that more centralized capability, as you mentioned, is that it can be more scalable and more efficient, but it also requires organizational support to invest in that dedicated resource, versus having everybody be more generalized with the expectation that everybody

manages their own requirements. And I'm wondering how you've seen some of the organizational investment and some of the ways that you've had to work with the broader organization to help promote the utility of having that centralized investment and centralized resource?

Yeah. Definitely, you have to start small. You can't just hire 20 different people and say, I'm starting a platform team. You start small and show the value. That's what we did. We did it in one org, where we started this platform concept and showed the value in terms of monetary benefit and in terms of quality. If you have a generalist in 20 different teams versus two or three specialists,

that adds a lot of value, especially when you are having an incident. When you're running open source databases and you have an incident, with a generalist the probability of quickly solving it is lower than when you have an expert. So when we

ran that small setup for almost eighteen months and showed the company the value add, that's when we started onboarding different organizations and different brands and all of that. It took almost two to two and a half years to onboard the whole company, because they were seeing value. One dev team would start using the platform and go, okay, these are all the values I'm getting. Obviously, they all talk in the dev communities, and that's how, by word of mouth,

it grew. We never enforced it in the company. That's the beauty. We never went and said, you need to onboard to the platform by such and such a date. We said, this is available; if you onboard, these are your benefits. And that's how it has grown organically. In your experience of building that capacity, working with the engineering teams, and helping to promote your approach in various conference presentations and conversations with other organizations,

what are some of the most interesting or innovative or unexpected ways that you've seen that platform capacity applied, either within your own team and organization or as it translates to other teams and companies?

I initially thought this multi-account problem was specific to us because of the huge number of brands and accounts that we had. But once I started going out and talking at various conferences, I found a lot of large companies have the same problem, because at one point AWS pushes you to add more and more accounts as you start hitting limitations: IP limitations, resource limitations, and things like that. At least now they have increased their

caps a lot, but five or six years back, we ran into IP limitations in a lot of accounts. We were not able to provision new EC2 instances and things like that. So that's when they said, hey, build a new account; if you have a new team, build a new account. That's how we ended up with hundreds of accounts. And a lot of major companies that I spoke to have the same problem, and they started looking into a similar hub-and-spoke model and

centrally managing infrastructure, which has given them a lot of good results as well. And in your own work of building out this capacity, investing in the engineering effort and the management effort required to bring the company along and see engineering success from it, what are some of the most interesting or unexpected or challenging lessons that you learned personally? I would say,

you know, we should always focus on automating things. This was true five years back, but this is more true with AI and ML, you know, catching up now. But, the biggest

advantage that we saw from the operational aspect is, once we had this platform, we asked everybody on the team: if you are repeating a task manually more than three or four times, what can we do to automate it? Building automation around a lot of things has really helped us.

If I can take a few minutes to explain this, there are multiple things. Initially, we had this infrastructure, which was good, and we had monitoring and everything set up. But my engineers were handling at least four or five requests every week for scaling clusters.

Then we asked: how can we automate that? So we started building scalable infrastructure where, at the click of a button, I can go from six nodes to 16 nodes and come back when the marketing events and things like that are done, instead of doing things manually. For everything we were doing every week, we were thinking: what can we automate this week? What can we automate next week? That way, we automated scaling. We automated

incident resolution. That is one of the great things that we did. Let's say you are getting disk space alerts. If you page me at 2:00 in the night saying the disk is filling up, what will we do? We'll go and expand the EBS volume. So we thought, why can't we automate that? We used AWS EventBridge and other services and built automation around these incidents. If I get a page,

the bot gets the page first, and the bot sees: okay, this is a pattern of page that I can automate. Then, based on the automation scripts that we provide, it goes and fixes that page, like expanding the EBS volume.

Or if a node is down, say a Cassandra node in a 60-node cluster is down for whatever reason, it can immediately bring it back up, watch it for five minutes, and if everything is okay, automatically close the page instead of paging a real person. By implementing this automation mindset for everything from incident resolution to scaling to all aspects of management,

we were able to save a lot. We reduced our pages by 40%: today, at least 40% of our pages are handled by bots instead of a real DBA. Obviously, we'll have a report the next day and see, okay, these are the pages the bot handled; what can we do to enhance it further? Traditionally, DBAs were setting up everything

manually, with a traditional mindset. But right now, I think even DBAs need to focus on learning Python, on learning different AI and ML trends, and on analyzing what they can automate to reduce the manual workload. That is the biggest thing that I have learned over the years: automation first.
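As a minimal sketch of the disk-space remediation described above: a Lambda-style handler that the bot could invoke on a "disk filling up" page, growing the EBS volume instead of waking a DBA. The event shape and growth policy are assumptions; describe_volumes and modify_volume are real boto3 calls.

```python
import boto3

ec2 = boto3.client("ec2")

def handle_disk_alert(event, context=None):
    """Grow an EBS volume in response to a disk-space page."""
    volume_id = event["volume_id"]  # assumed field in the page payload
    vol = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    new_size = int(vol["Size"] * 1.5)  # assumed growth policy (GiB)

    # EBS volumes can be grown (not shrunk) online; the guest filesystem
    # still needs extending afterwards (e.g. growpart + resize2fs), which
    # the bot's follow-up scripts would handle.
    ec2.modify_volume(VolumeId=volume_id, Size=new_size)
    return {"volume": volume_id, "resized_to_gb": new_size}
```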

And also knowing when automation is not feasible and you need to just get something done, because there's the famous xkcd comic about the amount of time that you think it's going to take to automate something versus the amount of time it would take to just do it manually, and they're not always in line with each other.

Yeah, definitely. That's why, every week, we ask: are there repetitive tasks you are doing more than three times? If it is a one-off thing, there's no point. But if you think, okay, this is something that I will use four more times, definitely build a script around it.

And so for people who are looking to invest in their own platform capacity, or who are looking to invest more in non-relational engines, what are the cases where you would advise against either or both of those approaches? If you're a small company with a set number of teams, I don't think investing in a platform helps. Just have your DevOps engineers focus on best practices and things like that.

Also, if you're not building infrastructure every day. Let's say I set up infrastructure for the next six months, I build my use case, I go to market, and it's sustainable. That's okay; you don't need to build a platform. But if you're constantly creating infrastructure and scaling and

evolving, and you're a large or at least a midsize company, I would say you have to invest in a platform mindset. But if you're a very small company, a startup, I think it's overhead. And

I think the same thing applies to the NoSQL technologies. The advantage you have if you are starting something new is that you have the option to go and pick the best thing for it. But if you are a legacy company, still carrying a lot of tech debt on old legacy stuff, you obviously need to do a return on investment analysis of what it takes to migrate and whether it adds any additional value.

And as you continue to invest in these automation capabilities, managing the scale and variety of offerings that you are supporting, what are some of the things you have planned for the near to medium term, or any new technologies or capabilities that you're looking to invest in? Right now, we are doing a lot of AI and ML work around data infrastructure as well, where, if we have an incident, it goes and scrapes all the previous incidents,

comes up with an action plan, and tells me: this is your problem, it happened on this day, this is what you did, quickly fix it. That's one example. Another is automated scaling. Right now, let's say we have a marketing event: we go and say, expand the cluster from six to 60 nodes. The process of expanding or scaling the cluster is fully automated, but the decision to scale is still manual. So we want to use AI and ML to make the decision as well.

Whenever it sees particular trends, it can auto-scale and come back to the original capacity later. In addition to the data infrastructure, we are also investing a lot in expanding this to the app layer: you have a use case, you need ten app servers or ten data infrastructure servers and all aspects of it, and we want to tie everything into a pod instead of covering just the data portion of it.
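A toy sketch of moving the scaling decision itself into code, per the plan above: watch a traffic metric and return the node count the cluster should run at. The metric source, thresholds, and the policy itself (beyond the six and 60 node sizes mentioned) are all assumptions.

```python
from statistics import mean

BASELINE_NODES, BURST_NODES = 6, 60  # sizes mentioned in the conversation

def decide_capacity(recent_qps: list[float], current_nodes: int) -> int:
    """Return the node count the cluster should run at (toy policy)."""
    sustained = mean(recent_qps[-10:])  # average of the last 10 samples
    if sustained > 40_000 and current_nodes == BASELINE_NODES:
        return BURST_NODES       # scale out ahead of the sustained surge
    if sustained < 10_000 and current_nodes == BURST_NODES:
        return BASELINE_NODES    # contract back once traffic subsides
    return current_nodes
```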

Are there any other aspects of the work that you're doing, the overall approach of platform engineering for building these database-as-a-service capabilities, or the specific engineering challenges that you've been tackling in the process of building out that capacity, that we didn't discuss yet that you'd like to cover before we close out the show? A few things. I want to emphasize again that platform engineering is not a replacement for DevOps.

It's mostly there to enhance and amplify the DevOps culture. And if done right, it gives a lot of self-service capabilities to the developers. It reduces their complexity and standardizes a lot of tools and workflows within the company. So it's a great investment if you're a midsize or large company.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think there is a difficulty with real-time enforcement. As data moves fast, especially with streaming,

the ability to govern and manage it in real time becomes really crucial. Many traditional governance and management tools are not built for that speed. So for me, real-time enforcement of data governance on the streaming side is still a big gap.

Alright. Well, thank you very much for taking the time today to join me and share the work that you've done and your experiences of building out this platform capability for databases as a service at your organization and some of the ways that you've addressed those challenges of scale and organizational

buy-in. It's definitely a very interesting problem space and definitely an important one as you scale your capabilities and the organizational complexity. So I appreciate the time and energy you're putting into that and your sharing your insights, and I hope you enjoy the rest of your day. The pleasure is mine. Thanks. Thank you for listening, and don't forget to check out our other shows.

Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email [email protected]

with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
