I guess what I'll kind of call out here is it's hard to productize generative AI. It's not a trivial matter, right? And I think a lot of us, Monte Carlo included, started from the block and tackle stuff. Let's take stuff that generative AI is pretty good at, and maybe code generation is something that generative AI does fairly well out of the box, and let's plug it into our applications where we need to perform those tasks, right?
And we'll use the foundation models that OpenAI and others have kindly built for us through their APIs, and boom, you have generative AI in production. Hello and welcome to Coffee with Coalesce, a monthly podcast about all things data and the trends and technology transforming our industry. I'm Armand Petrosyan, CEO of Coalesce, and here with me is my co-founder and CTO Satish Jayanthi. Together, we'll be your hosts for the next hour. Hello, everybody.
I'm super excited to have a great guest here, Mr. Lior. I think, Kent, you've been on here numerous times in the past, so maybe a quick introduction from you, Kent, and then our centerpiece guest here, Lior, we'll let you fill in as well, and then Satish. But, Kent, why don't you go first? Do you want to just give everybody a quick background intro on your end? Sure. Thanks. For the folks who don't know me, I'm Kent Graziano. I'm known as the Data Warrior.
I have been in the data space for multiple decades, something like 30 or 40 years. I can't remember anymore. I had the pleasure of working with Armand and Satish for over a decade through numerous other companies specialized in data architecture, data warehousing, transformations, Data Vault in particular. I was the chief technical evangelist at Snowflake for six years, and then retired and had people like Armand saying, well, no, you can't retire, dude.
We need you as an advisor, so I spend most of my time these days as a strategic advisor for basically a lot of folks in the Snowflake ecosystem. I do have my own podcast that I do called the True Data Ops podcast, and Lior's Better Half was on that. That's awesome. In the last year, I got to talk to her about Monte Carlo and all things in that range. That's awesome. That's awesome. Hey, very cool, Kent, and you mentioned it, the Better Half.
I oftentimes call Satish my Better Half as far as the co-founder conversation goes, but Lior, this is actually your Better Half. You're married to Barr, the CEO, the co-founder. You co-founded the company together. Can you give a quick background yourself and also just for all the people that are tuning in right now, I would love to hear a founding story. Just talk us through how you both decided to start Monte Carlo, start the company.
Anything you'd like to share on your end would be awesome to hear. For me personally, and I think everybody else, that's on as well. Yeah, absolutely. Thanks for having me. It's so fun to be with such a great group of people in data. I have Barr as my Better Half both at home and at work. It makes things easy. My background, I'm a software engineer by training. By trade, I'm probably some mix of software engineer, data engineer, data scientist.
I've become the jack of all trades because I've always worked in startups and was excited about things that you can do with data. I started working on machine learning projects. I'm worried about disclosing how long ago, but the story that started Monte Carlo actually started with my previous startup, which was in the cybersecurity space. We basically used data and analytics to help companies manage and control their data, especially their sensitive data.
We got acquired at some point by a larger cybersecurity firm called Barracuda, which is where I spent, I think, over three years and led the engineering team there. One of the things that I spent the most time on was actually building machine learning models so that we could help our customers identify types of fraud that are difficult to identify with rule-based systems, and that product became extremely successful.
I think it was the fastest growing product that Barracuda had since its inception. It got quite a bit of adoption and helped, I think, millions of people. The flip side of that is that when I was thinking about the times that we disappointed our customers, it was primarily when our data was wrong, when something in the pipelines that feed our models, or feed the features consumed by those models, broke.
In a way, that was far more dominant in creating issues and frustrations for our customers than what my software engineer self would consider traditional downtime, like your code is broken or your infrastructure isn't running fast enough or those kinds of things. When you think about this, you realize that on the software engineering side of the house, this methodology of how to make things reliable has existed for a while. We generally call it DevOps today.
It's something that people have been practicing for decades, essentially. There's a very good understanding of how you do this, what is the process, and what are the tools that you need in order to do that, starting from... There's hundreds of different tools, but starting from a CI, CD system and all the way up to observability and monitoring things production, all of that to a large extent did not exist on the data side of the house.
All the stuff that we built in terms of data pipelines and machine learning models that run in real time had almost none of that. Even worse, it wasn't even clear how to do this, like what is the process by which you create reliable data and reliable models. That's a little bit where my inspiration for Monte Carlo came from. Barr, independently, was working more on the analytics side of the world.
She was leading operations in an enterprise SaaS company that basically helped customer success teams use data to act on churn and upsells and things like that. Were you two together at the time or no? Did you meet later after this? Oh, yeah. No, we weren't married. We've been married for... Okay. Yeah. Way before Monte Carlo existed. Yeah, yeah. Got it. Okay. So, marriage came first.
And what happened was, Barr left her job and I was helping her after hours, like over the weekends and nights, because she was thinking about starting a company. I'll keep her story shorter, but basically she ran into a similar set of challenges in the analytics world. So I had unreliable machine learning models. She had unreliable dashboards that caused customer frustration, et cetera, and it clicked for us. I was helping her, again, as a supportive husband.
And Barr kind of thought, oh, this is interesting. She started researching the space and trying to understand whether we just suck at our jobs or whether it's something that a lot of people experience. And she did find out that this is a common problem that pretty much everybody building with data is experiencing. And it was kind of like, ooh, it might be interesting to go out and solve this, because this is a very important problem.
And the use of data, the use of machine learning in production, was definitely on the rise, which still holds true today. And I wasn't actually planning to join forces and work with her, but there was a mutual friend of ours who was working at Snowflake at the time, still works there today, and Barr went to consult with him and get feedback about what she was working on. And he basically said, oh, you know, Lior has the perfect background.
He worked on fraud detection and data analytics, and he's cheap labor. So why don't you get him into it? Yeah, it doesn't get cheaper than free. I would imagine you weren't charging her for this consulting. I see. There's nothing like a family-run business. I was not. The most I ever got was maybe, I don't know, help with the dishes or something like that. Yeah. So me, the cheap labor, was asked to join the team, and Barr was very clever.
She would go out there and talk to prospective customers, like future customers of this, trying to research the market and understand how people are tackling this and what the level of pain is and so on. And she cleverly invited me to join a few of those. And then, you know, it became pretty evident that if we were able to solve this problem, it would be very meaningful. Like this is something that people, you know, lose sleep over.
And this is something that would make for a fantastic company that could have a lot of impact. And those are the type of opportunities that you only get a handful of times in your career, maybe only once. I couldn't pass it up and decided to join her full time, which I did several months later. And we started Monte Carlo to basically help companies deal with those sleepless nights and frustrated stakeholders.
And at the end of the day, Monte Carlo takes a lot of the ideas that we both learned, you know, from Barr working on the analytics side and trying to operationalize it, and from me kind of having that DevOps discipline. We applied a lot of those ideas, both in forming the methodology of how to create reliable data products, and obviously we were also excited to build the technology that supports it. And at the end of the day, Monte Carlo is an observability tool.
It's probably the equivalent of a Datadog or a New Relic. And, you know, in the same way you use those to review the reliability of applications, of infrastructure, and increasingly of security, Monte Carlo is how you do that in the data stack, right? With Snowflake and Looker and a million other tools that data people have adopted. And that's what we've been doing since, and it's been an exciting journey so far. That's amazing.
Wow. Yeah. Sounds like a great journey for sure. I think. And we're still married by the way. Still married. Yeah. That's good. That's good. That's most important. Hopefully this is a unifying thing that you've got through together and you've clearly had a lot of success and congrats on that so far. We certainly know how it goes, at least to be co-founders, both Satish and I obviously co-founders here at Coalesce. But this sounds like a totally different ballgame. You're married to the person.
It's amazing. I've got plenty more questions around all that, but that's such a good background, and I'm definitely familiar with the phase of not taking a paycheck, starting the company. Like, both Satish and I did this for no money the first year, year and a half, I think it was, Satish, when we started Coalesce. But anyways, I'm CEO of the company. Satish, I'll let you introduce yourself, and then let me jump into a couple of questions I've got for Lior and Kent.
Sure. Hey guys, Satish Jayanthi, CTO, co-founder of Coalesce. My background is, basically, before Armand and I started working together, I was on the other side, actually creating those problems that, Lior, you were alluding to — building pipelines, but then you have data issues. But essentially being on the engineering side, managing data teams, solving business problems for large enterprises. That was what I was doing.
Cool. So you certainly experienced some of the issues that Monte Carlo aims to solve. I'm curious, were there any specific use cases? Especially because, from my understanding, Monte Carlo was either the first or one of the first pure-play data observability products in the market, right? As the modern data stack expanded, you saw all these different solutions appear for specific issues as data has become democratized and just so much more common. And so were there specific use cases?
You mentioned fraud detection was one that you were exposed to firsthand. Was that the initial beachhead use case that you were looking at when you approached starting the company? What were the first couple of things where you were like, okay, we definitely need to solve this right out of the gates? Yeah, great question.
Putting aside my speculations back then from five years ago, I think probably the biggest surprise to me starting Monte Carlo and then kind of living through it was that it is not very use case specific, right? I thought a lot of our customers were going to be essentially tech companies using data for fraud detection and other places where data matters.
What we learned though, and it's quite incredible, is that every single industry you can think about is using data today, and using data in a meaningful way. And so our customers ended up coming from everywhere, and this is from the early days even. So of course, all the prime suspects are there, right? Like you'll find tech companies, you'll find e-commerce, financial tech companies, right? All these are there. But you will also find a lot of manufacturing and a lot of education.
And pretty much any sector of the economy that you could possibly imagine is using data in a meaningful way. Kent, you probably saw that. Yeah, no, I'm just thinking through all the companies I've worked for over the years, and really the whole — what we now call observability — has been, like you said, a problem just in the analytics space, which is where Barr came out of. There were always those questions about, can I trust the data in this dashboard or in this chart?
How come I'm getting two different customer counts from these two different managers out of our data warehouse, right? It was like, where's that data coming from? And trying to prove it to the CEO — somebody having to go through piles and piles of hand-coded ETL code to figure out, well, how did we end up with this number over in this mart, and why did we get a different answer over in that mart? Yeah, it is every industry.
I mean, starting in the mid-nineties with data warehousing really starting to boom back then, I saw that trend and this idea of business intelligence take hold.
Okay, at the time, some of us thought it was an oxymoron, granted — can we actually have intelligence in business? — but everybody needed that data, and you could see that anybody who was going to be successful, regardless of the industry, like you said, they needed to use the data effectively. But then you get down to the data governance and the reliability, the auditability, and the overall trust factor of that data.
The more important the data became to a company, the more important all of those things became, right? We've got to be able to trust that data. And now that we're moving into AI and gen AI and your experience in machine learning, it's even more important. It's like, how do you trust the results of a black box gen AI thing if you can't trust the data that went into it? That makes complete sense.
We talk about this all the time, especially with the black box in the AI world: the foundation you're feeding it is so critical. Real quick, and maybe this is for everybody here, but when we think about data observability, some of this feels like it is on the fringes of data quality as well, because we talk about making sure that quality is high. I guess, Lior, just for the audience here, how would you decipher or compare the two?
Is it completely a separate thing, or do you see observability as related to quality? What are your thoughts there? And also, it looks like people are tuning in here — if you have any questions, feel free to ask, anybody that's on the webcast right now. But yeah, can you help decipher that? Yeah, to me, and there are obviously different ideas about this, but the way I view it is that data observability is an extension of data quality.
I think a lot of data quality, both the concepts and the tooling around it came from this viewpoint of I'm going to do something once very manually. I'm going to take data and ingest it, clean it, transform it, and put it in the binder. That's where the methodology came from. It's very much oriented towards point of ingestion or very, very specific parts of the pipeline. It's very focused on exclusively the data and the rows. That's critical. That's building blocks.
You can't get a reliable dashboard if the numbers are broken or if the data that was ingested has values that shouldn't be there. That's absolutely a critical thing. It's a big part of data observability, a big part of what we do today. The thing where data observability took it a step further was, hey, look, we're not putting data in binders anymore. We're not doing this pull once a quarter that we analyze and scrutinize and have a person look at and manually transform.
We're not doing that anymore. In a modern company, there could be hundreds and thousands of people that are staring at dashboards every day to do their jobs. There's models that are making decisions on behalf of the business every single day or billions of times a day sometimes. That idea no longer works. You have to think about how do you scale this thing?
How do you make sure the entire system, all the way from the data that gets ingested, then through typically dozens or hundreds of steps of transformation, all the way down to the end product, be it a dashboard or a model or whatnot — how do you make sure this whole thing works reliably? One part of it is definitely making sure that values are correct in a sense, or meet certain business rules.
But you really have to start thinking about every single step of the way of this long pipeline: how reliable is it? How healthy is it? And you have to think about it along several dimensions. These systems are pretty complicated. They have the data that's flowing in — there's this external input that you're taking from either another team in your own company or sometimes from an external source, and that can change unexpectedly in ways that you don't anticipate.
You have the code that you're using to transform all that data. You're usually, again, applying at least several dozen steps of calculation. And that can change. You're hiring people to build those pipelines, to make those pipelines better. They're going to change the code. The code is going to have unintended consequences — it just happens. And then the third piece, of course, is infrastructure. All of this is running in a variety of tools.
And all these tools work and combine together in sometimes mysterious ways to create the end product. And you have to understand how reliably and how healthily all these things are working and how they're combining to create the final result. And that's probably the biggest philosophical difference for data observability, and where it extends data quality.
In practice, this means that a data observability solution will give you tools that allow you to look at all of the tables that you have, not just at the point of ingestion or consumption, but at all of the different steps of the pipeline. And it will try to measure health at every single step. And it will try to give you meaningful alerts. And, even more importantly, it will give you meaningful context about those alerts. Like, OK, there's a problem here.
The data is wrong one way or another. Where is this data coming from? What happened there? Did someone change the code there? Did someone or the data that you ingest change in some way that you didn't expect? Did the infrastructure that was running all of this have a certain issue, performance or errors or otherwise? All these things are combined together into the single pane of glass that gives you visibility into data quality and other things that are important for the...
So this is really, I'll say, automating the overall monitoring of what's happening in the data ecosystem, right? There's just no way to scale without doing the automation. Right, right, right, right. And that's exactly where data quality, quote unquote, struggled in the past, right? Yeah, running one SQL script occasionally, like before you move something to production, OK, so the code works today on the data we're looking at today.
But like you said, something changes in a source system or a rule changes, somebody builds this pipeline a little different, and now you've got data flowing into tables that, is it really still right? Uh-huh, yeah, yeah, yeah. And these things break, right? Like they do. It's the nature of complex systems, right? Yeah, yeah, that's helpful. I love that.
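To make that contrast concrete, here is a minimal, hypothetical sketch of the "classic" one-off quality check being described: a single hand-written rule run once against whatever the data looks like that day. The table, values, and threshold are made up for illustration; the point is that nothing here re-runs or adapts when a source system or upstream pipeline changes later.

```python
import sqlite3

# A made-up, one-off quality rule: run once before promoting a pipeline,
# against whatever the data happens to look like today.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 101, 25.0), (2, 102, 40.0), (3, NULL, 13.5);
""")

null_customers, total_rows = conn.execute(
    "SELECT SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END), COUNT(*) FROM orders"
).fetchone()

# A hand-picked threshold that reflects today's data; nothing re-checks it
# when a source system or an upstream transformation changes next month.
if null_customers / total_rows > 0.01:
    print(f"FAIL: {null_customers}/{total_rows} orders have no customer_id")
else:
    print("PASS: quality rule satisfied (for today's data)")
```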
And I like the way you just gave a high-level comparison to something like Datadog on the software engineering side, but applied it to the data pipelines people build and manage. Any questions on your end that you're curious about? Well, you know, it's the age of AI here, and since Lior has got the background in machine learning and all that, I was really kind of curious as to what you are seeing with your customer base today.
Are people really getting into gen AI and machine learning? Are they just dipping their toes in? And how many are getting past the sort of "well, let's try it out with ChatGPT" experiments and really looking at putting stuff into production, and using your product as a way of monitoring those pipelines to make sure that everything's good? Yeah, great question.
It's been really hard not to hear about generative AI lately. I've been trying sometimes to not hear about it, and you still do, right? And so I think, kind of like you called out, I want to say that, you know, 80 or 90% of teams that you talk to have plans around it and have taken at least some steps to experiment with it, to understand it, and to do things with it.
But I'll also call out there's a pretty broad range of maturities around that, where some teams have gotten all the way up to, you know, a customer-facing production app that leverages gen AI, and a lot of companies are still in the phase of figuring out what to even do with this and how. And we see it across the board. And I guess what I'll kind of call out here is it's hard to productize generative AI. It's not a trivial matter, right?
And I think a lot of us, Monte Carlo included, started from the block and tackle stuff. Let's take stuff that generative AI is pretty good at, you know, and maybe code generation is something that gen AI does fairly well out of the box. And let's plug it into our applications where we need to perform those tasks, right? And we'll use the foundation models that OpenAI and others have kindly built for us through their APIs. And boom, you know, you have generative AI in production.
And that is probably the most common, you know, success we've seen. And there's a good number of companies that have been able to do that. Monte Carlo is one of them, by the way. Like, we do use generative AI in our product in a number of use cases. And it's gotten good adoption and good feedback. Can you talk about that a little bit? So, as you mentioned, it's difficult to productize.
Like when you think about Monte Carlo as a product, leveraging gen AI to impact your customers in some different ways, like what are some of the use cases that you saw were low hanging fruit or opportunities to leverage LLMs when it comes to your value proposition? So in our world, and I'm also happy to share examples outside of data observability, but in our world, code generation is probably big, right? Like we do help our customers deal with code in various ways, right?
Especially SQL most commonly, but, you know, we do help our customers process logs from their data warehouse, for example. Makes sense. Our customers do use — sorry, they use SQL queries to basically define quality rules about, you know, the data that they have. And so we've found different ways to help them create that code, debug it, optimize it, things like that.
And so all of that is built into Monte Carlo, and those have been some of the first implementations. Yeah, that worked pretty well. We've seen similar patterns with our customers, specifically around code generation usually, and summarization is also something that works pretty well.
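As a rough illustration of that kind of code-generation assist — a sketch, not Monte Carlo's actual implementation — this is roughly what calling a foundation model through its API to draft a SQL quality rule might look like. The prompt wording, table name, and model name are assumptions.

```python
from openai import OpenAI  # assumes the openai package and an API key are configured

client = OpenAI()

# Hypothetical prompt asking a foundation model to draft a SQL quality rule.
# The table name and wording are placeholders, not a real product flow.
prompt = (
    "Write a SQL query for Snowflake that returns the percentage of rows in "
    "analytics.orders where customer_id is NULL over the last 24 hours."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# A human still reviews, tunes, and tests the generated SQL -- it's a copilot.
print(response.choices[0].message.content)
```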
So how accurate is it usually? We've gone through internal POCs at Coalesce as well — like, for example, creating a join or something using gen AI. What we found is it gets you pretty close. It's not a hundred percent accurate. Is that similar to the experience you've had as well? Yeah, absolutely. It's a copilot. Right, right. Isn't that what Snowflake called theirs? Isn't that the Snowflake Copilot? Yeah, there's, like, Cortex. So we'll be doing a demo with Doug and Snowflake later this month. That's on a different theme.
We've got demos with Doug. Super fun. If you haven't tuned into one of those, definitely check it out. Doug's amazing. But Snowflake is going to be coming on to the next one. So if you haven't, you can check it out. It's a great one — they're coming on to talk through Cortex, which is a lot of their gen AI copilot functionality, but similar themes.
So it sounds like, for you, that has been the first use case that you saw as an opportunity, and that's helped customers at least cut down on some of the time they would have spent, and maybe brings down the skill set required a bit for anybody. It saves a ton of time, right? Even for an experienced engineer, writing SQL is tedious, right? I think where we saw the biggest jump in functionality, though, is when we started — so, one thing is getting the user experience right, and the right user experience for generative AI is, not surprisingly, an interactive experience, right? If you just try to throw answers at people, they will get limited value, because of what you mentioned: how close is it, does it need refining and tuning, does it need more context. I think the other part that really made this so much better, where our customers started getting a lot more success with it, is when we started incorporating proprietary information into the process, right? And you can think about it as, in our case, a very simple version of RAG — retrieval-augmented generation. When we started augmenting the information that the user provides about what they want to do with information that we already have about, in this particular case, the user's data ecosystem — starting from simple things like what tables they even have, what columns those tables have, and what we know about them, and then you can get more advanced — the results are so much better and more personalized, in a sense. And it also creates a more differentiated experience, if you will, compared to going to ChatGPT, right? Because if we just use the APIs as they are, I mean, it maybe saves you a couple of seconds of going to another tab, but we wouldn't be offering anything that is really better than going to chatgpt.com or whatever the URL is, I forget. And so where it really became nice for our customers is when we started incorporating our proprietary information — not proprietary, but information that we have about the customer — into that experience and into the model.
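A minimal sketch of that metadata-augmented, RAG-style flow, purely illustrative: the table catalog, prompt wording, and model name below are assumptions, and a real system would retrieve this context from collected metadata rather than hard-code it.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical metadata we already hold about this customer's warehouse.
# In a real system this would be retrieved, not hard-coded.
schema_context = {
    "analytics.orders": ["order_id", "customer_id", "order_ts", "amount"],
    "analytics.customers": ["customer_id", "region", "signup_ts"],
}

user_request = "Alert me when daily revenue by region drops unexpectedly."

# Fold the known schema into the prompt so the model grounds its answer
# in tables and columns that actually exist, instead of guessing.
context_lines = "\n".join(
    f"- {table}: {', '.join(cols)}" for table, cols in schema_context.items()
)
prompt = (
    f"You may only reference these tables and columns:\n{context_lines}\n\n"
    f"Task: {user_request}\nReturn a SQL monitor definition."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```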
I was going to say, Satish talks about this pretty often too. It feels like it's a race to be able to train the model itself, and that's really where the value is, versus just some public API. It's about the metadata, right? Or RAG, right? And personally, I think RAG is the easier thing to do, in a sense, but yeah, I think the point is correct. Satish, I'd love to hear your thoughts as well, but really the hard part — or the secret sauce here — is the data that you have. However you choose to incorporate it into the model, into the application — RAG, or fine-tuning, or, you know, training your own foundation model if you really want to, which is hard — you kind of have to do that to make generative AI effective. Otherwise you're nothing but a wrapper for GPT, right? Satish, any thoughts on that? I know you obviously built some of the world's largest, most complex data warehouses; you've probably seen these problems over and over again as far as it relates to gen AI, data observability, training models. Yeah.
And I think, you know, implementing the augmentation piece is probably the easiest of all the solutions out there to improve and add value, as opposed to the generic API that's available — that's for sure, because training, well, everybody says training and fine-tuning, but training and fine-tuning is not that easy. So that's that. But as far as observability goes, I have a question for you, Lior. You know, when we implemented data quality in my past life, just like we discussed, hey, you write some SQL to test something, either at the beginning of the pipeline, in the middle of the pipeline, or at the end of the pipeline, and then we say, hey, here's my first rule, here's my second rule, third rule. And once you get to a dozen rules, then you're kind of getting to a point where you're losing control of what is happening, and you don't have a proper structure. So my question to you is, if these companies are getting started, let's say, on data observability,
what would be some of the things that they need to be thinking about from a best-practices standpoint, or how do they start? Obviously you don't want to boil the ocean, but what's the best way to get started with observability? Great question, and very relevant to generative AI as well. The way to do it, in my opinion, is several things. The first part is to leverage automation. You can absolutely go ahead and write a lot of rules, but like you called out, that gets really complicated really quickly, because it's hard to anticipate all the things that are going to break, it's hard to manage the configuration and thresholds and whatnot, and it can create a lot of noise as a result, which then alert-fatigues people and makes the whole initiative fail. So the first thing is to leverage automation. A lot of the stuff that we built into Monte Carlo is this ability to automatically collect a lot of health metrics about the data, starting at the pipeline level or table level — things like how recently the table was updated, or how many rows it has, and does it make sense for it to have as many rows as it does today — and then going, sometimes, many levels deeper into the data itself. Again, starting from the basic stuff, like how many nulls you might have in a particular field, or how many unique values, and then going down into as sophisticated a metric as you want to measure the health of your particular data set in your particular business. And we built a lot of tools to make that scalable, whether it's the ability to collect a lot of those metrics with a single click or a single line of configuration, which lets you apply a lot of these metrics across a lot of tables at once, or whether it's machine learning models that help set thresholds in a way that may not be perfect or exactly what a human with a lot of context would have set, but they're very good sanity checks, and they put a lot of confidence into the pipeline without having a human go and manually set up a lot of rules. So I think that's the first piece of it.
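As a toy illustration of that kind of automation — not how Monte Carlo actually computes its monitors — here is a sketch that learns an expected range for one health metric (daily row count) from recent history and flags values that fall outside it, instead of relying on a hand-set threshold. All numbers are invented.

```python
from statistics import mean, stdev

# Hypothetical history of daily row counts observed for one table; a real
# system would collect metrics like this automatically across many tables
# (freshness, volume, null rates, unique values, and so on).
row_count_history = [10_120, 10_340, 9_980, 10_410, 10_250, 10_180, 10_300]

def looks_anomalous(history, latest, sigmas=3.0):
    """Flag a value that falls outside a band learned from recent history,
    rather than a threshold a human picked by hand."""
    mu, sigma = mean(history), stdev(history)
    return abs(latest - mu) > sigmas * max(sigma, 1.0)

todays_row_count = 4_950  # e.g. an upstream job silently loaded half the data

if looks_anomalous(row_count_history, todays_row_count):
    print(f"Alert: row count {todays_row_count} is outside the expected range")
```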
The second piece of it is that rules have a very important place, and our customers build a lot of those, in various forms. The trick is to also create the operational discipline around it, I think: making sure that it's clear what needs to be monitored — mapping, hey, there are these critical data products that I'm trying to make reliable, whether it's the dashboard that the CEO uses or the table that feeds into my latest and greatest generative AI application. I need to understand what it is that needs to be reliable, what the SLAs are, what feeds into that, what the breaking points are, and how to monitor for those issues, and make sure these things go to the right people. Because I could send all of my alerts to one single channel and hope that something happens — what happens is that everybody ignores that channel. What we need to do is make sure it goes to the right person at the right time for them to act on it. So you need both the organizational discipline and the tooling, again, to make sure all of this is possible, and to attack it in kind of a methodical and measurable way. That's the other part: if you're a small team, it may not be a big issue, but some of our customers have thousands of people building things with data — literally thousands — and it gets out of hand very quickly, way before you get to thousands of developers. So you need to start measuring how reliable different data products are, and how rigorous different teams are in managing reliability and doing the operational stuff, and you need the visibility into that in order to drive accountability and, eventually, trust. So those are probably the three key elements, I'd say, in terms of actually rolling it out in a sizable company. So you've got to have the right people and processes agreed upon, in addition to having the technology, right? Otherwise it becomes chaos. And I always come back to this, don't I?
Because, like you said, you've got to have agreement; otherwise, you know, people are going to be, like you said, overloaded by the alerts. Well, if they're getting overloaded by the alerts, then we picked either the wrong people or the wrong process, or both. And how are we going to take advantage of all of this information, right? It's like, the technology — you guys have done that, right? You've got the technology. Now it's, how do we deploy it properly? And then we get to talking about the culture, the data culture, right, and data literacy. You know, are we getting it to the right people, who understand what that alert even means? Right. Yeah, yeah. Hey, there's a quick question in the comment section, from James Daly, that kind of ties into where we're going to close it out. We've got a few minutes left, so I just want to make sure we get this question asked — and I remember you, James, we worked together in a past life, so love seeing you on here. The question is around the merging of proprietary information to augment AI, which can be a tricky topic for some organizations: customers need safeguards that their proprietary information is not used to train models shared with the general public. Interested to hear how Monte Carlo manages that fear. Any thoughts there? And do we see this as a theme coming up in 2024 for companies as they express that towards vendors like ourselves? Yeah, such a great question. First of all, I'll admit, Monte Carlo doesn't necessarily manage this directly, so
we're usually not in that path. But I do have a point of view on that. And I think, as you call out, James, to adopt generative AI with proprietary data in an enterprise, there's a bunch of requirements, right, that go beyond the "oh, a demo that I'm going to post on X, or Twitter, or whatever it's called." You need to start thinking about what data goes where, right — which is what you're pointing out around privacy, compliance, data security, things like that; you need to think about these things strictly. You need to think about scale, right? It's one thing for me to build a demo that works on my computer, but how do I do that in a way that serves my customers, which could be many, and they might not be talking to me or be in a tightly controlled environment? And you need to think about trust, right? Like, how do you make this thing, you know, produce the right results? And that ties into data quality, data
observability, and some of the things we discussed today. To answer your question specifically about security — and I'd argue the same argument would apply to the other two — this is a place that partially makes me say RAG is probably the architecture of choice. It is not only easier to implement from a technical standpoint, but it also very naturally lends itself to all of these questions, right? If you go ahead and train your own model, there is literally — I'm not aware of any way you can control who gets what data, right? You can block the model from talking to certain people, but whoever gets access to the model gets access to all the data that was fed into it, essentially, plus some hallucinations, but that's a different story. In a RAG model you can actually control that much more tightly, right? Like, we've solved the issues around data privacy in the database world; we've been doing it for a while, and we have a lot of good constructs there. Snowflake has a lot of good controls, you know, other solutions too. And so, if you go down the RAG route — and RAG, I'd argue, can actually solve a lot of the problems that we want to solve with generative AI — you can actually control security and privacy very well with something that already exists today. It's not like me talking about some futuristic capability. You can actually make sure that the person that is using the app gets access to exactly the data set that they should be getting access to, according to their role or user ID or, you know, whatever it is. And that is a very effective way to do that. If you're going into the fine-tuning and training world, which has some merit obviously in a lot of use cases, then that's a whole other ballgame, right? That's about, you know, managing different models for different people, and that could be really, really hard to do at scale. For example, you know, if you have millions of users — some of our customers do, or tens of millions — you can't maintain a model, or it's very, very difficult to maintain a model, that is custom trained and custom built for every single customer. You kind of have to do it with RAG, and you can use the security and privacy controls that have been created around databases to really accomplish that same objective with generative AI.
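A simplified sketch of that retrieval-time access-control pattern, with a made-up in-memory "retriever" standing in for a vector database or warehouse with row-level policies: the user's role filters what can be retrieved, so only permitted context ever reaches the prompt. The documents, roles, and query are all hypothetical.

```python
# Made-up documents with access tags, standing in for rows or chunks stored
# in a vector database or warehouse that already enforces row-level policies.
documents = [
    {"text": "EMEA Q3 revenue summary", "allowed_roles": {"finance", "exec"}},
    {"text": "US payroll details", "allowed_roles": {"hr"}},
    {"text": "Public product FAQ", "allowed_roles": {"finance", "exec", "hr", "support"}},
]

def retrieve(query: str, user_role: str, top_k: int = 2):
    """Enforce access control at retrieval time, so the model only ever sees
    context the requesting user is entitled to."""
    permitted = [d for d in documents if user_role in d["allowed_roles"]]
    # A real retriever would rank by similarity to the query; we just truncate.
    return [d["text"] for d in permitted][:top_k]

context = retrieve("How did EMEA do last quarter?", user_role="support")
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)  # for a 'support' user, only the public FAQ makes it into the prompt
```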
Yeah, it basically helps reinforce that, because technically, typically, they use a vector database or something to hold this information, right, and that has all the database permissions and policies that you can use. Most people are familiar with those types of security mechanisms anyway. Yeah, yeah. Cool. Well, we're a little bit past time. I could stay on this all day. Lior, it's been awesome having you as a guest. Kent as well — it's always a pleasure. Thanks, everybody, for hopping in, and I'm looking forward to the next one of these. It's been awesome having you on, both of you. Yeah, thank you so much, and thanks, everybody, for the great questions and for jumping on with us.