Relational Foundation Models for Enterprise Data with Jure Leskovec - #768 | The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

⁠¶ Introduction to Relational Foundation Models

00:00

The recent breakthrough that we had, and we just released the second version, is what we call a relational foundation model. That's a foundation model that can reason over structured relational data. And it's crazy what this model can do. It can make accurate predictions on any database and any predictive task.

00:30

🎵 Music

00:45

Alright everyone.

00:47

Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Jure Leskovec. Jure is co-founder and chief scientist at Kumo and a professor at Stanford University. Before we get going, be sure to hit that subscribe button wherever you're listening to today's show. Jure, welcome to the podcast. It's great to finally connect with you.

01:11

Yeah, great to be here.

01:12

I'm looking forward to our chat. We're going to be digging into your work on relational learning, as well as some of the other interesting things you're up to at Stanford and around AI for science, and more. But let's start there. Tell us a little bit about your research focus.

01:33

Yeah, great. So I'm a professor at Stanford here in the computer science department. You know, where the future happens, I like to say. So

⁠¶ AI for Science and AI Virtual Cell

01:43

There's always exciting research going on. Our focus recently has been, I would say, on two areas. First is AI for science, and in particular we have a project that we call AI Virtual Cell,

01:57

where we are basically building next-generation foundation models that allow us to represent human cells, patients, as well as individual molecules in cells, and allow us to reason across this complex biomedical data for, you know, discovering new cancer therapies, molecule design,

02:21

reasoning about all the different biomedical data modalities, objects, and how they interact with each other to help speed up science. So it's everything from foundation models at the lower level of understanding proteins, to models that aggregate, let's say, the molecules in the cell to represent a single cell, and then the next-level models that say, oh, you know, a tissue or a patient is a collection of cells, and cells are collections of molecules.

02:55

Let's build models that just aggregate all this knowledge in a very faithful representation, let's say, of a patient. And that helps a lot because now the representations we have are much more robust,

03:09

driven purely from the data. No biology, in some sense, is inserted in the model. Everything is emergent out of the data. And it's amazing how much we can learn from that. So that, I would say, is one line of work we've been working on.

⁠¶ Training the AI Virtual Cell

03:24

And I can't help but hit pause and ask: do you train this all end to end, or are you training an individual model or representation at a time and then aggregating after you've got these models defined?

03:39

Great question. So the way we are doing it right now, actually the first scientific question was, is this even possible? Right? Could you say a cell is a representation of the molecules that are inside the cell? So now let's say the molecules inside the cell are the proteins.

03:55

I can use the protein language model to now represent every protein in the cell. And now the cell needs to aggregate information from all these proteins to say, I am a cell, this is my... Right, now that you have a representation of the cell, you can build, let's say, a patient-level model that says a patient is a collection of cells in a given state that are composed from the proteins that are in there. And are we able to kind of collect this information over these

04:23

orders-of-magnitude different scales to get a strong data-driven representation of the underlying patient in this example? And the interesting thing is that this is doable, and it's trained purely in an unsupervised, self-supervised way, right? So you don't need to insert any human bias, any human knowledge of biology. The biology emerges from the data itself, right? Like cell types, cell states, relationships between them.

04:58

That kind of human biology, how we describe it, actually emerges directly from the data, right? So the model learns how to best describe the underlying processes and phenomena without us pushing it on it from the top. That's kind of the exciting, interesting, emergent capability there.
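A minimal sketch of the hierarchical aggregation idea described above (proteins to cell, cells to patient). The abundance-weighted mean, the plain mean over cells, the function names, and the toy two-dimensional protein vectors are all assumptions for illustration; the conversation does not say how the actual models pool information.

```python
from typing import Dict, List

def cell_embedding(abundance: Dict[str, float],
                   protein_vecs: Dict[str, List[float]]) -> List[float]:
    """Hypothetical pooling: abundance-weighted mean of protein embeddings."""
    dim = len(next(iter(protein_vecs.values())))
    total = sum(abundance.values())
    out = [0.0] * dim
    for protein, count in abundance.items():
        vec = protein_vecs[protein]
        for i in range(dim):
            out[i] += (count / total) * vec[i]
    return out

def patient_embedding(cells: List[List[float]]) -> List[float]:
    """Hypothetical pooling: plain mean over the patient's cell embeddings."""
    dim = len(cells[0])
    return [sum(c[i] for c in cells) / len(cells) for i in range(dim)]

# Toy protein embeddings (in practice these would come from a protein model).
protein_vecs = {"P1": [1.0, 0.0], "P2": [0.0, 1.0]}

cell_a = cell_embedding({"P1": 3.0, "P2": 1.0}, protein_vecs)
cell_b = cell_embedding({"P1": 1.0, "P2": 3.0}, protein_vecs)
patient = patient_embedding([cell_a, cell_b])
```

A learned model would replace both pooling steps with trained attention or message passing; the point here is only the nesting of representations across scales.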

05:18

And tell me if this question makes sense. I think it's related to the way you're describing the training process, but is the data set that you're training on mechanistic in nature or behavioral in nature? In the sense of, are you observing some behaviors of cells and then training on that data, and some kind of faithful representation of mechanisms is emergent, or does the data have mechanistic properties?

⁠¶ Single-Cell RNA-Seq Data

05:49

The data we are using in this case is called single-cell RNA-seq data. This is data that large international consortia are collecting. But basically what it says is that you can take some sample from, let's say, some tissue

06:04

and then for every cell in that sample you measure the number of different protein molecules inside that cell. So every cell is now represented by a twenty-thousand-dimensional vector that tells me the abundance of that specific protein in that specific cell. Right, and every cell has different...

06:37

So that's the raw input data. And then of course, because we know what the protein is, we can actually bring in the protein information through ESM or through AlphaFold, and then it gets very...

06:50

So being protein based, that brings in both mechanism and behavior.

06:55

Exactly, exactly. And then of course you can connect this all the way to the phenotype, because what we are doing now is we can take a single drop of blood from a patient, and rather than running a classic blood screen, we can do the single-cell RNA-seq analysis. So now we basically can profile every single cell inside a drop of blood. And why blood is interesting is because it circulates through the entire body, so it kind of captures the state,

07:23

the immune state of the entire body. So we are able to detect diseases, understand patient trajectories, and things like that, just from this digital twin of a single drop of blood.

07:37

Super interesting. Also very different from the other thing that you focus on, which is relational data.

⁠¶ Unifying AI for Science and Relational Data

07:44

Yeah. Let me tell you a story of why this is not so different. Okay, so what I'm really excited about, kind of fundamentally, or how I approach things, is to always take them apart and understand how different parts interact and how different parts work together.

08:04

And, you know, where I started doing this was actually in a third domain, which is computational social science. I was very excited about how people interact with each other. And when I started my research career, social media had just started as a phenomenon. And my view at that point was, I can use social media as a telescope.

08:28

...through using cell phones, through using social media, and so on. And it's all about networks, graphs of people interacting with each other. ...how the virus will spread as we reopen the economy. If we increase the occupancy levels at different, you know, restaurants, gyms, churches, whatever the locations, to say this is how the virus would spread, this is what you can do. And that underlying was an... So now what is biology?

09:27

And then, you know, what's a tissue, right? It's again cells coming together, talking to each other, organizing in a given way, so that, you know, my skin has a given structure, it has a given set of layers, and so on. So that's essentially a network, a graph of interactions as well. And then, you mentioned relational data, right? So data that sits in tables in a database, that every enterprise in the world has, and it's kind of the most valuable data.

09:54

That's also capturing a graph of interactions of different entities inside that organization.

⁠¶ Limitations of Traditional ML for Structured Data

10:01

When I look into some of the work you're doing around relational deep learning, I think of other conversations I've had that focus on deep learning for tabular data, but that tends to be focused on that single table, as opposed to these relationships that arise in enterprise data, where you've got different tables that are linked by keys and whatnot. Talk a little bit about how these two areas of research and practice relate to one another.

10:40

Let me explain, right? So maybe first, if we think about machine learning, it hasn't really changed over the last, I would say, thirty years. Maybe the... No, not trivia, right? Like, you know, we used to have, I don't know, decision trees, then we had support vector machines, then people liked logistic regression, then people were like, oh, we'll build this deep neural network.

11:10

Then we said, oh, we have gradient boosted trees, they are better, and things like that, right? But fundamentally, it has always been: you have your data, you feature engineer this single table of your features, you add a label, and now you train some supervised model that predicts that label from the features.

11:31

Right. And we've been doing that over and over again. And maybe this predictive model, you know, is a deep model, we would call it, a neural network. But what I would argue is that AI has not transformed this structured data space in the same way as computer vision or natural language understanding have been fundamentally transformed by AI.

11:57

Okay, and let me quantify what I mean by that, right? Like, what was the big breakthrough, both in computer vision as well as in natural language? It was about: let's build neural networks that learn directly on the raw data. In the old days, in computer vision, you would do all kinds of feature engineering, SIFT features, Gabor filters, and it'd be like, I'll describe this image as well as I can, so I can then predict, you know, is there a car in the image?

12:26

All right.

12:27

In NLP it was similar, right? Like, you know, IBM won Jeopardy! with their system. But it was all super hand-engineered, manual, and so on. But, you know, it worked, right? It took 300 people to build it, and it was great, but it was very kind of brittle. So again, the transformers, they just learn over tokens. No grammar, no syntax, it's just, you know, learn over tokens, right? Again, a neural network directly on the raw data.

13:02

The same thing is actually not happening on structured tabular data. Right, there we don't learn on raw data. We run all these SQL queries, all this ETL, all this feature engineering, to then come up with a set of signals from which we, let's say, try to predict. And when we came up with this idea of relational deep learning, our goal was to fundamentally disrupt this and say, hey, why can't I just learn directly over all the relational data?

13:34

And why do we always have to learn over data in a single table?

⁠¶ Relational Deep Learning and Graph Representation

13:40

And the point is that I take this multi-tabular data, and just to be very precise, what's a good example of this? It could be like, I have a set of customers. I have a set of products. So these are two tables. Each customer has an ID. Each product has an ID. And maybe I have a third table that's a set of transactions that says customer ID this bought product ID that at this time for this price.

14:06

And that's a three-table, super simple schema. And of course, organizations have schemas of 50, 60 tables and more, depending on their complexity. So our question was, how could I just learn directly with a neural network over this multi-tabular thing? And the answer, kind of surprisingly simple, is to say: just think of the database, think of these tables, as a graph of relationships between the entities in the database.

14:39

So this would mean, you know, I'm a graph person, so I like to think in terms of graphs, right? Graphs are composed of vertices, the nodes. These would be my users, my products, my transactions, and so on. So these would now be the nodes. And then the connections are just saying:

15:00

This user ID was part of this transaction that was part of that product. And now we have a path from a user to the transaction to the product. And then, you know, another user or another transaction is another path in this very simplistic graph. And now that we have a graph, we can basically apply graph deep learning, like graph neural networks, which is a way to generalize deep learning to graph-structured data, and just train over that to get an accurate prediction.
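A minimal sketch of the "database as a graph" idea from the three-table example above: every row becomes a typed node and every foreign key becomes an edge. The table contents, the function names `build_graph` and `products_bought`, and the use of plain Python sets instead of a graph library are assumptions for illustration only.

```python
from collections import defaultdict

# Toy rows for the three-table schema from the conversation (values are made up).
customers = [{"customer_id": "c1"}, {"customer_id": "c2"}]
products = [{"product_id": "p1"}, {"product_id": "p2"}]
transactions = [
    {"transaction_id": "t1", "customer_id": "c1", "product_id": "p1"},
    {"transaction_id": "t2", "customer_id": "c1", "product_id": "p2"},
    {"transaction_id": "t3", "customer_id": "c2", "product_id": "p1"},
]

def build_graph(customers, products, transactions):
    """Turn rows into typed nodes and foreign keys into undirected edges."""
    nodes = set()
    neighbors = defaultdict(set)
    for row in customers:
        nodes.add(("customer", row["customer_id"]))
    for row in products:
        nodes.add(("product", row["product_id"]))
    for row in transactions:
        t = ("transaction", row["transaction_id"])
        nodes.add(t)
        # Each foreign key on the transaction row links it to another node.
        for node_type, key in (("customer", "customer_id"), ("product", "product_id")):
            other = (node_type, row[key])
            neighbors[t].add(other)
            neighbors[other].add(t)
    return nodes, neighbors

def products_bought(customer_id, neighbors):
    """Follow customer -> transaction -> product paths."""
    found = set()
    for t in neighbors[("customer", customer_id)]:
        for node_type, node_id in neighbors[t]:
            if node_type == "product":
                found.add(node_id)
    return found

nodes, neighbors = build_graph(customers, products, transactions)
print(sorted(products_bought("c1", neighbors)))
```

A graph neural network would pass messages along exactly these edges, so no join or aggregate has to be hand-written before training.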

⁠¶ Benefits of Relational Deep Learning

15:30

And two things happen. The first thing that happens is you don't have to do manual feature engineering. Right, so it's much faster, it requires much less effort to train these models. And the second thing that happens is your models are more accurate. And then you say, why can my models be more accurate? And the answer is

15:53

very similar to what happens in computer vision, right? If you are saying, I am a human, I know what a car is, so I will build perfect features that detect whether there is a car in the image or not. I know cars, I drive them, I'm such a car expert, I can build the best features for detecting cars. Nobody in their right mind claims that.

16:15

Right? But, you know, in machine learning, data science, prediction, people are still saying, no, I'm the domain expert, I'll engineer... Your features are just some arbitrary, human-biased summary statistics of your data. Right, and a neural network that trains with gradient descent is able to do so much more nuanced, almost like feature discovery, by basically attending over this graph to extract much more signal.

16:58

So we see these double-digit increases in model accuracy, because the neural network is able to extract more signal out of the raw data. And I will just say, you know, full transparency, right: if you are working on a super simple problem that falls on a line, then no neural network is ever going to be better than a linear model.

17:26

Right. So what I'm basically trying to say is, I cannot guarantee that you will always get better performance, because sometimes the data is linear, and if you happen to fit a linear model to it, you already have good performance. There's nothing more you can do. Right. So that's kind of the key idea behind relational deep learning: now we can have neural networks just learn directly on the raw database data.

18:00

You don't need to build these manual feature pipelines and feature stores that are super painful and lead to so many different kinds of bugs and inconsistencies, and information leakage and time travel, which make putting models in production super hard. Rather, just bring the raw data, have a neural network, and get better results that way. That's kind of, let's say, the philosophy and the reasons why we are doing...

⁠¶ Prediction Tasks and Use Cases

18:31

Can you give us some examples of the types of things you're trying to predict with these models? Are you trying to predict things that are primarily about structure, or are you trying to predict, you know, individual values? How do you think about what the models are capable of?

18:47

That's a great question. The way I described the framework right now, it's very generic, in the sense that you can bring any set of tables, any set of connections between them, any set of columns. The underlying mathematical representation kind of remains the same. Of course the underlying graph changes, but the graph neural network or the graph transformer can be applied to that. So what would you want to predict?

19:13

It depends on the data. If you have a transaction graph, for example, where we see great results is on fraud: all kinds of fraud detection, anti-money laundering. Account-level fraud, transaction-level fraud works beautifully, right? You just bring this heterogeneous multi-tabular data together and just learn over it what fraud is. And fraud is interesting because it's so non-stationary.

19:39

You know, fraudsters are trying to game the system all the time. So as a machine learning engineer, you are always behind. Your model is always deteriorating. And you're like, okay, how do I design the next feature? How do I design the next... You have a neural network and you just pick the signal directly out of the raw data. So fraud is an example. Fraud you can think of, let's say, as a classification task.

20:01

Then you can think a lot around regression-type tasks, for example for customer behavior in terms of customer churn, next best action, things like that. ...because it's predicting a link between the customer, the user, and the product. So we've seen great uses of this in recommender systems for ads, product recommendations, and things like that.

⁠¶ Addressing the Multi-Table Problem

20:43

And historically, when I've talked to folks about deep learning, machine learning for tabular data, the results were, I don't know the best way to characterize this. Like, I always get the impression that we're not quite there yet. Would you say the same is true for what you're doing, or is it an issue of, there was this missing link, and that missing link is the graphical structure, and now we have it and we're able to do much more? I'm trying to kind of, you know, ground what you're...

21:21

you know, these kind of broader results of applying these techniques, which have been shown to be extremely effective with text and images, to tabular data.

21:33

I think that's a great point, right? Like, when we say tabular data and tabular machine learning, this is the community that works on single-table problems. The data has already been flattened, pre-formatted, summarized to fit in a single table.

21:49

trying to get better results than we might get with, you know, XGBoost or something like that, right? Yeah.

22:06

XGBoost is still kind of the workhorse. Maybe on individual examples you can do better, but it's still the workhorse. And the reason I'm kind of less interested in this single-table problem is because that's not the right problem to solve. I don't know any organization that has all their data in a single table.

22:31

Right. So the hard part, and where the information gets lost, is when you go from this rich relational structure into the single table. And once you are in a single table, you know, then we are kind of talking...

22:45

It's almost like second-order effects. Did you use this architecture, did you use that architecture, did you use this tabular model or this tabular foundation model or not? Right? Like, all the information is there in that single table, and all the methods are about equally good at extracting it. Right. I think where the difference happens is if you actually take a step back and say, hey, the single-table model

23:09

is not the hard part. It's not, in a sense, general or realistic enough. Where you need to go is the multi-table setting, because that's truly the raw data you have. It's not some summarized, featurized data. It's the raw data. And there is much more signal there that got dropped when the data got flattened, summarized into a single... So to me, single-table problems

23:40

are, you know, solved. I think the differences are kind of second-order effects. What is unsolved is the multi-table problem. That's where the wins...

⁠¶ Benchmarking Relational Data Models

23:53

And so how do you think about benchmarking performance for these types of problems? Are there established benchmarks for multi-table prediction problems?

24:03

Actually, there is quite a lot of single-table data out there because of all the history of machine learning. And I think even when people develop new benchmarks from raw data, they just release that single table, because everyone learns on the single table.

24:19

Right. So what we did at Stanford, we were like, okay, so where is a multi-table benchmark? And there is no multi-table benchmark. Even if you look at Kaggle, out of thousands of competitions on Kaggle, you know, there are four that are multi-table. For all the others, features have already been engineered for you, there is a single table, and you start, you know,

24:39

bagging and boosting and creating tricks until you win, right? So we created a benchmark at Stanford by collecting and curating open, multi-tabular data sets that we were able to find on the web. We call it RelBench.

24:57

We now have two versions of RelBench. It's about 40 different predictive tasks over, I think, about 10 to 15 different databases. And then what's also interesting is that SAP, the big German IT company, released a multi-tabular benchmark of enterprise data called SALT. So those are, I would say, the two big multi-tabular, so relational, benchmarks: SALT from SAP, and RelBench.

25:38

Line of work that they've been doing and promoting here through Stack.

25:43

It also makes me wonder if there's a way to reuse existing benchmarks by like denormalizing, you know, wide single tables or something like that. Is that something you've looked into?

⁠¶ Aggregation and Feature Engineering

25:54

That's a great point. You can try to denormalize, but if you think about it, you can only denormalize one-to-one relationships. As soon as you have many-to-one, you have to aggregate. And that's the key.

26:09

That's where you've lost information.

26:11

Exactly, exactly. And I can dwell on this point a bit, right? Imagine you are doing a churn model. Right? So a customer churn model could be: I have a customer, and here are the historic transactions of the customer. I need to aggregate them. So first I say, I'll count how many purchases you made last month. And then I'll maybe take the median price of those.

26:36

And then, you know, some other data scientist says, no, no, let's take the cheapest price of everything you bought, right? And then somebody says, no, no, you should take the most expensive. And then somebody says, no, it's the average. Another person says, oh, but distributions are skewed, we should take the median. Then another person wakes up and says, hey, it's about shopping in the morning. That's what's predictive of churn. Let's add another feature.

27:00

Right.

27:01

And then somebody says,

27:04

You're like, just give me the data.

27:06

You know, that's what I mean, right? And then you're like, oh, holidays. People sleep longer on holidays. Let's now create a new feature that accounts for holidays. Oh, but then there is the summer daylight change. Let's account for that. You see how kind of ridiculous this gets? Just attend over the transactions and let the attention figure out what's predictive.
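A minimal sketch of why aggregating a many-to-one relationship loses information, as argued above. The two toy customers, the `flatten` helper, and the chosen summary statistics are assumptions for illustration: both customers collapse to the same feature row even though one shops in the morning and the other late at night.

```python
from statistics import mean, median

def flatten(prices):
    """Typical hand-built aggregates for one customer's transactions."""
    return {
        "count": len(prices),
        "mean": mean(prices),
        "median": median(prices),
        "min": min(prices),
        "max": max(prices),
    }

# Raw transactions as (hour_of_day, price); values are made up.
early_bird = [(7, 10.0), (8, 20.0), (7, 30.0)]
night_owl = [(23, 30.0), (22, 10.0), (23, 20.0)]

features_a = flatten([price for _, price in early_bird])
features_b = flatten([price for _, price in night_owl])

# The single-table view cannot tell the two customers apart.
print(features_a == features_b)
```

Adding an "hour of day" aggregate would separate these two customers, but that is exactly the feature-by-feature patching described in the conversation; a model that attends over the raw transactions sees the timestamps without anyone having to think of them first.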

⁠¶ Kumo's Relational Foundation Model RFM2

27:25

When I introduced you, I mentioned that you were co-founder at Kumo in addition to the research. Talk about the relationship between the research and what you're doing at Kumo.

27:34

Kumo is a commercial enterprise-grade platform that allows us to do large-scale relational deep learning models. And we are using this platform to two effects. One is to allow partners, customers to train and tune single-task models over multi-tabular relational data, and I can talk about that part. But the recent breakthrough that we had, and we just released the second version, is what we call a relational foundation model.

28:15

And that's a pre-trained foundation model that can reason over structured relational data. And it's crazy what this model can do: it can make accurate predictions on any database and any predictive task without any model training.

28:39

And I find that proposition to be almost outlandish. Like, they're just numbers with some unknown relationship, and you're going to say that you're going to train a model on just the relationship between random business numbers and it's going to work in some unknown use case. How? Make that make sense to me.

⁠¶ In-Context Learning for RFM2

29:00

Thank you, thank you. I think it's great. I think as I say this, people who listen should be like, what is this guy talking about? So thank you, right? I agree, because it's easy to say, oh, it's a foundation model, woo-hoo, right? Great. But then, okay, what does it really do? So here's maybe how to think about this. The key here is to do in-context learning, right? The same way as a language model does in-context learning, where I give it a prompt,

29:28

I give it the information, I give it a task, and then it gives me the answer. So what we do here is, the system has several components. There is the database. And then there needs to be a way for me to instruct the pre-trained foundation model what kind of prediction I want.

29:49

Right. I want to say, predict me the sum of purchase prices over the next one month for this particular customer. And that maybe is like, I'm predicting how much the customer is going... Or I'm saying, predict me, you know, transaction.is_fraud equals true for this transaction ID. Okay, so this would be like, predict me whether the transaction is fraudulent for this particular transaction ID. Right? So I have a way to specify my prediction.
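A minimal sketch of what a declarative task specification for the two examples above could look like. The `PredictionTask` dataclass, its field names, and the `describe` helper are hypothetical and invented here for illustration; they are not the actual query language of any product mentioned in the conversation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PredictionTask:
    """Hypothetical task spec: what to predict and for which entity."""
    target_table: str
    target_column: str
    aggregation: Optional[str]    # e.g. "sum"; None means a per-row label
    horizon_days: Optional[int]   # look-ahead window; None for per-row labels
    entity_table: str
    entity_key: str

def describe(task: PredictionTask) -> str:
    """Render the spec as a readable sentence."""
    target = f"{task.target_table}.{task.target_column}"
    if task.aggregation is not None:
        target = f"{task.aggregation}({target}) over next {task.horizon_days} days"
    return f"predict {target} for each {task.entity_table}.{task.entity_key}"

# "Sum of purchase prices over the next one month for this customer."
spend_next_month = PredictionTask(
    "transactions", "price", "sum", 30, "customers", "customer_id")

# "Is this transaction fraudulent?"
is_fraud = PredictionTask(
    "transactions", "is_fraud", None, None, "transactions", "transaction_id")

print(describe(spend_next_month))
print(describe(is_fraud))
```

The design point is that the task is data, not code: the same frozen model can be pointed at a regression or a classification target just by changing the spec.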

30:18

And now what the system does is go into the database. It extracts a set of labeled in-context examples that then get passed through a pre-trained neural network to make a prediction. Okay, so now when I say a set of labeled in-context examples, this means that you can take the task...

30:47

for the example of fraud, I've got historical fraud that's already been labeled, and I've got some new transactions coming in that don't have that label attached that I'm trying to predict. For example,

30:59

So let's do fraud. Fraud might be easier, yes. So the way this would work, right, if I say predict the probability of fraud, the system would go into your database and extract previous transactions for which we know whether they are fraudulent or not. For each of those transactions, we would then extract kind of the subgraph of entities around it.

31:20

Okay, so now what the relational foundation model gets on the input is a set of historical subgraphs of previous transactions and their fraud labels, plus the new transaction that is unlabeled, where we don't know if it's fraud. And then this is passed forward through the relational foundation model architecture to kind of label the unlabeled graph, right? Like the unlabeled transaction and the graph around it.

31:50

I'm not sure now that fraud is a good example, because I can kind of see how that could work. Like, you've collected the graph around these known points and you're asking a model to infer relationships that might, you know, lead to this one individual label. And so maybe, and this may be where you're going, I think a regression type of problem would strike me as more challenging than a classification problem.

32:26

Yeah, I think the key here is, right, what are maybe the key components here? First is that you have a language where you specify the task. We can go generate almost like a mini labeled training dataset, these in-context examples.

32:42

And then you have a pre-trained model that is able to take these in-context examples, these subgraphs that have certain columns and tables and so on, and is able to encode them in a domain-agnostic way. And then the neural network is able to essentially build a predictive model in its brain in a forward pass to give you an accurate prediction.
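A minimal sketch of how a "mini labeled training dataset" of in-context examples could be cut out of historical data for the regression task mentioned earlier (sum of purchase prices over the next 30 days). The integer day numbers, the `make_example` helper, and the choice of a single past anchor day are assumptions for illustration; the conversation does not describe the actual sampling procedure.

```python
from typing import List, Tuple

# (customer_id, day, price); days are plain integers to keep the sketch simple.
transactions = [
    ("c1", 1, 10.0), ("c1", 20, 5.0), ("c1", 45, 7.0),
    ("c2", 3, 50.0), ("c2", 50, 1.0),
]

def make_example(customer_id: str, anchor_day: int, horizon: int,
                 rows: List[Tuple[str, int, float]]):
    """Split one customer's rows at anchor_day into (history, label).

    history: only rows at or before the anchor, so nothing from the future leaks in.
    label:   sum of prices strictly after the anchor, within the horizon.
    """
    history = [(d, p) for c, d, p in rows if c == customer_id and d <= anchor_day]
    label = sum(
        (p for c, d, p in rows
         if c == customer_id and anchor_day < d <= anchor_day + horizon),
        0.0,
    )
    return history, label

# Labeled in-context examples, anchored in the past where the outcome is known.
context = [make_example(c, 15, 30, transactions) for c in ("c1", "c2")]

# The unlabeled query: everything known about c1 as of "today" (day 50).
query_history, _ = make_example("c1", 50, 30, transactions)
```

Splitting strictly at the anchor is what keeps the labels honest: the history never contains rows from the window being predicted, which is the information-leakage and time-travel problem mentioned earlier in the conversation.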

33:09

Right, right. So it's not necessarily about some universal understanding of numbers or what have you. It's about being able to identify the right relationships between numbers that it hasn't seen before, query the right examples, create the right universe, and then formulate that as the right, I guess, inference request or something.

⁠¶ RFM2 Architecture and Operation

33:35

Exactly. So there are, I would say, two aspects to this. One is, how can you take data and encode it in a domain-agnostic way, right? Because we can take any database, any set of columns. The model needs to be able to encode that in a universal way.

33:53

And now that it's been encoded, the second step is to perform in-context learning. So it means that the model in its brain needs to be able to build a model, right? There's no training. There's no backpropagation. There's no gradient. It's just a single forward pass in which the neural network kind of

34:12

in itself builds, I don't know, the model, in a sense, that gives us the accurate prediction. Right? So no training is necessary, no hyperparameter optimization is necessary, no feature engineering is necessary. All you need is a raw database and a way to specify the task.
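A toy illustration of prediction from labeled context in one forward computation, with no training loop and no gradients. This is a stand-in technique (softmax attention over the context examples, equivalent to kernel-weighted label averaging), not the architecture discussed in the conversation; the function name, the distance-based score, and the one-dimensional features are assumptions.

```python
import math
from typing import List, Tuple

def in_context_predict(context: List[Tuple[List[float], float]],
                       query: List[float],
                       temperature: float = 1.0) -> float:
    """Attend from the query to every labeled context example and
    return the attention-weighted average of their labels."""
    # Score = negative squared distance, so closer examples get more weight.
    scores = [
        -sum((a - b) ** 2 for a, b in zip(x, query)) / temperature
        for x, _ in context
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, context)) / total

# Labeled in-context examples: (features, label). Values are made up.
context = [([0.0], 0.0), ([10.0], 1.0)]

near_zero = in_context_predict(context, [0.0])
near_ten = in_context_predict(context, [10.0])
midpoint = in_context_predict(context, [5.0])
```

Swapping in a different context set changes the predictions immediately, with no parameter updated anywhere, which is the property being described: the "model" for the task lives in the forward computation over the examples.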

34:30

Does the model require some type of memory structure, blackboard or something, in order to, you know, do scratch work to come up with a representation, or is this all like thought traces or something like that?

34:47

No, no, no, this is not an agent.

34:50

This is...

34:51

a single forward pass of a transformer-like neural network, right? So this is purely inside the neural network. There is no agent, there is no memory, there is no scratch pad, there is no "let me do this, let me do that", right? The answer is truly a single forward pass of a neural network. There is no loop, nothing like that, right? So you get the answer in, I don't know, 0.2 seconds, half a second, whatever the time may be, right? It's really

35:19

a single forward pass of a pre-trained, frozen neural network. There is no language model here, right? This is kind of technology that's parallel or complementary to language models, right? You cannot textify a database and then go to ChatGPT and say, hey, what do you think? How likely is this transaction to be fraudulent? You get horrible results, right? So

35:44

This is...

35:46

a frozen pre-trained architecture that allows you to do that.

35:51

I feel like I've gone the full cycle from that's an outlandish claim to oh yeah, I can see how that will work to I don't know, it's still kinda crazy that it works.

35:59

Yeah. No, but it's interesting, right? When we test this on data sets that are locked away and hidden, that the model has never been trained on, and on tasks that we haven't even thought about, we see a gain over the best supervised models. Right. If you would go and say, I'll hire a data scientist, they'll spend several weeks building the model, tuning the model, the latest neural networks, whatever, it's still a couple of percentage points worse.

36:30

And then if you fine-tune, let's say, the foundation model on more data for the specific task, then you get to this superhuman accuracy that, you know, present manual or semi-manual or agentic solutions are just not able to attain.

36:50

That is RFM2, Kumo RFM2, the relational foundation model.

⁠¶ RFM2 vs Other Relational Transformers

36:56

You also recently published, at ICLR, the Relational Graph Transformer. Is one based on the other, or are they independent lines of research?

37:11

At Stanford we are pushing forward, in the open, new architectural improvements and understandings, as much as we can as academics. We release everything open source. We talk about everything. And then, of course, what happens inside the company is that some of these innovations that we put out also kind of diffuse inside. I would say that internally

37:40

the architecture we are using is a bit different. It's composed of two different parts. The first part is basically the encoding, or the attention mechanism, over this set of tables. And then the second part is this in-context learning type of machine. There are two papers that are relevant here. One is the Relational Graph Transformer, which we mainly use for supervised fine-tuning type tasks.

38:12

But then another paper we also published at ICLR is called the Relational Transformer. And that one actually allows for in-context learning.

⁠¶ Attention Mechanism and Context Size

38:20

So that one does attention all the way down at the level of individual cells of a database. And essentially, you have three types of attention. You have attention over a given column. So if you are interested in a cell, attending over other cells in that same column kind of gives you a sense of the distribution. Then we attend over the cells in a row, and that kind of gives you a sense of what the information in that row is.

38:49

And then we also have a graph-based attention mechanism that allows you to say, oh, this is a user and these are all their transactions. And then each transaction is a row and each row has columns. So this means that we can be attending over millions, tens of millions of cells. And the beautiful thing is that our attention mechanism, because of the graph, has much more structure, so the attention mechanism is never quadratic.

39:17

And this means we can compute much more effectively. And to do good reasoning, you really need humongous context sizes, right? Even the largest LLMs today, I don't know, go to a million tokens.
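As a rough illustration of the three attention patterns described here, the sketch below builds boolean attention masks over the cells of a toy database and counts how many cell pairs each mask allows, compared with dense all-pairs attention. This is not Kumo's or the Relational Transformer's actual implementation; the table names, columns, and helper functions are all hypothetical and chosen only to show why structured masks stay well below the quadratic pair count.

```python
# Toy database: one user row and two transaction rows that reference that user.
# Each cell is identified as (table, row_id, column). All names are made up.
cells = [
    ("users", 0, "age"), ("users", 0, "city"),
    ("tx", 0, "amount"), ("tx", 0, "merchant"),
    ("tx", 1, "amount"), ("tx", 1, "merchant"),
]

# Foreign-key edges between rows, stored as ((table, row_id), (table, row_id)).
fk_edges = {(("tx", 0), ("users", 0)), (("tx", 1), ("users", 0))}


def same_column(a, b):
    """Column attention: cells in the same column of the same table."""
    return a[0] == b[0] and a[2] == b[2]


def same_row(a, b):
    """Row attention: cells that belong to the same row."""
    return a[0] == b[0] and a[1] == b[1]


def linked_rows(a, b):
    """Graph attention: cells whose rows are joined by a foreign key."""
    row_a, row_b = (a[0], a[1]), (b[0], b[1])
    return (row_a, row_b) in fk_edges or (row_b, row_a) in fk_edges


def build_mask(predicate):
    """Boolean matrix: mask[i][j] is True if cell i may attend to cell j."""
    return [[predicate(a, b) for b in cells] for a in cells]


def count_allowed(mask):
    """Number of (query, key) cell pairs the mask permits."""
    return sum(sum(row) for row in mask)


column_pairs = count_allowed(build_mask(same_column))
row_pairs = count_allowed(build_mask(same_row))
graph_pairs = count_allowed(build_mask(linked_rows))
dense_pairs = len(cells) ** 2  # what unstructured all-pairs attention would cost

print(column_pairs, row_pairs, graph_pairs, dense_pairs)
```

In this toy case each structured mask touches only a fraction of the dense pairs, and the gap widens as tables grow, since a cell only ever attends to its own column, its own row, or rows reachable through a key.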

39:35

So I was going to ask: are there data requirements or shapes or use cases that

⁠¶ Data Requirements and Cold Start Benefits

39:50

this, you know, works well for or conversely doesn't work well for. It sounds like part of that is size. Like you need a lot of data in order for this to work. Is that fair?

40:01

I would actually maybe push back on that a bit. Because the model is pre-trained, it can do amazing things where you have very little data. Training models from scratch, yeah, requires a lot of data. But once the model is pre-trained, it kind of knows what functions appear in nature. So it means that you can give it a few examples and it's going to give you very accurate predictions. More accurate predictions than some supervised model that you have to...

40:33

I think I picked that idea up based on you saying that the context that you work with is typically large. You're saying that when you have a lot of data available, you can use it, but you don't necessarily need it.

40:48

Exactly. And then if you have a lot of data, you can either increase the context size, and by increasing the context size you get more accurate predictions. Or if you are saying, "Oh, I'm doing fraud," you can just fine-tune your model for fraud, in the sense that you don't even have to do in-context learning because you know your data, you know the task.

41:09

You just tune the model for that single task, and then the model can be smaller, much more efficient to run, and also more accurate, because it doesn't have to, you know, almost relearn the task every single time because you give it the... What we see works best is some mixture of pre-training and in-context examples, because the way you choose in-context examples can actually depend on what the target entity is.

41:35

So in a sense, you'd say, "Oh, if I'm predicting fraud for me," then you could say, "Oh, let me put some other Stanford professors in my in-context examples. Let me put some other Bay Area folks in here," because, you know, that's kind of the peer group, or the most useful examples from which you can learn to make accurate predictions about me being a...

41:59

I'm thinking about the line between kind of the model and the system. The system is what is constructing the in-context examples and the model is just that forward pass.

42:09

And you need both, right? I think that is important, right? Because somebody has to generate these in-context examples. You won't generate them manually.

42:17

Right. And is that part also learned or is that, you know, a kind of a formulaic graph traversal or something else?

42:26

Somewhere in between. You can do it kind of just as a graph traversal, and a bit of kind of time travel, right, to generate the forward-looking labels. But of course, how you do that and what in-context examples you generate makes all the difference.
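One way to picture the "time travel" idea is to pick a cutoff in the past, use only what was known up to that cutoff as the input, and read the label from what happened afterwards. The sketch below is a minimal illustration under that assumption, not a description of how Kumo actually builds in-context examples; the event log, the `make_example` helper, and the 30-day horizon are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical purchase log: (user_id, timestamp of a purchase).
events = [
    ("alice", datetime(2024, 1, 5)),
    ("alice", datetime(2024, 2, 20)),
    ("bob", datetime(2024, 1, 10)),
]


def make_example(user, cutoff, horizon):
    """Build one labeled example by 'travelling back' to `cutoff`.

    Input side: only events at or before the cutoff are visible.
    Label side: did the user purchase within `horizon` after the cutoff?
    """
    history = [t for u, t in events if u == user and t <= cutoff]
    label = any(
        u == user and cutoff < t <= cutoff + horizon for u, t in events
    )
    return {
        "user": user,
        "cutoff": cutoff,
        "n_past_events": len(history),
        "label": label,
    }


cutoff = datetime(2024, 2, 1)
horizon = timedelta(days=30)
examples = [make_example(u, cutoff, horizon) for u in ("alice", "bob")]

for ex in examples:
    print(ex["user"], ex["n_past_events"], ex["label"])
```

Because the label window starts strictly after the cutoff, nothing from the future leaks into the input side of an example.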

42:52

And so, speaking of performance, you talked a little bit about some of the challenges with collecting benchmarks, but how do you find performance relative to those benchmarks and also, you know, more importantly, in the real world?

43:11

Yeah. So I can say, right, we have a white paper on Kumo RFM2 that people can read, with a bunch of different benchmarks. What we see is that the foundation model by itself improves the state of the art over all supervised models ever published on these benchmarks, right? So the baseline is very high. It's like

43:36

"Just build the best model you can and see how high you can get." The foundation model improves on that by, I think, about five percent in relative accuracy. And then if you further tune the model, meaning if you would fine-tune it and do some gradient-based updates, then the performance goes to 12% over the state of the art. And those are quite sizable gains, especially if you think about putting this in production in

44:05

recommender systems or fraud detection where, you know, every single-digit increase in accuracy can mean millions, tens of millions in business. Maybe the second thing I would say is that where we see these methods also shine is with noisy and incomplete data, cold start problems. Because of the relationships, because of the relational structure, the model is able to much better kind of hone in and be much more robust

44:37

to data missingness, data corruption and things like that. So we've also done quite a lot of analysis around understanding how this performs on real-world data: sparse data, small amounts of data, noise, incompleteness, irrelevant columns and the like.

44:58

Mm. And when you mention cold start, that suggests, "Hey, I want to start identifying fraudulent transactions, but I have no labels. I just have a bunch of data. Can you tell me where I should start looking?" Does it work for that kind of problem?

45:14

Yeah, maybe I should clarify what cold start means. Usually cold start would mean when a new user shows up, a new product shows up, right? So you still need to have some historical labels and not... You still need some historical labels, but usually, you know, prediction is easy once you have a lot of data about a given user or a lot of data about a given product. But when the product is fresh, or when the user is fresh,

45:39

you are data poor. That's what is technically called the cold start problem. So I still need historical labels, but to make reliable predictions, I don't need much data.

⁠¶ Real-World RFM2 Deployments

45:51

I think I saw somewhere that the system is deployed at places like DoorDash and others. Can you talk a little bit about the process for deploying it?

46:03

Yeah, great question. So the system, the platform, we can deploy it in many different ways. We can, you know, run it as SaaS, basically as a compute platform. We can deploy it in people's private clouds, inside what we call virtual private clouds, so all the data stays with the customer.

46:28

There's, I would say, a bunch of different deployments depending on what organizations like and prefer. And then, in terms of, let's say, deployments or use cases, at DoorDash it's restaurant recommendations and the notification system: which user gets what notification at what time of day, and things like that. And we've seen, you know, revenue impact in the hundreds of millions of dollars. Another great client we work with is Reddit.

46:58

Advertising models on Reddit are built on top of, or are built with, Kumo. There was nearly a double-digit increase in ad click-through rates. So basically the revenue, yeah, it's unbelievable. Usually an entire team, you know, increases that accuracy by maybe 1% year over year, right? Because click-through rate is...

47:24

This is your original point about, you know, domain expertise and manual features. You would imagine that they've been working on this for a long time and they've kind of squeezed a lot of the juice out of that lemon, but, you know, here comes the machine.

47:39

Yeah, and it's actually interesting. We have a great collaboration and a great relationship with the Reddit team, and they are amazingly sophisticated. And of course, they built their own super-optimized, feature-engineered pipeline. And then the way we do it there, actually, is that we said, "Okay, let's take your data, represent it as a graph, and let's create embeddings for users, subreddits, ads and things like that." So now these embeddings actually get

48:07

appended to their own features. Right? And even with that, there was a huge increase in the click-through rate, because the signal that the neural network learned was kind of complementary to what the human feature engineering already captured. So actually the model that is in production combines the neural network embeddings, the graph embeddings by Kumo, with the manual feature engineering.

48:37

So that's been a great collaboration. So it's ad recommendation, click-through rate prediction, if you want to think of it that way.

48:45

Do you know, or have you looked at, whether their manual features really make a difference? Is that a feel-good thing, like you left them in there because they had them? Or do they provide, you know, lift that's been measured?

49:00

Ah, good question. I don't think we tried turning those off yet. But it's a very interesting question. Sometimes you still want to have those features in there, maybe not for the model accuracy, but because you have so many business rules. You know, these advertising systems, they're not just

49:24

pure optimization plays. There are so many kind of other business rules that need to trigger for the ad to be actually shown to the user. So you kind of need those signals sometimes to be able to trigger business rules.

⁠¶ Explainability and Business Rules

49:38

And I wonder, in this case with those hand-engineered features, and more broadly with RFM, you know, what kind of explainability story there might be. That's another reason why people like XGBoost: those trees are fairly interpretable, and that's been a challenge with, you know, transformer-based networks.

50:03

Yeah, that's a great point. Actually, I would say we do explainability really well. Our models, I would say, are even more explainable than these tree-based models, because in tree-based models all you can get is a ranked list of features. Right? So you can only explain predictions with the features you engineered. What we can do is we can do this and

50:27

Those features might be, you know, wrong understandings of the data or incomplete. Yeah.

50:32

Exactly. So what we can do, because the model is fully differentiable, is basically run the model backwards and see what tables, what columns, what cells the model is attending over. And we get this structure-based explanation, but then we use a large language model to say, "Here's where the attention is, this is what the columns are, and this is what their semantics are," and then we generate a text-based explanation that is, like, super readable.

51:01

Explanation of a saliency map of the data or something like that.

51:04

Think of it maybe that way, right? But the LLM is kind of enriching it with all the human world background knowledge and so on, so it gets very actionable.
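To make "run the model backwards" concrete, here is a minimal sketch of a gradient-times-input saliency score on a tiny differentiable model. It stands in for the idea only: the real system is a transformer over database cells, whereas this uses a three-input logistic model with made-up weights, inputs, and column names so that the gradient can be written out by hand.

```python
import math

# Hypothetical learned weights and one input example, keyed by "table.column".
weights = {"tx.amount": 2.0, "tx.hour": 0.1, "users.age": -0.5}
inputs = {"tx.amount": 1.5, "tx.hour": 0.2, "users.age": 0.4}


def predict(x):
    """Logistic model: probability = sigmoid(sum of weight * input)."""
    z = sum(weights[k] * x[k] for k in weights)
    return 1.0 / (1.0 + math.exp(-z))


def saliency(x):
    """Gradient-times-input attribution for each input cell.

    For a sigmoid output p, d p / d x_k = p * (1 - p) * w_k, so the
    score for cell k is |p * (1 - p) * w_k * x_k|.
    """
    p = predict(x)
    return {k: abs(p * (1.0 - p) * weights[k] * x[k]) for k in weights}


scores = saliency(inputs)
ranked = sorted(scores, key=scores.get, reverse=True)

# The ranked cells are the kind of structured signal that could then be
# handed to a language model to phrase as a readable explanation.
print(ranked)
```

Unlike a feature-importance list from a tree model, the scores here attach to raw cells (table and column), which is what makes a structure-based explanation possible.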

51:14

So you were talking about use cases. We were talking about the recommendations.

51:20

One last one, which is around fraud. We've seen great results with fraud. Here we've been partnering with an amazing team at Coinbase. So we have these models running in production at Coinbase on the entire Bitcoin blockchain network.

51:37

Right. So we can also scale to, you know, the size of the entire Reddit, to the size of the entire Bitcoin or Coinbase. These methods really scale. But then, you know, some clients, let's say Databricks and Snowflake, are using us to run their sales models, predicting what the customer is going to buy next, which customer is going to convert into a paying customer, and this allows

52:06

them to optimize their sales team, right? And if you think about sales team data, that data is smaller, because the sales teams are, you know, hundreds, maybe, you know, a few thousand people. So you can do well on small data as well.

⁠¶ Cost, Limitations, and Best Use Cases

52:22

You know, there are aspects of it that sound free-lunchy. Like, how am I paying for my lunch? What's the, you know... Yeah, maybe I'll leave it there for you to answer.

52:35

Yeah, I mean, in the end what is being paid for is compute. Machine learning, if you think about it, is really CPU compute. Right? The majority of the compute that happens, except maybe the final neural network training, happens on the CPU. What we are doing is taking that workload from the CPU to the GPU. So now the amount of GPU compute is larger, because it's computed over the raw data, not over the summaries generated on the CPU.

53:11

And that's, I would say, what the cost is in the end. Of course, these models are not in the trillions of parameters. They are billion-parameter type ones. They can be quite small, so they're actually quite efficient and cheap to run. That matters because the reason we are making predictions is to make decisions.

53:39

Right? A recommender system is making decisions about what to show to every user. So the speed of those decisions is, you know, tens of thousands, hundreds of thousands, millions of times per second. So performance and cost matter.

53:55

Yeah. I think I was also trying to get at limitations. If you had someone come to you with a problem that was in fact multi-tabular and relational, what might be some reasons why you would ultimately tell them that it's probably not a good fit?

54:18

We know that if we use our technology, we'll be at least on par with or better than what is already there. Right. Now, where do we see bottlenecks? Usually we see bottlenecks in actually getting the value out, like connecting those predictions to some decision-making downstream business process so that the value can be realized.

54:40

That's been, I think, the biggest bottleneck, right, in the sense that models are built, developed, they work great, but then engineering teams need to hook them up to actually, you know, surface those predictions or make decisions

54:53

based on those predictions. So maybe that's one case. Another is that where we shine, or where the technology shines, is in these well-defined predictive type problems that can be mathematically well formulated and optimized.

55:16

If it's more about, "Hey, we wanna understand the patterns, we wanna understand what happened in the past," that is much more, you know, this kind of traditional data analytics type of thing, or some pattern detection type of thing, and that is maybe not a good use case for our platform and what we discussed. So that's what I would say.

55:53

And you're not gonna solve the traditional data science problem in organizations of, "You've got a model, how do you use it?" That's still going to exist.

56:05

Exactly. There is still the problem of, now that we have the model, how are we pushing that to... I wouldn't say to production. That's easy. The question is how do we connect it with the downstream app or the downstream system so that somebody is actually acting on this prediction.

⁠¶ RFM2 for Agentic Systems and Future

56:25

Right. And where we also see, I would say, a lot of traction recently is in agentic work. Because agents need to make decisions, to take actions. And now, you can make decisions based on this, you know, LLM-based common sense. But the best way to make decisions is to estimate or predict their downstream effect.

56:51

So now I'm envisioning this model sitting behind a tool interface that an agent can call, you know, when it needs to make a prediction about the data.

57:01

Exactly, exactly. Right. And even if you think about, let's say, a customer support agent. You call in, I need to estimate what your lifetime value is, how likely you are to churn. I will respond differently. What's the best offer for me to give you? I need to actually ask a counterfactual question: if I make you this offer, how much happier will that make you? And these are all predictive problems. I cannot just hallucinate them or, you know, ask ChatGPT. It will do something reasonable.

57:29

Right? Something, I would say, common-sensy, but that's far away from optimal. So these predictions, this reasoning over this structured relational data that captures the patterns and behavior of, let's say, customers inside the organization, is crucial to make accurate decisions. And as we are deploying these agents, we cannot be building separate models for each of these and pre-anticipate the questions. The beauty of the foundation model is that you can ask any...

58:02

And one thing I wanna say here, just to show how big the problem is, right? Think of, let's say, SAP, right? SAP has, I think, I don't know, seventy, a hundred thousand customers.

58:15

Each of their customers has structured data because it's an organization. Every one of them changes the schema a bit, so everyone has their own data, their own schema, and every one of those wants to do churn prediction. But every one of them has a slightly different definition of what churn is.

58:37

So now, can you hire seventy thousand data scientists who are going to build a per-client churn model with the client's data and the client's specification of what churn means? You can't. With a foundation model, an agent can just ask: "Predict me the probability of churn under this definition, on this data," and you get the answer half a second later.

⁠¶ Post-Training and Agent-Friendly APIs

59:02

That raises a question for me around post-training the model or fine-tuning the model. Is there any value to, like, intermediate fine-tuning? I think what I'm envisioning is, you partner with SAP. SAP has, you know, hundreds of these modules. There's a supply chain module and there's, you know, a churn module and some other thing. Does it make sense to tune on, you know, the use case,

59:41

separate from the individual customers' data? Or does the breadth of the foundation model already capture all of the information, you know, at that use-case level of abstraction, and really you're only improving it if you're looking at a specific customer's data?

01:00:01

That's a great question. Actually, that's something we are looking deeply into right now. I would say there are several reasons to post-train. One reason you would want to post-train, even in the in-context learning scenario, is to better learn the distribution of the underlying data, to better understand the distribution of the underlying data.

01:00:24

So that the data gets better encoded and prediction later will be more accurate. So that's one reason you would want to, let's say, train even in a task-agnostic way over the underlying data: to better capture distributions, to better learn. Another reason why you would want to fine-tune is for cost reasons. Because if I fine-tune for a specific task, I don't need to do ICL.

01:00:52

Right, because now my context is much smaller. I don't need to bring in the labeled data, I just bring in the entity I wanna predict on. Now the attention is smaller, it's faster, it's cheaper to run if I have large amounts of data

01:01:10

that, you know, the model can learn a lot from. So I would say there is a spectrum, there is a continuum of what you can do and why, and the benefits are kind of different depending on where on this continuum you are sitting.

01:01:25

So what's next?

01:01:27

Yeah, what's next? We are very excited about agents, basically surfacing these to agents as tools. The second thing is that right now, the coding agents are out there. But what we see is that coding agents require a proper abstraction and a proper infrastructure to be able to be effective.

01:01:52

Right, and for example, you could say, "Hey, why don't I just, you know, give this modeling task to Claude Code and Claude Code will build the model for me, so, you know, what's the big deal?" And when we do this, what we see is that these models write thousands of lines of code, but there are these, like, super subtle data science mistakes.

01:02:17

So for example, we've done this together with Expedia, and, you know, it was account-level fraud. And one mistake, for example, that the agent made was that when it created features for a given account, it aggregated the transactions till midnight, not till the current time.

01:02:44

Right. So it said, "Oh, today is, I don't know, April thirtieth, so we'll use the data up to midnight of April thirtieth," not actually saying, "Hey, it's actually 10 AM on April thirtieth. We can only use data up to here." Right? So that's an information leak. That was one little mistake. Another mistake it made was that, you know, it did it at the transaction level instead of the account level. And these are these subtle mistakes where you really need the human...

01:03:10

But if you give it this higher-level, Kumo-like API, then it's able to do the same work in about 50 lines of code. No mistakes.
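The midnight-versus-current-time mistake described above can be reproduced in a few lines. The sketch below is illustrative only and is not the code from the Expedia project; the transaction records, the year, and the `total_spend` helper are hypothetical. It computes the same account feature twice, once with an end-of-day cutoff and once with the actual prediction time.

```python
from datetime import datetime

# Hypothetical account transactions: (account_id, timestamp, amount).
transactions = [
    ("acct_1", datetime(2024, 4, 30, 8, 0), 20.0),
    ("acct_1", datetime(2024, 4, 30, 15, 0), 500.0),  # after prediction time
]


def total_spend(account, as_of):
    """Sum of amounts for `account` using only transactions up to `as_of`."""
    return sum(
        amount
        for acct, ts, amount in transactions
        if acct == account and ts <= as_of
    )


prediction_time = datetime(2024, 4, 30, 10, 0)   # when the score is needed
end_of_day = datetime(2024, 4, 30, 23, 59, 59)   # the leaky cutoff

leaky_feature = total_spend("acct_1", end_of_day)         # sees the 3 PM charge
correct_feature = total_spend("acct_1", prediction_time)  # sees only the past

print(leaky_feature, correct_feature)
```

The leaky version looks better in offline evaluation precisely because it peeks at the afternoon transaction, which is information that would not exist at 10 AM in production.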

01:03:23

The task in this case is to do what? I thought the task that you were describing was to code up something like what Kumo's trying to do.

01:03:35

The task is: build me an account-level fraud detection model over this data.

01:03:42

And so what you're proposing is, as opposed to the agent trying to code it up from scratch, you create some kind of skill or something like that that teaches it how to use Kumo to get the same

01:03:57

Yeah, or what I'm saying is that agents, you know, can autonomously make maybe two steps, but not a hundred steps. So now when I ask it for a task, I can say, "Hey, here's PyTorch, go build me the model." That, you know, takes a thousand lines of code to build a model with Python. I could say, "Here's XGBoost, build me a model." That takes engineering features and so on. That takes about 500 lines.

01:04:31

Or I can say, "Using the Kumo API, go build me the model." That only takes fifty lines. And now, if you think of this analogy of steps, you know, fifty lines of code is maybe like two steps. Five hundred lines of code is twenty steps. Right, and that's a lot, navigating, I don't know, twenty steps. I think the observation is more general: for agents to be effective, they need APIs that are agent-friendly.

01:05:16

Awesome. Well, Jure, thank you so much for jumping on and catching us up on what you and Kumo are up to. Super cool stuff.

01:05:27

Yeah, thank you so much for the conversation. Very insightful.

01:05:31

Awesome. Thanks so much.

01:05:32