
Build Intelligent Applications Faster With RelationalAI

Dec 31, 2023 · 58 min · Ep. 26

Episode description

Summary
Building machine learning systems and other intelligent applications are a complex undertaking. This often requires retrieving data from a warehouse engine, adding an extra barrier to every workflow. The RelationalAI engine was built as a co-processor for your data warehouse that adds a greater degree of flexibility in the representation and analysis of the underlying information, simplifying the work involved. In this episode CEO Molham Aref explains how RelationalAI is designed, the capabilities that it adds to your data clouds, and how you can start using it to build more sophisticated applications on your data.
Announcements
  • Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
  • Your host is Tobias Macey and today I'm interviewing Molham Aref about RelationalAI and the principles behind it for powering intelligent applications
Interview
  • Introduction
  • How did you get involved in machine learning?
  • Can you describe what RelationalAI is and the story behind it? 
    • On your site you call your product an "AI Co-processor". Can you explain what you mean by that phrase?
  • What are the primary use cases that you address with the RelationalAI product? 
    • What are the types of solutions that teams might build to address those problems in the absence of something like the RelationalAI engine?
  • Can you describe the system design of RelationalAI? 
    • How have the design and goals of the platform changed since you first started working on it?
  • For someone who is using RelationalAI to address a business need, what does the onboarding and implementation workflow look like?
  • What is your design philosophy for identifying the balance between automating the implementation of certain categories of application (e.g. NER) vs. providing building blocks and letting teams assemble them on their own?
  • What are the data modeling paradigms that teams should be aware of to make the best use of the RKGS platform and Rel language?
  • What are the aspects of customer education that you find yourself spending the most time on?
  • What are some of the most under-utilized or misunderstood capabilities of the RelationalAI platform that you think deserve more attention?
  • What are the most interesting, innovative, or unexpected ways that you have seen the RelationalAI product used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on RelationalAI?
  • When is RelationalAI the wrong choice?
  • What do you have planned for the future of RelationalAI?
Contact Info
Parting Question
  • From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

Transcript

Unknown

Hello, and welcome to the Machine Learning Podcast. The podcast about going from idea to delivery with machine learning. Your host is Tobias Macey, and today, I'm interviewing Molham Aref about RelationalAI and the principles behind it for powering intelligent applications. So, Molham, can you start by introducing yourself? Hi, Tobias. Thanks for having me. Yeah. I'm Molham. I'm CEO and founder of RelationalAI.

We are building an extension to platforms and systems like Snowflake that allows them to support AI-style workloads that they don't normally support. I live in California, and I'm an engineer by education. I've been doing AI and machine learning related things under different names for a little over 30 years now, going back to the early nineties when I worked on an earlier generation of neural networks. Yeah. It's definitely interesting the,

shifts that we've gone through in our phrasing around these things when the base elements are the same. I remember when I first started hearing about this, it was all about data mining, and then it became, you know, machine learning. And now everybody just talks about AI, which is the output of the machine learning, but we just wanna ignore that part of it.

Yeah. The company I worked for in the early nineties had a trademark on the term database mining, and so people sort of evolved to using the term data mining. It's funny that you bring that up. And back then, AI had such a negative connotation, because we had come out of an AI winter, that some people called it computational intelligence, not artificial intelligence. So And do you remember how you first got started working in machine learning?

Yeah. I mean, I've always been interested in teaching computers to do things that, you know, approximated some aspect of what I do or what people do. So it's an interest from an early age. But I remember taking a course in graduate school at Georgia Tech that was about neural networks. And I also took lots of classes in signal processing, so that was sort of an early exposure. And then I got a summer internship with AT&T working on computer vision systems.

And I was hooked. At the end of the summer, they offered me a full time job, and I sort of gave up my academic aspirations and joined to work on computer vision systems in the early nineties. This is long before deep learning, and this is when you were still doing feature engineering by hand and all that stuff. But, yeah, it was too much fun not to do full time.

And now bringing us up to what you're building today with RelationalAI, can you give a bit of an overview about what that is and some of the story behind how it came to be and why this is where you wanna spend your time and energy? Yeah. So, you know, my interest in the area started with the models. You know? How do you teach a computer how to recognize something or see something? Or how do you predict something before it happens and so on?

But it becomes pretty obvious pretty quickly that the model and the model building is actually a small part of the overall solution that you have to build to deploy that model. So, for example, I worked for a company called HNC Software in the early and mid nineties, and I think they were one of the first companies to make money off neural networks by

using them to build credit card fraud detection solutions. Okay? And so the model, of course, that predicted whether a transaction was fraud or not was a very important component, but there was so much more to it. There's a lot of data management that has to happen in order for you to curate the data that shows you examples of fraudulent and nonfraudulent credit card transactions. There's a lot of infrastructure you have to put in place to score the transaction and do it in real time.

The way you build a model has to accommodate the scoring criteria. Right? Because you can't afford to process a terabyte of data when you score, you know, a million credit card transactions an hour or something like that. Right? So once you get a score, what you do with that score plugs into some workflow that the credit card issuer has to feel good about and that supports their business strategy.

So I quickly realized that you have to combine a variety of technologies together to put together an intelligent application, you know, transactional data management technology, analytical data management technology, planning technology. You have to combine programming languages and capture business logic and workflow and all this stuff, and it was really, at some level, awful, soul crushing work to

do all of that stuff. And I've spent the last 30 years thinking about how to simplify that and how to make it so much easier to go from model creation to deploying it in real world contexts. So And then also when I was going through your site and getting ready for this interview, one of the phrases that jumped out at me is this idea of the AI coprocessor.

And you mentioned that relational AI is designed to work in conjunction with something like Snowflake, but I'm wondering if you can just unpack that phrase for me a bit and explain what it is that you're trying to convey with that terminology. Yeah. So it's obviously a a, you know, an analogy to coprocessors, like hardware coprocessors that, you know, live on the motherboard of your device, for example.

Back in the eighties, you used to have coprocessors for doing floating point arithmetic, for example. Today, you have GPUs that sit on the same motherboard as the CPU, and the CPU can offload certain computations to the GPU because it's much more effective at doing them. So if you wanna do graphics or gaming, or if you wanna do machine learning or anything like that, you can do it on a CPU. It's just much more expensive, much slower. So what we're doing here is, by analogy,

we have a a database coprocessor. It's a software coprocessor to database platforms and data clouds like Snowflake where we help them do things that they're not designed to do. So for example, prescriptive analytics, you know, working with solvers like linear programming, integer programming solvers, or simulation or certain types of machine learning or graph analytics

or rule based reasoning. So a combination of techniques that are symbolic and then probabilistic and statistical that we know we have to use to build real world enterprise applications. Snowflake doesn't do do those things,

and they will tell you that they don't do those things. And the alternative to working with a solution like ours is to copy data out of Snowflake, put it in a point solution that does the graphs or that does the prescriptive analytics or the predictive analytics, have it operate there, and then you bring the results back in. That's problematic in a lot of ways. And that kind of glue that you have to

use to do that is what creates the soul crushing work that I told you about earlier. So a coprocessor is embedded in Snowflake. It runs inside the security perimeter. It respects the same governance machinery that you have in place, and it eliminates the need for manual data synchronization because the data structures are automatically synchronized with the data in Snowflake. So we organize data in the form of a knowledge graph as a set of materialized views on the tables in Snowflake.

The other criterion for being a coprocessor is that it's cloud native, architected in the same way as Snowflake, separating storage from compute, which means you get time travel and versioning, zero-copy cloning, workload isolation. You get consumption pricing, which is very important for people who use cloud computing and systems like Snowflake. And then the third criterion for being a coprocessor is that we implement the same paradigm.

Meaning, you know, Snowflake organizes things using the relational paradigm, and we're a coprocessor that supports the workloads I mentioned relationally as well. So you don't have this impedance mismatch where you have to translate data structured in the form of relations and tables into navigational graphs or tensors or some procedural abstraction and so on. So by being embedded, by being cloud native, by being relational, it just eliminates so much friction

from, you know, deploying AI and building intelligence into applications and building intelligent applications in general. So That's another interesting phrase to unpack that gets used in various contexts is the idea of an intelligent application where to some measure software in general is intelligence because it is being embedded with the business rules that are, you know, painstakingly encoded by the software engineers.

But, typically, that idea of intelligent applications implies some level of autonomy and self learning. I'm curious if you can give some color as to how you think about the situations in which that phrase is applicable, and what are some of the ways that maybe you've seen it misapplied? Yeah. So it's really interesting that you make the connection here, Tobias, to the fact that application logic is is, in some sense, codifying some,

domain knowledge, background knowledge that's usually in the head of a business person or a developer. Right? So that is very sort of, like, a classic symbolic representation of knowledge and expertise. Now you can do that. You can represent that procedurally

as is the norm these days. Your application logic is written usually in Java or C# or COBOL or some procedural language like that. You can also represent it declaratively. Okay? And so SQL is in a sense a step in that direction because you basically can ask a question

without necessarily telling the computer how to answer that question. It sort of figures out an execution strategy for answering the question. It optimizes it. It parallelizes it. It, you know, takes care of out-of-core memory concerns and all that kind of stuff. So

but there are many more ways of representing knowledge. You can represent knowledge statistically. You can represent that knowledge in the form of parameters to a neural network. Okay? So it's sort of a way of taking lots and lots of examples of some phenomena you're studying and then compressing them into a much smaller dataset that's the parameters of the model.

Even these very large language models, I think one of the abstractions that you keep, you know, reading about is that they're really compressing the data that they've been trained with. Okay? And so that compression is a way of abstracting over detail and capturing the essence of some, you know, some fundamental aspect of the underlying phenomena.

So, yeah, in that sense, you can argue that all applications have some form of either symbolic or statistical intelligence. But really what I'm talking about, and I think what the world means when they talk about intelligent applications, are applications that are helping you predict some outcome, helping you, see into the future a little bit and knowing, for example, that something is likely to be fraudulent or knowing that some user is likely to prefer a certain, you know, product

or, knowing, you know, how much someone will pay for something. And then take action, you know, using maybe even prescriptive intelligence, prescriptive analytics, take action to maximize the value to that person and to the business serving the person. Okay?

So, you know, usually, or historically at least, you know, many, many types of apps like ERP apps and HCM apps and so on, they're just sort of getting input from a human, and the human has to predict and decide what the best course of action is here.

But when you can make the application sort of help you with a decision by giving you plausible predictions and giving you plausible recommendations as to what to do about them, then that makes the application more intelligent, because it's taking on some of that cognitive burden that you would have as a user,

otherwise. So And for people who are employing RelationalAI in their business contexts, what are some of the primary use cases that you're focused on solving for them and the reasons that they would turn to RelationalAI in place of or in addition to other solutions that they may already be using?

Yeah. So, yeah, thank you for that question. So, again, if you're not using RelationalAI or something like RelationalAI, you're typically having to resort to stitching together a lot of point solutions that specialize

in these various, you know, workloads that you need to put together to build intelligence into an application. So specifically, again, graph analytics, rule based reasoning, prescriptive analytics like integer programming, linear programming, simulation and probabilistic programming, machine learning methods in a variety of flavors, and language models. Okay?

And so we have clients, for example, that use graph analytics to build more sophisticated features that they can feed into fraud models and improve the predictive accuracy of those fraud models, saving them hundreds of millions of dollars. Okay? We have clients that are capturing semantics that go with their data. They're moving all their data from, you know, thousands of databases,

that, you know, support thousands and tens of thousands of applications. They're moving it into platforms like Snowflake, but,

they can't move the business logic because the business logic is not, you know, relational, not in SQL. And so they use us to capture the semantics declaratively so that they can, you know, understand the relationships between various data silos and navigate these data silos in ways that they wouldn't otherwise be able to navigate, again, uncovering hundreds of millions of dollars of value in the process.

We have clients that are replacing legacy applications that were developed, you know, in the traditional way that you described, you know, hundreds of thousands of lines of procedural code, and representing that background knowledge declaratively and then reasoning over that knowledge in a way that, you know, reduces the amount of code by a factor of 30, 40, 50,

making it, you know, more accessible to business users, making it more scalable, making it, you know, higher quality and more adaptable and and so on. We have clients that are starting to combine what we do with knowledge graphs

and language models to help language models answer questions more accurately and more effectively. So you might have seen folks, for example, that use a language model, ask it a question, and then generate a SQL query that can run on data living in a SQL database like Snowflake. Well, if the data models are really trivial and the question is really simple,

usually, you don't need to help the language model that much. It can usually kinda give you the SQL query that gives you the answer. However, as the questions get more complicated or the data models get more complicated in the real world, you know, we see customers with hundreds of millions of columns of information in Snowflake.
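As a rough illustration of the text-to-SQL pattern just described (plain Python; the catalog, table and column names, and the crude keyword-matching "semantic layer" are all invented for the example, and the actual LLM call is left out), the sketch below shows why scale changes the problem: with a few tables you can paste the whole schema into the prompt, but once you have millions of columns something has to pick the relevant slice first.

```python
# Toy sketch (not RelationalAI's implementation): narrowing schema context
# before prompting a language model for SQL. All names are invented.
CATALOG = {
    "orders": ["order_id", "customer_id", "amount", "order_date"],
    "customers": ["customer_id", "region", "segment"],
    "shipments": ["shipment_id", "order_id", "carrier", "ship_date"],
}

def relevant_schema(question: str) -> str:
    """Crude stand-in for a semantic layer: keep only tables whose columns
    look related to the words in the question."""
    words = set(question.lower().replace("?", "").split())
    keep = {
        table: cols
        for table, cols in CATALOG.items()
        if words & {c.split("_")[0] for c in cols} or table.rstrip("s") in words
    }
    return "\n".join(f"{t}({', '.join(cols)})" for t, cols in keep.items())

def build_prompt(question: str) -> str:
    # The prompt that would be handed to whatever LLM client you use.
    return (
        "Translate the question into SQL using only this schema:\n"
        f"{relevant_schema(question)}\n"
        f"Question: {question}"
    )

print(build_prompt("What is the total order amount by customer region?"))
```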

The language model needs something like a semantic layer or a knowledge graph to help it, you know, navigate the data. And it also needs the ability, like, if you're gonna ask a question that, you know, helps you identify groups or cliques of connected customers in your database, well, that's a graph query. That's not a query that a SQL

database can answer very easily. Or if you wanted to ask a question that says, you know, find me the shortest path from point a to point b, that might require some kind of reasoning capability or solver capability. Or if you're asking a question like, what will my sales be,

of Coca Cola product in the northeast next week? Well, that data doesn't live in Snowflake. You have to sort of infer it from the data that lives in Snowflake using maybe graph neural networks or other technologies to complete, you know, the database with information that it lacks. And,

that again can't be answered in straightforward SQL queries. So there are just so many different applications of this technology across the board for improving existing models, for creating new types of models, for replacing very complex, fragile code that tries to glue a bunch of things together,

for augmenting the power of language models. You know, I can keep going. I might have taken too much time with my answer, but there's just so many different ways you can use this. For people who aren't using RelationalAI, you mentioned some of the types of systems that they might have to cobble together. What are the, I guess, orders of magnitude of ease of use or maybe efficiency improvements that they might expect by moving to RelationalAI?

And are they able to just completely obviate whole classes of their infrastructure, or is this a situation where those other components still serve a role, but they are not in the critical path as much as they would be otherwise? I think mostly obviate. So, again, I use Snowflake because that's the platform that we launched around last June, and there, they created this category of a data cloud, and they, I think, lead that category.

We do see other systems like BigQuery, for example, in the Fortune 500 and big companies. But, generally, Snowflake is sort of the most popular platform. Yeah. So if you have Snowflake and you use us as a coprocessor for the workloads it doesn't support, then generally, you're just not having to deal with whole

classes of systems. You don't need to copy the data out and put it in a navigational graph technology if you wanna, you know, improve feature engineering with graphs. You don't need a totally separate, you know, rule-based system to build dynamic applications that have an evolving ontology and evolving business logic. You don't need a whole separate stack for, you know, integrating Gurobi or CPLEX or

a variety of other integer programming, linear programming solvers. I keep coming back to these because we don't often talk about them in the context of machine learning anymore. But this is, like, one of the original magical technologies that made it possible to do things that you could have classified as AI, and we should still probably classify as sort of important to AI. In fact, all machine learning is built

on optimization, mathematical optimization. Right? Usually with gradient descent type techniques. And so optimization is a very fundamental building block for intelligence. And supply chain networks, like, the world runs using this kind of technology. Like, no

truck or airplane or train or ship goes from point a to point b usually without being scheduled using these kinds of techniques. Today, there's just a totally different stack that you use for that, that lives outside of your primary data management platforms. What happens if it lives inside? Okay. Same for simulation. So I'm not saying that, you know, overnight, people will turn all that stuff off and use us in Snowflake. I'm saying, like, you can avoid compounding

the hairball that you have now by, you know, starting to do new work this way. And then over time, refactor the old stuff, the legacy stuff, so that it's running in a more streamlined fashion. Okay? But certainly, the more point solutions you rely on, the more energy you spend on glue between the point solutions rather than actually creating business value, and we wanna eliminate the glue.

So in terms of the actual platform that you're building, can you talk through what the system architecture looks like, some of the ways that you think about the technological underpinnings of the problem, and how you've approached that architectural design element? Yeah. So, again, thank you for that question. So one of the really interesting

things about what we do is we do all of this stuff in a relational context. Right? And so in my experience, you know, being maybe slightly older than most of your audience here, if you step back and you look at our industry over the last 40, 50 years, going back to, you know, the seventies when the relational model was first introduced,

usually the relational model is not taken very seriously for the important workload of the day. So in the seventies and eighties, it was transactional systems, so OLTP, and, you know, real engineers built transactional systems using navigational database technology, and they wrote COBOL and they followed pointers around to do transaction processing. And the relational model was dismissed in a variety of ways, including, hey, it's never gonna perform. It's too slow. And

also including that, you know, real engineers don't wanna understand relations, and they wanna write programs. They wanna write code. Okay. And then, of course, the database community invents join algorithms, data structures like B-trees, techniques for dealing with ACID properties and concurrency and so on, languages like SQL. And seemingly overnight,

the world switches from using navigational database technology to relational technology for building transactional systems. Okay? And companies like Oracle are created, with other players in the area, and basically no one thinks about building OLTP systems any other way now. Like, a checking system, taking $10 out of checking, putting it in savings, you'd be nuts to try to do that in a different kind of technology.

So in the nineties, the same kind of phenomenon: analytics, being able to do descriptive analytics, lots of aggregations and lots of BI type stuff. It was deemed, like, certifiably nutty to try to do that in a relational database, because the right answer was clearly multidimensional arrays, or what today we would call tensors, because that's how you get high performance, and that's how you do things. And don't waste our time with the SQL relational stuff.

And the community invented column stores, bitmap indices, vectorized query processing, and all of that stuff. The beauty of the relational model is you can separate the abstraction from how you do the work. And we went from debating, you know, MOLAP, multidimensional

OLAP, versus ROLAP to basically the only OLAP you get today is relational. Like, even if you Google MOLAP now, you don't get that many hits anymore. It's sort of been erased from the collective memory. The third sort of version or example of this is 10, 15 years ago, big data was a new workload. The relational model's dead. It's not gonna work. MapReduce and Hadoop is the answer. There were 3 big Hadoop companies, you know, Cloudera and MapR and Hortonworks,

and dozens, maybe hundreds, of companies built, you know, and funded to build analytics on top of Hadoop, because that's the way you do big data, and that's the only way you can do big data. There was one exception to that, one little company called Snowflake, that in 2012 was saying, actually, you know, the relational paradigm lets us pick a different architecture,

you know, a cloud native architecture where you separate storage and compute, and we can actually do big data without all the complexity of Hadoop and without all the procedural stuff that, you know, we have to worry about. And, you know, the CEO of Snowflake from 2014 to 2019 was Bob Muglia. Bob is a huge supporter of our company. He's a big investor and is on our board and is very active in our company.

He told me that, you know, 2 or 3 years before Snowflake had the most successful IPO of all time, he was turned down for funding, I don't remember, 24 times, 27 times, like, some stupid number of times, because people just, you know, knew Hadoop was the answer. Or, you know, like, why would Snowflake be successful in a world that has Teradata and so on? So, a very long winded setup to answer your question.

So our machinery is fundamentally relational. Okay? So we have a relational system. We organize data in the form of relations. We're using very, like, normalized data structures or schemas, so we call it graph normal form.

And, normally, with normal relational machinery built for OLTP or OLAP or big data, they wouldn't be able to support, you know, working on highly normalized data, and they wouldn't be able to support running queries that have lots of self joins, which are important for, say, graph analytics, or queries that are based on recursion, which are also important for reasoning and other things.

But my colleagues have invented new classes of join algorithms, called worst-case optimal join algorithms, that make it efficient to join together 10, 20, 30 relations. We've invented new classes of query optimization, called semantic query optimization, that use background knowledge about the domain that you have to actually make queries run faster by avoiding having to, you know, consider certain options. We've done that in a cloud native architecture.

We've done that using and relying on very sophisticated incremental view maintenance algorithms, so that if you change a record in a Snowflake table, you can update the materialized views that depend on it. Our knowledge graph that depends on it is updated efficiently. So it's very important to be able to do a lot of things in a spreadsheet-style way.
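To make the incremental view maintenance idea concrete, here is a minimal sketch (plain Python over a toy aggregate, not RelationalAI's algorithms): a change to the base table is applied to the materialized view as a delta rather than recomputing the view from scratch.

```python
# Minimal, illustrative sketch of incremental view maintenance:
# a derived view is updated from row-level deltas to the base table.
from collections import defaultdict

base = [  # base "table": (customer, amount)
    ("alice", 10.0),
    ("bob", 5.0),
    ("alice", 7.0),
]

# Materialized view: total amount per customer.
view = defaultdict(float)
for customer, amount in base:
    view[customer] += amount

def apply_delta(view, inserted=(), deleted=()):
    """Maintain the aggregate view from inserts and deletes, touching only
    the entries those rows affect."""
    for customer, amount in inserted:
        view[customer] += amount
    for customer, amount in deleted:
        view[customer] -= amount

# A single new row touches only one entry of the view.
apply_delta(view, inserted=[("bob", 3.0)])
print(dict(view))  # {'alice': 17.0, 'bob': 8.0}
```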

We've invented relational language features that make it possible to efficiently and easily express these nontrivial things that are either impossible or unwieldy to express in SQL. And we've, you know, done other types of engineering to make it so that we have a relational database engine with, obviously, these new data structures and architecture and join algorithms and so on that can just natively answer a graph query in a way that a classic system cannot. So,

you know, it's really pretty basic. It's like these 5 or 6 innovations that you can combine together in various ways to get that expressive power and that support for those workloads that we just discussed. Okay? So it's not, you know, what people expect in a sense because they've been conditioned to learn that if you wanna do graph queries, you need navigational technology. Just like in the seventies, it was like, if you wanna do OLTP, you need to follow pointers around,

and so on. So, there's some real deep, you know, science here and deep engineering that, pulled together, extends the power of the relational paradigm to these workloads that historically it hasn't handled. And as a brief side note to the history of the Hadoop wave also is that by the time everybody else was starting to use Hadoop, Google, who kicked off that whole trend, had already moved past it and said, this doesn't work. We're gonna go on to Bigtable instead.

It's too hard. They started building SQL systems. Even in the Hadoop community, it was, like, a little laughable. Like, it was at first very anti-SQL, and then they realized, you know, actually, this stuff really simplifies away a lot of complexity, and they all started building layers that made them look more and more like relational databases. But they had this sort of very complex on-premise architecture underneath.

And, you know, like, to me today, when we talk to most people, Hadoop is kind of in the same equivalence class as COBOL. It's just old legacy stuff that we have to run because, you know, we can't afford to migrate it. But also,

lots of people are now, like, in the context of sort of our customers and so on, the engineers we work with and the executives we work with, their bonuses are being attached to migrating off legacy Hadoop systems and on-premise, non-cloud-native systems to Snowflake and BigQuery and systems like that. And so, yeah, there's still a lot more

going that way, and I think those companies and those products are very well positioned in this world as people migrate away from these very complex, very expensive systems. You know? Another interesting side note is the discussion around this relational paradigm, this relational model. My understanding also of some of the historical context there is that this idea of relational algebra is what kicked off the implementation of SQL in this whole wave of databases.

But that SQL was only ever a subset of the actual relational algebra that was proposed, and also that graph systems are actually better for expressing relationships as opposed to the relational approach of SQL. I'm wondering if you can talk to some of the ways that that dichotomy manifests in your work with RelationalAI of maybe being able to extend the relational algebra beyond what is expressible in SQL?

Yeah. Exactly. We extended relational algebra with two things. Okay? Like, a multi-way join operator. So, like, in the relational algebra as defined by Codd,

the operators were either unary or binary. Okay? And stop me if this is more detail than you want. Okay? So they either operate on one relation or two relations at a time. So if you're joining together 10 things, you would pick any two of them, join them together, and then you produce a temporary. And then you would join to the third, produce a temporary, join to the fourth. Right? So the amount of work you have to do is proportional to the size of that biggest temporary.

Okay? So binary joins are fundamentally limited this way. Okay? Now when you introduce multi-way joins, worst-case optimal joins, you're joining all 10 things simultaneously. Okay? So, you know, if you wanna, like, join 10 things to count paths of length 10

with binary joins, you have to count all the paths of length 2. There are a lot of those. Then count all the paths of length 3. There are a lot of those. And keep going until you find just a few paths of length 10. Okay. So you do a lot of work that you end up throwing away because you're not interested in paths of length 2, 3, or 4. Okay? With the worst-case optimal join and with the semantic optimization, you just do work proportional to the paths of length

10, and you avoid doing all that work. So it's asymptotically faster in the same way that quicksort or merge sort is asymptotically faster than bubble sort. Okay. So the best way to get speed is not to have to do work.
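A toy way to see the difference being described (plain Python, and not the actual worst-case optimal join algorithms): counting triangles with pairwise joins first materializes every wedge, that is, every path of length 2, most of which never close, while a generic-join style enumeration binds one variable at a time by intersecting adjacency sets and never builds those dead-end intermediates.

```python
from collections import defaultdict

edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("b", "d")}

def triangles_binary_join(edges):
    # Pairwise join of the edge relation with itself: every wedge is built,
    # then most of them are thrown away by the final filter.
    wedges = [(x, y, z) for (x, y) in edges for (y2, z) in edges if y == y2]
    closed = [(x, y, z) for (x, y, z) in wedges if (x, z) in edges]
    return closed, len(wedges)

def triangles_generic_join(edges):
    # Variable-at-a-time: bind x, then y, then z by intersecting adjacency
    # sets, so no wedge that cannot close is ever materialized.
    adj = defaultdict(set)
    for x, y in edges:
        adj[x].add(y)
    out = []
    for x in list(adj):
        for y in adj[x]:
            for z in adj[x] & adj.get(y, set()):
                out.append((x, y, z))
    return out

tris, wedge_count = triangles_binary_join(edges)
print(len(tris), wedge_count, len(triangles_generic_join(edges)))  # 2 4 2
```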

You know? If you have to do the work, obviously, you wanna parallelize. You wanna accelerate with GPUs, and you want it in memory and all that kind of stuff. But the best way is just to avoid doing the work altogether. Okay? And so, by extending the relational algebra with this multi-way worst-case optimal join operator and with a fixed point operator that lets it do recursion, you make it more expressive. Because one of the knocks on relational algebra and the relational calculus

is that they didn't capture all the algorithms. They don't capture all polynomial time algorithms. Okay? And you have to add something like a fixed point operator or recursion, basically to simulate while loops, okay, to make it so that it's expressive enough to capture polynomial time algorithms. Okay? So that's really the fundamental innovation: we've, you know, taken sort of

Ted Codd's dream and said, okay, well, he couldn't go all the way. We couldn't make it end to end relational because it wasn't expressive enough. It wasn't powerful enough, and it wasn't possible to do, like, certain things in an asymptotically optimal way. But you add these two operators and you add semantic optimization, and all of a sudden, you can express a lot more, and you can just take care of programming in a much more general way.
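And here is a minimal sketch of what the fixed point operator buys you, using plain Python rather than Rel syntax: derive the transitive closure of an edge relation by reapplying a join rule until nothing new is produced.

```python
# Illustrative fixed point iteration: transitive closure of an edge relation.
edges = {("a", "b"), ("b", "c"), ("c", "d")}

def transitive_closure(edges):
    reach = set(edges)
    while True:
        # One application of the rule: reach(x, z) :- reach(x, y), edge(y, z).
        new = {(x, z) for (x, y) in reach for (y2, z) in edges if y == y2}
        if new <= reach:          # fixed point: nothing new was derived
            return reach
        reach |= new

print(sorted(transitive_closure(edges)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```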

So you can take all that procedural code that we write as application logic and now formulate it relationally. And for the first time, you can define the whole application end to end relationally. You know, like, deep learning was a big breakthrough because it defined the model end to end, differentiably.

Okay? And so you didn't have to do the symbolic feature engineering upfront and then, you know, learn the parameters in the second stage. Basically, each layer of a deep network sort of learns higher and higher abstractions and learns, you know, how to represent sort of the underlying features and so on. Well, there's a symbolic version of that, okay, which is sort of end to end relational definition of your whole application.

And once you do that, you avoid the split brain architecture. You can now optimize across the whole thing. Okay? And, you know, 10 to 100x less work. In the same way that, you know, deep learning eliminated a whole bunch of manual feature engineering work, this kind of approach eliminates a whole bunch of programming work. Okay? At least in the same way by analogy, let's say. So, yeah,

it's interesting to see it from the perspective of first principles like that. Okay? Not everybody will understand what it is I just said, but there's a really deep contribution here to the foundations of the relational model. And the relational model to me is a superset of other models. Like, you know, people will represent graphs with adjacency lists and pointers. That's one representation.

People will represent graphs with tensors or matrices, you know, like having a matrix where a 1 in a cell sort of implies a connection between two nodes. The relational abstraction, I think, is really the nicest way of representing graphs. Because as you said, we often talk about graphs as, you know, defining relationships between two nodes.

And so a relation is the ideal abstraction, I think. And what's also really nice about the relation is you can now connect three nodes or four nodes. You can model hypergraphs and hyperedges. Okay. It generalizes very nicely, you know, much more nicely than, say, a matrix representation or especially a pointer-based or adjacency-list representation. So it really is the right abstraction.

And because it lets you separate the abstraction from the implementation, you can implement it with pointers or implement it with whatever data structures work best, but the abstraction hides all that from you, and you can operate more productively as a result. And so for teams that are applying the RelationalAI platform to their problem domain, what does that onboarding and workflow look like, from saying, okay, we are going to

implement RelationalAI, now I need to solve a problem where I need to generate some AI model to reduce the time to delivery, bringing the supply chain example to bear. Yeah. This is something we, you know, do in practice today. And just to make sure your audience understands, we announced private preview of RelationalAI in June at the Snowflake Summit. Okay? And I expect that we'll be in public preview,

you know, sometime early next year and GA certainly by next Summit. Okay? So, unfortunately, this is not something where your audience can just immediately go to our website and start working with it. You'll have to get in touch with us, and we'll get you on the list, and we'll try to engage with you that way. But the way in principle it works is, you know, if you're working in Snowflake, we meet you where you are. The first step is to create mappings

from the tables that you have in Snowflake, either accepting default mappings or creating your own mappings, to a graph normal form representation. What we call a relational knowledge graph is just a graph normal form relational database. And this representation in graph normal form is gonna be implemented as a set of materialized views.
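A rough sketch of what that graph normal form mapping means conceptually (the table and column names are invented, and this is plain Python rather than the actual mapping machinery): a wide table is split into one binary relation per attribute, keyed by the row identifier.

```python
# Illustrative only: normalizing a wide table into per-attribute binary relations.
wide_orders = [
    {"order_id": 1, "customer_id": 42, "amount": 99.5, "status": "shipped"},
    {"order_id": 2, "customer_id": 42, "amount": 15.0, "status": "open"},
]

def to_graph_normal_form(rows, key):
    relations = {}
    for row in rows:
        for column, value in row.items():
            if column == key or value is None:
                continue
            # One binary relation per attribute: (key value, attribute value).
            relations.setdefault(column, set()).add((row[key], value))
    return relations

gnf = to_graph_normal_form(wide_orders, key="order_id")
print(gnf["status"])  # {(1, 'shipped'), (2, 'open')}
```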

So in Snowflake, you might be used to the idea of having different types of tables. Like, you have the base tables. You have dynamic tables, which are materialized views of a different kind. You have tables that you use for transactions, I think Unistore, I forget exactly what term they use. And so this is just a different kind of table that lets you represent the knowledge graph.

And when you have a need for doing, you know, something like the work I described, you can either, via a SQL interface, interact with the knowledge graph and get answers to graph queries, for example, like PageRank or weakly connected components or whatever your graph query is. Those can be packaged up and just sort of taken care of automatically.
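To make that kind of packaged graph query concrete, here is a minimal weakly connected components sketch over an edge relation, written as plain Python union-find purely for illustration, not how the platform executes it.

```python
# Illustrative weakly-connected-components over an edge relation (union-find).
edges = [("a", "b"), ("b", "c"), ("x", "y")]

def weakly_connected_components(edges):
    parent = {}

    def find(n):
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)           # union the two endpoints
    groups = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

print(weakly_connected_components(edges))  # [{'a', 'b', 'c'}, {'x', 'y'}]
```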

Or, instead of calling into a stored procedural language, you would call into a stored relational language called Rel that lets you express more, or more easily, what you couldn't express in SQL, to capture, for example, the properties of an integer programming model. Now what is an integer programming model? An integer programming model is usually a set of constraints and an objective function.

Well, databases have integrity constraints. They have views. So the set of constraints are just relational integrity constraints. The mapping is almost one to one. And the objective function is just a view you define that says, here's how I think about cost or risk or profit, some objective that you're trying to minimize or maximize. And then you tell the system, find me a relation, find me a table that you automatically populate with the price points that are gonna maximize my profit.
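As a hedged sketch of that correspondence (plain Python brute force rather than Rel or a real integer programming solver; the products, prices, costs, and demand curve are invented): the constraints play the role of integrity constraints, the objective is just a derived view over candidate price points, and we ask for the assignment that maximizes it.

```python
from itertools import product

products = ["widget", "gadget"]
candidate_prices = [5, 10, 15, 20]
unit_cost = {"widget": 4, "gadget": 8}
demand_at = lambda price: max(0, 100 - 4 * price)   # toy demand curve

def satisfies_constraints(prices):
    # "Integrity constraints": each price covers its unit cost,
    # and the two prices must differ.
    return all(prices[p] >= unit_cost[p] for p in products) and \
        prices["widget"] != prices["gadget"]

def profit(prices):
    # The "objective view": profit derived from the chosen price points.
    return sum((prices[p] - unit_cost[p]) * demand_at(prices[p]) for p in products)

feasible = (
    dict(zip(products, combo))
    for combo in product(candidate_prices, repeat=len(products))
)
best = max((p for p in feasible if satisfies_constraints(p)), key=profit)
print(best, profit(best))  # {'widget': 15, 'gadget': 20} 680
```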

Okay. It's a very, very natural extension. Like, the correspondence between how you think about problems in an integer programming, linear programming sense and how you think about building models in a relational sense, it's like, again, you don't have to squint very hard to see the same,

the same thing. So, again, graph queries are relational queries with self joins and recursion. You don't have to squint very hard to see the correspondence. A graph is a binary relation. A hypergraph is an n-ary relation, depending on the edges. Tensors are relations that map integers to a value. Right? Like, you think about a three-dimensional matrix or a tensor. You have i, j, k as the columns that are in the key that map to a value.
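A small illustration of the tensors-as-relations point (plain Python): a sparse three-dimensional tensor stored as a relation keyed by i, j, k, where absent tuples are implicit zeros and a slice is just a selection on one key column.

```python
# Sparse 3-D tensor as a relation with key columns (i, j, k) and a value column.
tensor_relation = {
    (0, 0, 1): 2.5,
    (1, 2, 0): -1.0,
    (3, 1, 1): 4.0,
}

def lookup(relation, i, j, k):
    return relation.get((i, j, k), 0.0)   # absent tuple == zero entry

# A slice is just a selection on one key column.
slice_k1 = {(i, j): v for (i, j, k), v in tensor_relation.items() if k == 1}
print(lookup(tensor_relation, 0, 0, 1), slice_k1)
```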

Now tensors can be dense or sparse. In the relational database, that's all abstracted away from you, and the internal representation can compress away, you know, zeros if you're sparse, for example. That's sort of the default setting in a relational database, assuming your data is sparse. And if your data is dense, you can then imagine putting data structures behind that that will compress away the i, j, k, because you don't need to store the indices when you're,

when you're dealing with dense data. Okay. So again, I don't know if this all made sense, but it's a very, very natural correspondence between the tensor world and the relational world, linear algebra and relational algebra. I think relational algebra is a superset, and relations are a superset of tensors. Same for graphs,

same for things like, you know, linear programming and integer programming. So with that concept of being able to build up these tensors as a graph representation, it brings to mind, you know, the other set of technologies that are gaining a lot of attention right now of vector databases being used as a context embedding method for being able to feed to large language models. That's what everybody wants to talk about right now.

And then there are systems like Pathway, another platform that is built as a streaming engine

where they are focused on being able to build end to end LLM applications without having a vector store. I'm wondering what are some of the ways that the system that you're building at RelationalAI maps into the broader context of MLOps, where you're thinking about things like feature stores, vector stores, model training, model serving, just some of the ways that for ML teams who are trying to reduce the surface area

of systems and problem areas that they need to be able to focus on to get their job done, what are the ways that they will be leaning on RelationalAI? Systems like Pinecone and other vector databases are very specialized technology to do basically nearest neighbor queries against these embeddings, these vectors. Okay? So,

and because they have, like, just one problem to solve, they can really, you know, focus on it, and they tend to solve it really well. And there are many different approaches or systems out there that people can pick from. But at their core,

there's a query: who are the nearest neighbors to this vector? And at the core is a vector, which is a binary relation that maps an integer, that is the index, to a value, you know, in that row of 512 or 256 or 1024 or whatever sizes embeddings are these days. So it's just super, super specialized to that. Now we're not

in a position to, you know, be as specialized for this workload. Like, you can do those types of computations in what we do, but they're not gonna be as efficient or scalable as, say, in a specialized technology, at least for the time being. So we typically will want to incorporate some kind of vector technology into the mix. For things like feature stores, though, there's a ton of value that we can add. So one of our clients implemented a feature store,

and they're very excited about it. And, you know, they discovered that lots of data scientists were putting lots of features in the feature store.

But because there wasn't an easy way to navigate that feature store, there wasn't a knowledge graph or some kind of semantic representation of what's going on in these features. A data scientist will come into the feature store, do a keyword search, not find what they want, or, even if they find something that sounds like what they want, not know how it's actually stitched together and so on.

So they would just end up rolling their own feature anyways and then putting it in the feature store. Okay. So it became, they called it, not my terminology, a feature swamp. And it just sort of accumulated more and more features. In the same way, in other databases, you know, we have customers with 180 million columns of information. Like, we know banks that have over 400 million columns of information. You can't convince me that there isn't a ton of redundancy

in 180 million columns of information. It's just different data silos developed by different people at different times, and they don't know that, you know, something has been developed before, because they don't have semantics. They can't query based on semantics, not just in the vector embedding sense, semantics in, like, here's the computation. Here's the definition of what this piece of information means. So building a knowledge graph on top of a feature store and taking each feature and being able to understand the ingredients

of that feature, like what data elements went into it, and then the recipe, how you combine those data elements, made it possible to, one, you know, avoid having data scientists give up quickly and roll their own feature, and made it possible to eliminate, obviously,

you know, a lot of redundant work, labor, but also computation. You know, people don't love the monthly cloud bill that they get. But also storage. Right? And, again, reduce complexity in a way that you couldn't if you didn't have a semantic layer or semantic understanding of what's going on in your feature store. So, anyways, that's a couple of comments on your question there.
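An illustrative sketch of that feature registry idea (the feature names, ingredient columns, and recipes are invented): when each feature is related to its ingredients and its recipe, redundant features can be found with a query instead of a keyword search.

```python
# Toy feature registry: each feature related to its ingredients and recipe.
features = {
    "cust_90d_spend":   {"ingredients": {"orders.amount", "orders.order_date"},
                         "recipe": "sum(amount) over last 90 days"},
    "cust_qtr_spend":   {"ingredients": {"orders.amount", "orders.order_date"},
                         "recipe": "sum(amount) over last 90 days"},
    "cust_num_returns": {"ingredients": {"returns.return_id"},
                         "recipe": "count(return_id)"},
}

def redundant_pairs(features):
    # Features with identical ingredients and recipe are likely duplicates.
    names = sorted(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if features[a]["ingredients"] == features[b]["ingredients"]
        and features[a]["recipe"] == features[b]["recipe"]
    ]

print(redundant_pairs(features))  # [('cust_90d_spend', 'cust_qtr_spend')]
```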

Another aspect of what you're building that I'm curious about is the design philosophy of how to strike the right balance between, I wanna make this super easy for you to onboard and be able to do complex things like named entity recognition, do that convergence of, I have, as you're saying, 400 million or however many millions of columns of data, maybe 20% of which are redundant, I wanna be able to automatically help you find that and coalesce that into a single representation

versus, here are all the primitives for you to be able to do that, and maybe here's a recipe book for some examples, and just some of the ways that you think about that design philosophy and product philosophy of what you're really driving for. Yeah. I mean, look. Before language models, we just have to acknowledge that a lot of this is, you know, fundamentally manual. Okay? You can help, you know, here and there. But I think language models have really

changed my thinking on that. It's still early days. You know, I learned about language models last December. So what is that? 10 months, 11 months? But we've been doing a lot of work to try to sort of use language models to accelerate, like, with a human in the loop still, okay, but accelerate knowledge graph creation and being able to map those 100 million or 200 million columns to a much simpler ontology with a much simpler set of relationships.

In some cases, being able to identify that, hey, this column is really just an aggregation of this other column, and it's not that hard. I can kind of figure it out from the data and from the naming convention. And so you can, you know, start eliminating. Like, if you have 100 million columns of stuff, you probably have 100,000, like, you know,

columns that are input data. Everything else is probably computed from that. You know? So language models can accelerate the development of the ontology, the semantic layer, the knowledge graph, and even the development of capturing the semantics of how something is computed from other things.
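A toy sketch of the kind of check being described (column names invented, plain Python): test whether a candidate column is just a per-key aggregation of a more granular column, which is one way redundancy can be detected from the data alone.

```python
from collections import defaultdict

order_lines = [  # granular data: (order_id, line_amount)
    (1, 10.0), (1, 5.0),
    (2, 7.5),
]
order_totals = {1: 15.0, 2: 7.5}   # candidate redundant column

def is_aggregation(lines, totals, tol=1e-9):
    """True if the candidate column equals the per-key sum of the lines."""
    sums = defaultdict(float)
    for key, amount in lines:
        sums[key] += amount
    return set(sums) == set(totals) and all(
        abs(sums[k] - totals[k]) <= tol for k in totals
    )

print(is_aggregation(order_lines, order_totals))  # True
```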

And the beautiful thing about that is once you have such an ontology, such a knowledge graph, that makes it possible to use the language model to answer questions it couldn't otherwise answer, because it's not grounded in truth, and it doesn't know how to access 100 million columns on its own. You know? So I do think there's a lot that's gonna happen here in the next 2 or 3 years to give us a handle

on this mess of data that, you know, most companies have. Like, most companies have thousands of databases that drive many thousands of applications, all developed independently, all separating business logic, you know, in code, from data and data model. You know, the cloud and systems like Snowflake, they're so scalable, made it possible for the first time to put all that data in one place. And now this next step is to, I don't know,

realign that data or simplify that footprint into something that, you know, a human being can wrap their brain around, because there's no way you can wrap your brain around 100 million things. You know?

And in your work of helping customers onboard onto the platform, understand the capabilities that it offers, and figure out how best to apply it to their problem domains, what are the elements of customer education that you find yourself spending the most time on, or some of the aspects of the platform that are maybe overlooked or underutilized that you want to bring to mind as people are trying to figure out how to solve the problem given the tools that you're providing?

Yeah. Look. It's early days for us, and, you know, building a system like ours that sort of revisits data management fundamentals from first principles requires a lot of energy in the algorithms and data structures and so on. And that energy has come at the expense of creating a more streamlined experience for users. Okay? So I would love to have more complete tooling, better tooling.

I would love to basically take the feedback that we've gotten from sort of early adopters about the language and, you know, what subset of it is accessible, and to make it all much smoother than it is today. So it's smooth in the context where we meet you, at Snowflake,

and we'll hopefully meet you in other places where people are over time. But there's still a lot to do to create tooling and an education framework and to be able to build on people's comfort and understanding of the relational model so that what we do doesn't seem so new, you know, so new and strange. Okay?

So I apologize in advance to some of the early users here who've helped us, you know, get that feedback that we need to create a more complete and more usable experience.

So I think there's more work to do there. I think with the kind of platform that we build, again, combined with Snowflake and cloud computing, you can start bringing together modeling environments that normally you would have thought of as separate. Like, you know, you build models in your databases, but you also build models in your spreadsheets. These two modeling environments have been, like, really disconnected.

With the technology that we're putting in place, you can imagine a more live modeling environment like you get in a spreadsheet where you're building the concepts and the relationships between the concepts, but you're also, like, doing that in an environment where there's data and it's running and you're getting instant feedback about

how, you know, input changes affect outputs and all of that. So that's not a this year or next year thing, but that's a down the road thing for us that, you know, I'm excited

to earn the right to get there at some point in the not too distant future. So And recognizing that you are still relatively early in your journey, for the customers who are applying your engine for the problems that they are trying to solve, what are some of the most interesting or innovative or unexpected ways you've seen it used? Like, the stuff with the language models has been really interesting.

We just did something with a client who's working in supply chain management where they use our system to do simulation, which was also really interesting because you don't normally think about, like, writing a simulation in a relational technology. I was very surprised. We have a client that is building tax applications. And, you know, in one subset of that, like, we replaced something like 100,000 lines of C# with about 7 pages of relational rules. You know?

You know, it's not common. It's not like the default. Like, usually, you see a 10x reduction, 20, 30x reduction, but that kind of compression was really interesting to see. You know, we have internal hackathons, and people have built, like, games and, you know, nontraditional applications using systems like this. So, yeah, it's a pretty versatile model, and it'd be great to see what people do with it once it's in the hands of the masses.

And as you have been building this platform, building the business, what are the most interesting or unexpected or challenging lessons you've learned in the process? Look. It was one of the most challenging things. Like, when you build a business, you know, ultimately you're building the company, but you're also building a product, and then you're trying to build a market or a category that, you know, it all fits in. Right? And

we've had sort of the confidence around our product and the value proposition of our product, but we didn't know how to explain it to people. We didn't know how to get people interested in what we do. You know, some of what we talked about today in terms of the fundamentals and the foundation, like, that's interesting to people with our background, but it's not interesting, you know, to a bank or a retailer or whatever. And

one of the best lessons I've learned is the importance of getting the go-to-market motion right, and really how to meet people where they are and articulate the value proposition in a way that they get excited about, which might be for totally different reasons than I'm excited about. You know? And so this is real.

I'm really, you know, happy with the response of the market to this positioning of us being a coprocessor to Snowflake, because, like, who wants another database system? Right? But people who've made a big investment in a Snowflake or a BigQuery, they want that to do everything that they want. And so positioning our technology not just as, you know, database number 857, but as a way of augmenting a platform that people are adopting, that people love, like

Snowflake's Net Promoter Score is like 70-plus. Okay? I mean, that's a good solution, a good technology that people really appreciate. And then plugging into that in a way where people really just obviously understand that, you know, there's a whole bunch of pain they don't have to deal with anymore.

Okay? And a whole bunch of risk around security and so on, and a whole bunch of processes like budgeting they don't have to deal with anymore. Alright? So as a technologist, I hate to answer the question, you know, in the sense that, like, my best lesson is a business lesson and a go-to-market lesson. Because as technologists, we all wanna have impact, and you can't have impact if you don't have a way of getting to users and buyers and so on.

So And for people who are interested in what you're building at RelationalAI, what are the cases where it's the wrong choice? Look. If you want something on premise, we're the wrong choice. If you're the kind of buyer that wants something, like, super polished, okay, we're not yet the right choice. Okay? We are earlier in our journey. People just have sort of different philosophies around where they wanna be in the adoption curve.

So we are newer, more innovative, more different, and so you have to factor that in. There will be, you know, issues related to that. But, again, let's say it's 2017 and you're trying to make a decision between a Hadoop based thing or a Snowflake based thing. Right? In hindsight, what did you do? What would you do knowing where we are today? Hadoop was more mature, had lots more users. It had more proof points than Snowflake.

And so you could argue that in 2017, Hadoop was the right choice. But you can also argue, knowing that by 2020 Hadoop was the wrong choice: do you wanna pay that early adopter tax so that you invest in a technology that's gonna be with you for the next decade or two, versus a technology that's about to become obsolete? You know? So

that's the trade off. And as you continue to build and iterate on the product and the business for RelationalAI, what are some of the things you have planned for the near to medium term or any of the problem areas or projects that you're excited to dig into? Yeah. So, again, get to GA on Snowpark Container Services inside Snowflake. That's a priority. Make the system just more polished and easier to just get going with,

reduce some of the complexity around sort of modeling and so on. There is a workload that I would love to get to that we've been working on in the lab, but not in production yet, which is around using graph neural networks to answer questions about data, you know, that you don't have. You know, like the example I used earlier about, like, what will my sales be next week? Right? It's a graph completion problem. Right? And so

I would love to have, like, a deep learning technology for structured data. It doesn't really exist yet. We have it for images, and we have it for text. And we've eliminated feature engineering, you know, for those types of problems, but we haven't really eliminated feature engineering for enterprise data yet. Yeah. And so, you know, graph neural networks might be a path towards that. And so you're using, you know, what you have in your database to predict

the things that you don't have in your database, you know, like things that are in the future, for example, or things that you can't measure. So, yeah, there's a ton of cool work that

we'll be doing in the years to come. So Are there any other aspects of the work that you're doing at RelationalAI, the problem space that you're addressing, the business, or just the kind of white spaces in the market that we didn't discuss yet that you'd like to cover before we close out the show? No. I think this is pretty comprehensive.

So thank you, Tobias. Thank you for all the great questions. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today. Well,

it's clearly, I would be inconsistent if I didn't say the complexity of pulling it all together. Right? Just too many moving parts, too many paradigms. It's mundane, it's tedious, but it's overwhelming. So that's, I think, a big one. And then in the enterprise space, having to do manual feature engineering is a bear. It makes models very expensive to build, and so you don't get to build as many models and deploy in as many decision-making

cycles and loops and so on. Well, thank you very much for taking the time today to join me and share the work that you are doing on RelationalAI. It's definitely a very interesting product, interesting problem domain. It's great to see somebody tackling this space and making this kind of discovery of data and

building of intelligent applications a more tractable problem without having to deploy as many moving parts. So I appreciate the time and energy that you folks are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Hopefully, we'll talk again soon. Thank you. Thank you for listening, and don't forget to check out our other shows, the Data Engineering podcast, which covers the latest in modern data management,

and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com

with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Transcript source: Provided by creator in RSS feed.