Big data is dead, analytics is alive

Oct 24, 2024 · 50 min · Ep. 292

Summary

Till and Adithya from MotherDuck discuss why DuckDB is revolutionizing data analytics, offering a unique, in-process SQL OLAP database that excels at fast queries even on local machines. They explain its advantages over traditional big data solutions, like Spark, and its seamless integration with AI workflows, supporting features like text-to-SQL, vector search, and AI-driven SQL query correction. The conversation also highlights MotherDuck's role as a cloud companion, enabling collaboration and dual execution between local and remote environments, making advanced analytics more accessible and efficient for developers.

Episode description

We are on the other side of “big data” hype, but what is the future of analytics and how does AI fit in? Till and Adithya from MotherDuck join us to discuss why DuckDB is taking the analytics and AI world by storm. We dive into what makes DuckDB, a free, in-process SQL OLAP database management system, unique, including its ability to execute lightning-fast analytics queries against a variety of data sources, even on your laptop! Along the way we dig into the intersections with AI, such as text-to-SQL, vector search, and AI-driven SQL query correction.


Sponsors:

  • Fly.io – The home of Changelog.com — Deploy your apps close to your users — global Anycast load-balancing, zero-configuration private networking, hardware isolation, and instant WireGuard VPN connections. Push-button deployments that scale to thousands of instances. Check out the speedrun to get started in minutes. 
  • Timescale – Real-time analytics on Postgres, seriously fast. Over 3 million Timescale databases power IoT, sensors, AI, dev tools, crypto, and finance apps — all on Postgres. Postgres, for everything. 
  • Notion – Notion is a place where any team can write, plan, organize, and rediscover the joy of play. It’s a workspace designed not just for making progress, but getting inspired. Notion is for everyone — whether you’re a Fortune 500 company or freelance designer, starting a new startup or a student juggling classes and clubs. 

Transcript

Welcome to Practical AI

Welcome to Practical AI. If you work in artificial intelligence, aspire to, or are curious how AI-related tech is changing the world, this is the show for you. Thank you to our partners at Fly.io. Fly transforms containers into micro VMs that run on their hardware in 30 plus regions on six continents. So you can launch your app near your users. Learn more at fly.io.

Sponsor: Fly.io

Hey friends, you know, we're big fans of fly.io, and I'm here with Kurt Mackey, co-founder and CEO of Fly. Kurt, we've had some conversations and I've heard you say that public clouds suck. What is your personal lens into public clouds sucking, and how does Fly not suck? All right. So public clouds suck. I actually think most ways of hosting stuff on the internet suck. And I have a lot of theories about why this is, but it almost doesn't matter. The reality is, if, like...

I've built a new app for like generating sandwich recipes because my family's just into specific types of sandwiches that use Braunschweiger as a component, for example. And then I want to like put that somewhere. You go to AWS and it's harder than just...

going and getting a dedicated server from Hetzner. It's actually more complicated to figure out how to deploy my dumb sandwich app on top of AWS because it's not built for me as a developer to be productive with. It's built for other people. It's built for platform teams to kind of build.

the infrastructure of their dreams and hopefully create a new UX that's useful for the developers that they work with. And again, I feel like every time I talk about this, it's like, I'm just too impatient. I don't particularly want to go figure so many things out purely to put my sandwich app in front of people.

and I don't particularly want to have to go talk to a platform team once my sandwich app becomes a huge startup and IPOs and I have to, like, do a deploy. I kind of feel like all that stuff should just work for me without me having to go ask permission or talk to anyone else. And so this has informed a lot of how we built Fly. Like, we're still a public cloud. We still have a lot of very similar low-level primitives as the bigger guys.

But in general, they're designed to be used directly by developers. They're not built for a platform team to kind of cobble together. They're designed to be useful quickly for developers. One of the ways we've thought about this is, if you can turn a very difficult problem into a two-hour problem, people will build much more interesting types of apps. And so this is why we've done things like made it easy to run an app multi-region. Most companies don't

run multi-region apps on public clouds because it's functionally impossible to do without a huge amount of upfront effort. It's why we've put things like the virtual machine primitives behind just a simple API. Most people don't do, like, code

sandboxing or their own virtualization because it's just not really easy. There's just no path to that on top of the clouds. So in general, like, I feel like, and it's not really fair of me to say public clouds suck, because they were built for a different time. If you build

one of these things starting in 2007, the world's very different than it is right now. And so a lot of what I'm saying, I think, is that public clouds are kind of old, and there's a new version of public clouds that we should all be building on top of, that are definitely gonna make me as a developer much happier than I was, like, five or six years ago when I was kind of stuck in this quagmire.

So AWS was built for a different era, a different cloud era, and Fly, a public cloud, yes, but a public cloud built for developers who ship. That's the difference. And we, here at Changelog, are developers who ship, so you should trust us. Try out Fly: fly.io. Over 3 million apps, and that includes us, have launched on Fly. They leverage the global Anycast load balancing, the zero-config private networking,

hardware isolation, instant WireGuard VPN connections, with push-button deployments scaling to thousands of instances. This is the cloud you want. Check it out. fly.io. Again, fly.io. Welcome to another episode of the Practical AI Podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, where we're building a private, secure gen AI platform. And I'm joined as always by my...

Struggling with big data

co-host Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris? Hey, doing very well today, Daniel. How's it going? It is going great. I'm super excited about this one because it's a very, you know, we schedule a lot of shows and they're all interesting, of course.

Um, but occasionally there's, like, a show on a topic that intersects with something that I'm working on at the moment, or something that I found that is really exciting and, you know, found to be really useful. And so selfishly, I'm really extra excited about this episode this week, which is with Till and Aditya from MotherDuck. How are you doing? Doing good. Excited to be here. Yes. And note, duck as in the bird. So editors, you don't have to bleep us out.

Sure, that's something that is an old joke for you all. I can pinpoint very easily how I ran across DuckDB and MotherDuck: there is a blog post, and the title is very simple. It said, big data is dead. And immediately when I saw the title, I was like, thank goodness, finally. But I'm wondering if you can maybe just kind of step back. It doesn't necessarily have to be the points in that blog post, but how do you see the kind of...

data analytics, big data, AI intersections as of now? And what are the sort of concerns and issues that people are thinking about that are driving them to DuckDB? And then of course we'll obviously get into DuckDB and MotherDuck and all that you're doing, but setting that stage of, you know, what are people struggling with? What have they realized in the past about this sort of

big data hype, in one way or the other, positive or negative? And how has that kind of changed the way that people are thinking about analytics and databases? I can tell a story about how I got in touch with DuckDB. It started at the very beginning of the DuckDB project. I was actually doing my master's thesis back then at the CWI, where DuckDB originated.

After I graduated, Hannes, who is the developer or the founder of DuckDB Labs, reached out and we were talking and he was saying, hey, we're working on this new project. We're working on this database system. Are you interested in like maybe joining, maybe working on it? But I was very focused on machine learning and stuff like this. So I wanted to go into data analytics, data science, these kind of things.

A year later or so, I was working at a telco company and we were analyzing customer data with Spark and so on. One day, one of the first versions of DuckDB was released, so I pip installed it and I ran the first simple aggregation query on maybe a 100 megabyte data set or something like this.

I was surprised because I thought something was going wrong. I thought it was impossible that it just did the aggregation, right? Because from working with Spark, I was so used to, okay, now the spinner's turning for 10 seconds at least. Right. And then that was really eye-opening. And I've heard similar experiences from a lot of people. Even until today, I hear very similar stories and experiences. Yeah.

For me, it started in a different way. I first figured out DuckDB-Wasm existed, that you could run an analytical engine in the browser. And to think about something like that was super crazy. The kind of stuff that you could do on top of it started to look super crazy. And one of the things that I was super excited about when DuckDB-Wasm was released was the possibility to do geospatial analytics. So back then when I started...

My first encounter with DuckDB was doing geospatial analytics. And then to think about that could actually be done in the browser was mind-blowing. And that's when my journey into DuckDB started. So let me ask y'all a follow-up question as you're diving into your passion. For those out there who may be listening who are not already familiar with it and they're hearing...

This is the way

database, they're hearing big data is dead, they're hearing about doing this in the browser. Give me a little bit of background on kind of the ecosystem that you were coming from, and also what this idea was, so that people can kind of follow you into that. What is it that caught your passion and attention and made you say, ah, this is the way? And assume somebody doesn't already have a familiarity with it. So

I guess I was going into this coming from the machine learning side of things. So I was used to working with scikit-learn, pandas, or the Spark equivalents to that, like Spark ML,

building data prep pipelines in there and so on and so forth. And then suddenly encountering this DuckDB thing, which apparently was doing aggregations on the sizes of data I was working with much, much, much faster, sparked some fantasies around, hey, how much of the data preparation pipeline can we push into DuckDB, actually?

This idea or this fantasy has been following me for the past years, and I think it's still an exciting topic. To follow up a little bit on that, the way that large data or big data has been analyzed in the last years was predominantly that you required some server in the cloud. You required resources that were not local to be able to perform large analyses. But something that DuckDB opened up, that it made possible, was to use local compute

on your local MacBook, for example, to utilize that compute to the fullest to perform this kind of huge analysis. And that, I guess, set the spark for a change in the ecosystem, I would say. And I guess that's where we're at. I resonate so much with this. So, like, coming from a background also as a data scientist, living through the years of being told, hey, you know, use Spark for this, basically my experience

in this sort of ecosystem was like, I would try to write a query and it would get the right result. But to your point, Till, I would just be waiting forever to get a result. And so I'd have to send it to some, like, other guy whose name was Eugene. Eugene was really smart and he could figure out a way to make it go fast. And I never became Eugene. So I resonated with this very much. And the fact that this concept of, hey, there's these seemingly big data sets out there and I want to do

maybe even complicated analytics types of queries over these, or even, you know, execute workflows of, as you mentioned, Till, aggregation or other processes at query time, and I could do that with a system that I could just run on my laptop or I could run in process, is really intriguing. So maybe now is a good time then to, like, introduce DuckDB formally. So, like, I'm on the DuckDB site, and it says DuckDB is a fast in-process

Introducing DuckDB

analytical database. So maybe one of you could take a stab at it, thinking about those data scientists out there who are maybe at the point of not believing that what we just described is possible, or who are living in a world where that's not possible. Describe what DuckDB is and maybe why that becomes possible as a function of what it is.

I think I can talk a little bit about the motivation behind DuckDB, or at least the way I perceived it at the time. It actually originated from the R ecosystem. Yeah, so Hannes was very involved in that ecosystem, and people were using R to essentially crunch relatively large data with relatively primitive methods. And at the time, CWI had an analytical database system called MonetDB that had

incorporated the idea of vectorized columnar query execution. It was a large system that was not really easy for the typical R users to adopt. So the first idea was to say, hey, let's maybe build a light version of MonetDB and integrate it with, I think it was dplyr

or something like this, and we just let it run on the client. But eventually, it turned out to be easier maybe to just rebuild the database system from scratch, one that was actually designed to run in process, to be super lightweight, super easy to install, and everything, essentially to put the power of this vectorized query execution into the hands of data analysts.

I'm wondering if you could, when you talk about that being in process and lightweight, could you describe what that means for someone that may not be familiar with the term in process? And how is that different from other databases?

What is in-process?

that are not in process, that have their own processes? Can you describe a little bit of what that means? So classical database systems operate in the client-server architecture. Usually you have a database server running somewhere and you have a client that sends SQL queries, essentially, to the database server, and then the result is transferred back to the client through some kind of transfer protocol.

Hannes and Mark, that is Mark Raasveldt, who is also co-founder of DuckDB Labs, were working on a paper that basically benchmarked these client protocols. And it turned out that that was actually a huge bottleneck. So even when you're running Postgres on your local machine, you still have this client-server protocol bottleneck. And the way to get around this is to have the database actually running within your process,

that is, you know, in that case, maybe R or Python, which then has access to the result set just in memory. No transfer has to happen. And maybe I'd like to just add, for those in our audience who maybe haven't done programming and stuff, that it's expensive to go between processes.

And so with that database server in a different process, it takes a lot of resources to go from the process you're in off to that and back. And so this puts it all into one, you might say, one little sandbox where you're able to maximize that. Would that be a fair

assessment? Yeah, yeah. So I think one of the other advantages of having this type of model is that you can share memory between the processes. So just to go a little bit inside the technical aspects of this, the bottleneck that Till was explaining was more like

the data transfer bottleneck. But in this case, when it's running within the process, you can share the same memory. You can share the variables that you're crunching inside, let's say, a Python script where you're crunching a variable.

And then you have access to the variable inside your database as well, for example. And this makes it super powerful for the developer, for the developer experience as well. And I guess, apart from the database itself being super fast, the developer experience of using DuckDB is so awesome that I guess that has also led to its success.

Sponsor: Timescale

Okay, friends, I'm here with a new friend of ours over at Timescale, Avthar. So Avthar, help me understand, what exactly is Timescale? So Timescale is a Postgres company. We build tools in the cloud and in the open... Okay, if our listeners were trying to get started with Postgres, Timescale, AI application development, what would you tell them?

What's a good roadmap? If you're a developer out there, you're either getting tasked with building an AI application or you're interested and you're seeing all the innovation going on in the space and want to get involved yourself. And the good news is that any developer today can...

become an AI engineer using tools that they already know and love. And so the work that we've been doing at Timescale with the pgai project is allowing developers to build AI applications with the tools and with the database that they already know,

being Postgres. What this means is that you can actually level up your career, you can build new interesting projects, you can add more skills without learning a whole new set of technologies. And the best part is it's all open source. Both pgai and

pgvectorscale are open source. You can go and spin it up on your local machine via Docker, follow one of the tutorials on the Timescale blog, and build these cutting-edge applications like RAG and search without having to learn 10 different new technologies, just using

Postgres and the SQL query language that you probably already know and are familiar with. So yeah, that's it. Get started today. It's the pgai project. Just go to any of the Timescale GitHub repos, either the pgai one or the pgvectorscale one, and follow one of the tutorials to get started with becoming an AI engineer just using Postgres. Okay, just use Postgres, and just use Postgres to get started with AI development: build RAG, search, AI agents.

And it's all open source. Go to timescale.com slash AI. Play with pgai. Play with pgvectorscale. All locally on your desktop. It's open source. Once again, timescale.com slash AI. So Aditya, you were just describing the developer experience, which I would definitely say is kind of fitting that magical experience that you alluded to with DuckDB.

Uses for DuckDB

Maybe just to give people a sense, like, you know, when I was initially exploring this, similar to some of the experiences that you all talked about, I would encourage our listeners to go out and install DuckDB locally and try something, because it is a really interesting experience, especially for those that have worked with traditional database systems in the past. So you kind of install DuckDB locally, import it as a library, and then you can query, you know, point to

CSV files or JSON files or Parquet files, or even a database like a Postgres database, or data stored in an S3 bucket, and then you have this consistent, familiar SQL interface with which you can do queries over that data. So I don't know, maybe one of you could describe some of the... Just to give people a sense of the use cases for DuckDB, maybe on one side where...

It's like the primary or the key or the most often occurring use cases that you see people grabbing DuckDB and using it for. And then maybe on the other side, just to kind of help people understand where it fits, maybe where it wouldn't be as relevant, if you have any of those thoughts. I can give, like, a brief overview of this. Some of the biggest users of DuckDB come from the Python ecosystem, which means that it's being a stand-in for a data frame, for example.

One of the advantages of using DuckDB is that it's really fast on aggregates. And for the Python ecosystem, it helps with standing in for a data frame to be used with other ML libraries, for example. So that's, like, one part of the ecosystem. And the other part of the ecosystem is for a data engineer to be able to pull in data from different sources, like you said, you know, from Postgres, from CSV, and to be able to join those different data sets.

Joins are really good with DuckDB as well. And to create transformed data sets is also pretty useful. And the third ecosystem is the data analyst who is writing SQL. And one of the really nice aspects of DuckDB is the SQL dialect itself. It's flavored in that you have a lot of DuckDB functions that make data cleaning easy, data transformation easy.

For example, we also have a dialect that says FROM table, and that's just going to show you the table. Instead of going SELECT star FROM table, you can go FROM table and that will just fetch data from that table. There are these flavors of dialect for DuckDB that make it nice. You know, I was also looking through the DuckDB website and stuff, and I know it runs on kind of all the major platforms and architectures and you support a variety of languages on it. I'm curious, because I have a...

Embedded environments & the edge

asking a question to my own interest, selfishly, as Dan would say. Do you support kind of embedded environments and kind of, you know, on the edge, that kind of stuff, where you find it embedded and operating where it's not necessarily on a cloud server on one of the major platforms? Is that a typical use case? That is one of the good use cases for DuckDB. Since DuckDB runs in process,

it can run wherever you run Python or R or anything. And they've also optimized it to run on different architectures as well. So this makes it possible. And to kind of go beyond that, you can also run it in the browser. So any edge environment, you can run it. Of course, there are, like, a lot of edge environments at the moment. Not everything is optimized

to run DuckDB. But I guess it's also moving towards being run in every edge environment as well. Some of our listeners might be curious why, you know, a person like me, who is sort of living day to day in the AI world,

Standardizing fast interfaces

is super excited to talk about DuckDB. I mean, certainly I have a past more broadly in data science, and this is pain I've felt over time. But also there's a very relevant piece of this that intersects with the needs of the AI community more broadly and the workflows that they're executing. And one of those, you know, and this is where I kind of started getting into this, is in these sort of dashboard-killing AI apps that people are trying to build, in the sense that, like, hey,

another pain of mine as a data scientist in my life is building dashboards, because you always build them and, you know, they never answer the questions that people actually have. And so there's this real desire to have, like, a natural language question input, and then you can compute very quickly the answer to that natural language question

by using the LLM to generate a SQL query to a number of data sources. But then when you start thinking about, oh, well, now I have these CSV files that people have uploaded into a chat interface, or I have these types of databases that I need to connect to, or I have this data in S3 buckets, and my answer could come from these different places, all of a sudden this kind of rich SQL dialect that you talked about, that's very quick and can

run with a standardized API across those sources, becomes incredibly intriguing for me. Transparently, that's how I sort of, like, got into this. I'm, like, thinking of all of these sources of data that I could answer questions out of using an LLM, but how do I standardize a fast interface to all of these diverse sets of data, and also do it in a way that, you know, is easy to use from a developer's

perspective? But I also know that you all see much more than I do. And maybe that is an entry point that you're seeing. I'm wondering if one of you could talk a little bit more broadly about how the problems that DuckDB is solving and the problems that your customers are looking at are intersecting with this rapidly developing world of AI workflows. I mean, one way to describe DuckDB is it's the SQLite for analytics. So it is basically a very

easy way, a very developer-friendly way to achieve what you just described. If I want to create a demo for my new text-to-SQL model, if I use DuckDB for it, I can even make a completely, like, Wasm-based demo out of it, for example. I don't have any issues with CSV upload. There might be databases where I have to specify the

delimiter of the file that the user uploads. So I would have to show a dialog to my user where he says, oh, that's comma separated and it has a header row and so on. With DuckDB, it just works. It takes away some of the edges you might have with other databases. And on top of that, as you said, it integrates with different storage backends: it can read from S3, it can read from HTTP.

When I see an interesting file on, let's say, Hugging Face or GitHub, I just run read_csv on this URL and I have the data set locally in my CLI or in my Python. Furthermore, when I have, say, a Python environment, I start a Colab notebook, right? And I create some data frames. Then with DuckDB, I can just read those data frames. I've seen very cool demos of people basically using text-to-SQL for analytics on Pandas data frames.

And under the hood, it's just DuckDB sitting there and basically reading straight from those Pandas data frames, which, by the way, is one of the other benefits of shared memory, of being in-process. It's not only for fetching results, it's also for reading data straight from the process. So in that case, from Pandas. That's very exciting. I'm happy to talk more about

text-to-SQL. We have had a project about that at MotherDuck. But yeah. Yeah. And maybe also before we get into maybe some of those stories,

Vector search

I think that's one side of it, the integration of this analytics piece into AI workflows. But then also, if I'm not mistaken, there are sort of vector search capabilities within DuckDB as well. I don't know if one of you could speak to that. Yeah, that's one of the exciting aspects of DuckDB as well. So if I could take a step back and think about other ecosystems where, let's say, Postgres

has been shining a lot. Postgres has exploded in the kinds of things you can do with it, because it has kind of an amazing extension mechanism where you can add extensions and capabilities to Postgres. In a similar way, DuckDB has an extension mechanism where you have access to the internal workings of DuckDB and you can add more workflows on top of what DuckDB can do.

DuckDB has these capabilities of doing vector search, for example, and it also has hybrid search, where you have full-text search and vector search that you can put together

to create hybrid search. One of the ways it does this is that it has a really nice data type. I could go down the rabbit hole of the inner workings of how they make this happen, which is also pretty exciting. But one of the things that makes this possible is that they provide an array data type, where you can have an array of floating-point numbers and store this as a data type, and that eventually becomes an embedding vector that you can do

cosine similarity against. So that is to do, like, an embedding-based search. Then you can also have full-text search, where you create an inverted index of keywords to your documents, and you can search across your keywords to find your ideal documents and rank them according to the score. And then you could fuse both of these scores, from embedding search and from

full-text search, to have, like, a hybrid search. So yeah, all of these are possible and they're very accessible. Well, there's no shortage of helpful AI tools out there.

Sponsor: Notion

But using these AI tools means you've got to switch back and forth, back and forth between yet one more tool. So instead of simplifying your workflow, it just gets more complicated. But that's not how it works when you're using Notion. Notion is the perfect place to organize lots of stuff: tasks, tracking your habits, writing beautiful docs,

collaborating with your team, knowledge bases. And the more content you add to Notion, the more this cool thing called Notion AI can personalize all of the responses for you. Unlike generic chatbots, Notion AI already has the context of your work. Plus, it has multiple knowledge sources. It uses AI knowledge from GPT-4 and Claude, and that helps you chat about any topic. And here's the kicker. Now in beta, Notion AI can search

across Slack discussions, Google Docs, Sheets, Slides, and even more tools like GitHub and Jira. Those are coming soon. And unlike specialized tools or legacy suites that have you bouncing between different applications, Notion is seamlessly integrated, infinitely flexible, and beautifully easy to use. So you are empowered to do your most meaningful

work inside Notion. From small teams to massive Fortune 500 companies, these teams, both small and large, use Notion to send less email, cancel more meetings, save time searching for their work, and reduce spending on tools, which helps everyone stay on the same page. You can try Notion for free today by going to notion.com slash practical AI. That's all lowercase,

notion.com slash practicalai, to try the powerful, easy-to-use Notion AI today. And of course, when you use our link, you're supporting our show, and I know you love that. Again, notion.com slash practicalai. So, Till, you're starting to get into even some of the things now that you're doing at MotherDuck on top of DuckDB.

Bringing MotherDuck to the cloud

I'm wondering, hopefully we can get to some of those use cases or the things that you've been doing with customers or internally. But before we do that, I also see this sort of story about DuckDB's efficiency, but with this kind of multiplayer aspect as part of what you're doing at MotherDuck. So maybe one of you could describe, kind of,

now I think we have a sense of what DuckDB is, and it's this free thing that is open, and I can pull it down, I can install it, I can run it very quickly, run it on my laptop, run it in my browser, do these analytics queries. So now kind of describe maybe a little bit of how you're taking that further with MotherDuck and how you're thinking about some of the enterprise use cases. I like to describe MotherDuck as giving your DuckDB a cloud companion.

So it's easy to think, okay, we bring MotherDuck to the cloud, which is one way we describe ourselves as well, and to associate that with: we provide infinite scale-up in the cloud. You give us a workload and we start however many hundred DuckDBs in the background that, in a task-like fashion,

let's say, process your data concurrently. But actually, one of the hypotheses that MotherDuck is based on, or that the company was founded on, is that single-node compute, which means one DuckDB database, with nowadays' hardware, cloud hardware, actually gets you very, very, very far. When your local compute resources reach a limit, you have single cloud instances with up to, how much is it, 24 terabytes of memory.

That's relatively big data. So that's one aspect, right? Scaling up with one cloud companion DuckDB. Another aspect is collaboration. So once you are connected to a cloud instance, you can have shared context with other users in your organization. You can create shared data sets. You can have shared notebooks, and so on and so forth. And with that, of course, come all the enterprise SOC 2 kind of things that some of the enterprise customers require to adopt tools like DuckDB.

I'm curious, because you really captured my imagination with that description. And so, you know, by drawing, for instance, with kind of the old-school Postgres things that people would do with that...

Problems solved

And you just talked about having many DuckDB instances operating concurrently. Grounding it in a practical way, from a user's perspective, what kinds of problems do you see people solving with that kind of architecture and that new capability that they may not have historically had over the years with previous

database capabilities on other platforms? What new sets of concerns can they address now with those?

I would come at this from the perspective that there are a lot of companies out there that, when they want to go to the cloud with their analytics workload, have relatively limited choices. One of those choices is Snowflake or Databricks.

Of course, those systems are optimized for big data scale. But one of our observations is that a lot of companies actually don't have that amount of data when they run queries, or they might have big data, but the queries they are running only access a very small subset of the data. For example, you run monthly reports; they don't touch your entire historic data set. So those companies might want to have something that is, first, easier to use,

easier to set up, and also more cost-efficient than other existing solutions.

One of the things that we haven't touched upon yet is how MotherDuck and DuckDB go hand in hand, with the remote and the local aspect, where you have the same client on your local and your remote, so that you're actually running the same thing.

So it's easy to go from one place to the other doing the same thing. And what MotherDuck also provides is dual execution, where your local DuckDB, if you're running it locally, can communicate with your remote MotherDuck and execute seamlessly between both. For example, take a query where you have a table in your local DuckDB and you want to join it with a table in a remote DuckDB: you can join both of these tables together

to run an aggregate. And then there's a query optimization that we run where we transfer the data that is required from the remote to your local, or from your local to the remote, and execute it intelligently, in a way, if I could say that. And this opens up new opportunities in the dual execution aspect of running the local and the remote with the same client.

I'm curious, and again, selfish question: as you're doing that and you have the local version and the remote version, the connection between the two there, what does that look like?

Cloud & local compatibility

If they're widely separated, if, you know, MotherDuck's in the cloud and I'm out on a device that's not cloud-based, is that efficient communication? How do you all handle those different types of use cases?

Yeah, so one of the principles of this dual execution is to reduce the amount of data that has to be transferred as much as possible. One of the use cases, for example, is: I have a really large data set on S3, and I want to join it with a small table that I have on my notebook. In that case, the query optimizer

will make the decision, instead of downloading the one-terabyte data set to a local device and doing the join there, to upload your small local file to the cloud worker and do the processing there. That saves a lot of bandwidth in that case. The same goes for filter pushdown: I query a large data set on S3 again, and

the transfer only has to happen for the rows that pass the filter. You can get something similar with DuckDB as well if the data is partitioned, so DuckDB has clever ways to optimize remote file access without MotherDuck too. But the thing you get with MotherDuck is that it filters the data even if your data is not partitioned, because the cloud worker still takes care of doing the bulk of the work and only gives you the results you actually want and need.
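
The transfer-minimizing decision described above can be sketched in a few lines of Python. This is a toy cost model under simplifying assumptions, not MotherDuck's actual optimizer: the `Table` class, the `plan_join` function, and all table names and byte sizes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    location: str    # "local" or "remote" (hypothetical labels)
    size_bytes: int  # rough size estimate used as the transfer cost


def plan_join(left: Table, right: Table, est_result_bytes: int) -> dict:
    """Pick the side to run a join on so the fewest bytes cross the network."""
    if left.location == right.location:
        # Both inputs live on the same side: run there. A remote join
        # still has to send its result back to the local client.
        transfer = est_result_bytes if left.location == "remote" else 0
        return {"run_at": left.location, "transfer_bytes": transfer}

    local = left if left.location == "local" else right
    remote = right if local is left else left

    # Option A: download the remote table and join on the local machine.
    cost_run_local = remote.size_bytes
    # Option B: upload the local table, join remotely, download the result.
    cost_run_remote = local.size_bytes + est_result_bytes

    if cost_run_remote < cost_run_local:
        return {"run_at": "remote", "transfer_bytes": cost_run_remote}
    return {"run_at": "local", "transfer_bytes": cost_run_local}


# A one-terabyte data set in cloud storage joined with a small local table.
big = Table("events_on_s3", "remote", 10**12)
small = Table("notebook_lookup", "local", 5 * 10**6)
plan = plan_join(big, small, est_result_bytes=10**6)
print(plan)
```

With these made-up sizes the sketch chooses to upload the small table and run the join remotely, which mirrors the example in the conversation. A real optimizer would also weigh filters, partitioning, and compute cost, which this sketch ignores.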

A lot of what we've talked about are the features of DuckDB, and then what MotherDuck is adding on that, and also how that intersects with AI workflows, like the text-to-SQL case or the RAG case where we're doing,

What makes it appealing?

you know, vector or semantic search, or we're doing hybrid search. All of those things are super relevant to people building their AI workflows. But I also find it interesting that, Till, you wrote one of the blog posts that I'm looking at now, which is like, you're also thinking as a company about how to use AI intelligently in your own product as well, for the

users of your product, who are maybe technical users building their own workflows. But you also have AI integrated into some of the features of that. I'm looking at this FixIt feature. So I'm wondering if you could talk a little bit about that, how you're both enabling AI developers and also

definitely integrating this technology as well. At least that's how it seems.

Yeah. As Aditya mentioned, one of the big appeals of DuckDB is the simplicity. That's what brings a lot of users to DuckDB. I think that simplicity can be extended towards usage of AI to a certain extent, like usage of AI in the context of data analytics and data management. And there are multiple aspects to that.

On one hand, there's the user experience side of things. So how can we make it easier for people to write SQL? And I think the answer to that is not only text-to-SQL. Part of that story is FixIt. One of our main aims with FixIt was to make it non-intrusive, not interrupting your flow of writing SQL, while still being helpful when it triggers. And I think Cursor, for example, is an excellent example of integrating AI

into the workflow of software developers. In our case, we have to think more about data engineers and data analysts. And I think it's a super exciting time for those kinds of things. I think MotherDuck is a particularly interesting place to work on those kinds of things, because one of the unique advantages that we have is that we have an actual database running on the client side,

in the browser of the user. If someone is using our web UI, that user actually has DuckDB running in their browser that can do parsing and binding. That gives us so much information about the current state of the query that the user is writing, and FixIt only scratches the surface of what is possible in terms of a SQL writing assistant in that sense.

So I'm curious, as we start winding up, you really got me thinking about use cases that I had not thought about before and all the things I might be able to do here. So I'm a little bit like a kid in a candy store.

New cool things

I'd like each of you to take a swing at it. It's pretty cool what you've talked about today in terms of what is possible for us. How are you thinking about the future? What are the new cool things that you have in mind? You know, I often say, when you're not necessarily working hard on a problem, but you're kind of chilling out at the end of the day and your mind's just wandering in free form, and you're thinking,

boy, what if we could do this? I could imagine that, and I can kind of see a path forward to get there. How are each of you thinking about MotherDuck and DuckDB in terms of what the future might offer? If you want to kind of get out there and wax poetic a little bit, it doesn't have to be grounded in current work,

or in imagination and aspiration?

One of the things that I really like about the current state of AI is how good the local models are, the small models that you can run locally. And there's a great ecosystem out there building on top of that. One of the things that I see with the local models is that, of course, they hallucinate, but to prevent hallucination, you can use a really nice RAG mechanism to put context into those local models.

And these local models could be on the edge as well. It could be on your local laptop. It could be on the edge. And knowledge bases are essentially created to prevent these kinds of hallucinations. One wasteful aspect of creating knowledge bases is that everybody's creating very similar knowledge bases. What if there could be a mechanism where we could share these knowledge bases?

A user could create a knowledge base and they could share that knowledge base. And one of the imaginative worlds that I've dreamt up is how MotherDuck could be there to do these kinds of shareable knowledge bases, where you essentially have a world of remote knowledge bases out there in your remote tables. And then you have a local DuckDB client that helps you pull a knowledge base that you want, use the local knowledge base, and augment your local model with

the relevant context for your current question. And then when you don't want the knowledge base, you could also drop the knowledge base. That's like having a remote knowledge base repository and pulling whatever you want. This is one of the dreams that I think about, how MotherDuck and DuckDB could be useful for this. And another aspect of talking about knowledge bases and RAG applications is that

not all applications and workflows require a real-time database to build agents on top of them. Some of these agents could be running as background agents that do some workflow once every day. And instead of having a real-time database for that, what if you could provide a very lightweight analytical engine that's quite cheap to run locally as well? And you could also offload some work to the remote cloud.

So this is another thing that keeps me excited at night, to think about what these kinds of use cases could be. These are the two use cases that I am quite excited about.

Yeah, I mean, maybe I can add two things. The one thing that actually connects to that is bringing AI and machine learning capabilities more into the database. One of the things

we've seen in the past is that the inference costs of language models have dropped quite significantly compared to two years ago. It's now, I think, only 2% of the price for inference with GPT-4o mini compared to GPT-3. And that actually makes it possible to run language model inference on your tables, and also to do things like embedding compute on your tables.

SQL is just a really convenient user interface for that. So we added this embedding function some time ago that works really well together with the vector search, so you can basically do embedding-based search purely in SQL. Now we're adding the prompting capabilities, so you can do language-model-based data wrangling in your database. And that goes together with local models and this hybrid execution model, where we say, okay, we do part of the work locally.

Maybe if you have a GPU, do part of the embedding inference locally. If you want to do it faster, do it in the cloud with a few A100s. And again, everything is in SQL.

That's awesome. Yeah, well, thank you both for taking time out of your analytics, AI, and database work to come talk to us. This has been super amazing.
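
To make the embedding-based search mentioned above a little more concrete, here is a minimal pure-Python sketch of the idea: compute one vector per table row, then rank rows by cosine similarity to the query vector. It does not use DuckDB or MotherDuck functions. `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the rows are made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A tiny "table" of (id, text) rows; the embedding is computed once per row,
# which is the "embedding compute on your tables" step.
rows = [
    (1, "duckdb runs analytics queries in process"),
    (2, "notion helps teams write documents"),
    (3, "vector search finds similar text"),
]
table = [(row_id, text, toy_embed(text)) for row_id, text in rows]


def search(query: str, k: int = 2) -> list:
    """Return the k rows whose embeddings are closest to the query."""
    q = toy_embed(query)
    ranked = sorted(table, key=lambda r: cosine(q, r[2]), reverse=True)
    return [(row_id, text) for row_id, text, _ in ranked[:k]]


top = search("fast analytics queries", k=1)
print(top)
```

In a database, the same two steps would be expressed as an embedding call over a column followed by an order-by on a similarity function, which is what makes SQL a convenient interface for it.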

Thanks for joining us!

And I would definitely encourage people out there, please, please, please go try out some things. Try out some examples with DuckDB. Check out the Mother Duck website and some of the great blog posts, content that they have there, examples or things. Check it out because it's definitely a really wonderful thing that you can add into your AI stack and think about and experiment with. So thank you so much, Till and Aditya, for joining. It's been a pleasure. Thank you, guys, for having me.

Thank you, guys. It was pretty awesome to be here. All right. That is Practical AI for this week.

Outro

Subscribe now if you haven't already. Head to practicalai.fm for all the ways. And join our free Slack team where you can hang out with Daniel, Chris, and the entire Changelog community. Sign up today at practicalai.fm slash community. Thanks again to our partners at fly.io, to our beat freak in residence, Breakmaster Cylinder, and to you for listening. We appreciate you spending time with us. That's all for now. We'll talk to you again next time.

This transcript was generated by Metacast using AI and may contain inaccuracies.