Arjun Patel on Vector Databases and the Future of Semantic Search

⁠¶ Arjun Patel: Bridging AI & Education

00:00

Welcome back to Data Driven, the podcast where we chart the thrilling terrains of data science, AI, and everything in between. I'm Bailey, your semi-sentient host with a penchant for sarcasm and a wit sharper than a histogram spike. Today's episode promises a delightful mix of the analytical and the artistic as we dive into the fascinating world of vector databases, retrieval augmented generation, and origami. Yes.

00:26

You heard that right. Origami, the ancient art of folding paper, somehow finds itself intersecting with AI, proving that the future really does have layers, or should I say, folds. Our guest, Arjun Patel, is a developer advocate at Pinecone who's on a mission to demystify vector databases and semantic search, turning complex AI concepts into snackable bits of brilliance. He's also a self-taught origami artist and a

00:52

former statistics student who actually enjoyed it. So if you're ready to unravel the secrets of modern AI and maybe pick up a trick or two about folding life into geometric perfection, you're in the right place. Hello, and welcome back to Data Driven, the podcast where we explore the emergent fields of data science, AI, and data engineering. Now today, due to a scheduling conflict, my most favorite data engineer

01:19

in the world will not be able to make it. But I will continue on, despite the recent snowstorms that we've had here in the DC Baltimore area. With me today, I have Arjun Patel, a developer advocate at Pinecone, who aims to make vector databases, retrieval augmented generation, also known as RAG, and semantic search accessible by creating engaging YouTube videos, code notebooks, and blog posts that transform complex AI concepts

01:49

into easily understandable content. After graduating with a BA in statistics from the University of Chicago, his journey through the tech world spans from making speech coaching accessible with AI at Speeko to tackling AI generated content detection at Appen. Arjun's interest spans traditional natural language processing into modern large language model development and applications. Beyond his technical prowess, Arjun has been designing and folding his

02:19

own origami creations for over a decade. Interesting. Seamlessly blending analytical thinking with artistic expression in his professional and personal pursuits. Welcome to the show, Arjun. Hey. Nice to meet you, Frank. Thanks for having me on. Excited to be here. Awesome. Awesome. There's a lot to unpack from there, but I think it's interesting to note that you have a BA in statistics. Yes. So you were probably studying this sort of stuff before it was cool?

02:45

Yeah. Yeah. A lot of the old school ways of analyzing data, understanding what's going on, so on and so forth. It was kind of, like, made clear to me pretty early that understanding how to work with data at small scale and at large scale is gonna be very important going into the future. So I kinda just took that and ran with it with my education. Very cool. It was definitely, you know, one of those things where I don't think people realized how important statistics would be until,

03:15

you know, until the revolution happens, so to speak. So and it's also interesting to see because there's a lot of people that I think could benefit from, you know, picking up an old statistics book and reading through it and understanding, like, a lot of the fundamentals. Obviously, there's a lot of new things, but a lot of the fundamentals are largely the same. You know, just I'll use this example. You know, McDonald's can add a McRib sandwich,

03:41

but it's still a McDonald's. Right? Like, it's This is what happens when you're shoveling snow. Like, your brain gets I absolutely agree. And, like, another proof on that point is that Anthropic just released a blog recently kind of recapping how to do statistical analysis when you're

04:00

comparing different large language models. And when you read the paper in the blog, it's basically just like two-sample t-tests and kind of going over really, like, not introductory, but still statistics that's easily accessible for people to learn and understand. So it's still relevant, and it's still important. Interesting. One of the things that stood out in your bio was, people tend to forget that there was a natural language processing field prior to ChatGPT launching.

04:31

Do you, you know, wanna talk about the difference between those two? Sure.

⁠¶ Traditional NLP and Geometric Models

04:40

So one of the first and probably the only course I took in college related to natural language processing was called geometric models of meaning. And everything I learned in that course was like everything before what we now would consider, like, modern embedding models. So bag-of-words methods, understanding how to represent documents and text purely based on, like, the frequency of the words that exist in the text,

05:06

and then trying to understand, like, okay. Based on that information, how can we learn about the concepts that exist in text from the words that are being used? Like, what is the framework we can use to understand what these words mean based on their co-occurrences with the other words and texts that you're working with and based on what those words mean as well. So, like, what the words' neighbors are and what their meaning

05:28

helps and also what those words are doing. And I think a lot of traditional natural language processing methodologies kinda stem from that, and there's a lot of mileage you can get out of just thinking about approaching problems there before you step into these more complicated methods,

05:43

like these modern embedding models that exist. So that's kind of, like, what I would consider, like, traditional NLP, like, doing named entity recognition, trying to understand how to find keywords really quickly. And then once you get really good at that, there's a whole host of problems that you encounter afterward that kind of modern techniques try to

06:02

solve. Right. That's interesting. So what were your thoughts when you first, like, given that you were an NLP practitioner prior to the release of transformers and things like that, what was your initial thought? Because I'm curious because there are a lot of experts today that really kind of started a couple of years ago. No fault on them. They see where the industry is going. Totally understand it. But what

06:28

were your thoughts? What were your thoughts when you first saw the "Attention Is All You Need" paper? So that would have been probably around the time I graduated college, around maybe a year or two after I took the course that I was just describing. So I had just started learning about, like, okay. Like, this is how, like, old school, quote unquote, like, embedding methodologies work. And the biggest takeaway that I got from those is that they work

06:56

pretty well. They work pretty well for, like, lots of different kinds of queries. And I think what the "Attention Is All You Need" paper did was it kinda helped you understand how to rigorously create representations of text that generalize way better than any sort of, like, normal, keyword-based, bag-of-words-based search methodology.

07:18

And I think that at the time, I probably didn't grasp as much what impact the "Attention Is All You Need" paper would have on the field until we started getting embedding models that people could use really easily, like RoBERTa or BERT. And we're like, okay. Now we can do, like, multilingual search without any issue. Now we can represent, like, any sentence without keyword overlap when we wanna find some document that's interesting, without doing any

07:45

additional work. Like, once those papers started hitting the scene, I think now we start seeing, like, okay, this is what attention is doing for us. This is what the ability to, like, contextualize our vector embeddings is doing for us. And now we can see what's kind of getting benefited there. But I think my understanding of how beneficial that was kind of lagged until we started seeing these other models kind of hit. And

08:06

I'm like, okay. Now I can kinda see why this is important and why, like, future models are gonna get better and better based on this architecture. Interesting. So for those that don't know kind of and even I'm rusty on this. Right? Yeah. One of the things that was interesting about this was the

08:21

in first, appearance. What was it? You just described it a minute ago, but it was something like the prevalence of a word in a bit of text versus the lack of prevalence and how that metric was very important in, I'll call it, classical natural language processing.
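
The frequency-based, bag-of-words representation described above can be sketched in a few lines. This is only an illustration under stated assumptions: tokenization is a plain lowercase whitespace split, and the two documents and all function names are made up for the example rather than taken from the conversation.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. Real pipelines also handle punctuation.
    return text.lower().split()


def build_vocabulary(documents):
    # Every unique word across all documents, in sorted order.
    return sorted({word for doc in documents for word in tokenize(doc)})


def bag_of_words(text, vocabulary):
    # One count per vocabulary entry, so the vector is as long as the vocabulary.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical two-document corpus.
documents = [
    "the dog chased the ball",
    "the cat watched the dog",
]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    print(doc, "->", vector)
```

Each vector has one slot per vocabulary word, so word order and context are discarded and only the counts survive.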

⁠¶ Co-occurrence and Meaning in Text

08:40

Right. So this is the idea that if you have words that co-occur together in some document space, the meanings of those words are gonna be more similar than words that don't co-occur in some other given document space. This is rooted in something called the distributional hypothesis, which is basically this idea and the other idea that concepts cluster in this type of

09:02

space. So what does that mean actually? Right? So if you have the word like hot dog, it's more likely to be seen in a corpus near other food-related words than it would be if you picked some other word like space or moon. And there's something we can learn from that relationship to infer the meaning of what that word is and how we can use that meaning of that word to learn about what

09:24

other words are doing. So this is kind of, like, the theoretical basis of, like, why we can represent words geometrically, with a little bit of hand waving. But that's kind of the core idea. And attention kind of takes this a little further by allowing the representation of these tokens or words to be altered based on the words that occur in a given sentence. So you might have a word like does, like, does this mean something?

09:50

You might say something like that. Or you might say, I saw some does in the forest. Both spelled exactly the same, but have completely different meanings based on their context. And if you used a traditional, maybe, bag-of-words model where you're just counting the words that occur in a given document and kind of creating a representation of what that document looks like based on the words that are composed in there, you're gonna end up conflating the meanings of the word

10:16

does and does because they're spelled exactly the same. They might look exactly the same with this type of representation. But if you have a way of informing what that word means with its context, which is what attention allows us to do, then you can completely change how that's being represented in your downstream system, which allows you to do interesting things

10:34

with search. So that's kind of, like, the biggest benefit that's coming out of that type of methodology, and that kinda enables what is now known as semantic search and retrieval augmented generation and so on and so forth. I was gonna say, that sounds very, it's almost like it was, like, from before that era, the vectorization of this and the distance in that vector in that geometric space. I guess we've been doing that for a lot longer than most people realize in a

11:00

sense. Yeah. I mean, looking through indexes or document stores with some sort of vectorization has been something that people have done, except instead of being dense vectors, which is, like, you have some fixed size representation that isn't necessarily interpretable to the human eye for some given query or document, it would be, like, the size of your vocabulary. So you think of, like, Wikipedia. You can find, like, every unique word on Wikipedia, and, like, that is gonna be how

11:31

big your vector's gonna be. And every time you have a new document come in, a new article somebody's kind of, like, written up and published to Wikipedia, like, you're representing that in terms of its vocabulary. But now instead of doing that, we have, like, this magical fixed-size box that allows us to represent chunks of text in a way that is

11:49

extremely fascinating and abstract. And every time I think about it, it just, like, blows my mind, but that's kind of, like, the main kind of difference is the way we're representing that information and how compact that is and how

11:59

generalizable it has become. Yeah. That is, like, it's almost like you're, you know, correct me if I'm wrong, but, you know, creating these vectors, these large vector databases, right, with, you know, 10, 12,000 dimensions, right, of how these words are measured in relationship to others. It's almost as a consequence of training a large language

12:21

model, you create a knowledge graph. Is that true? Is that really the case where, you know, like, you know, dog is most likely to be next to, you know, the word pet, you know, or it has the same distance. Is that I'm not explaining it right. No. No. No. You're on the right track exactly.

12:39

And I think this is, like, one of the most fascinating qualities of even, like, what people would consider, like, older embedding models is this idea that you can take, like, a training task that seems completely unrelated to the quality that you want in a downstream model, and it turns out that that actually achieves that quality. So, what you were referring to, Frank, is this idea that you might have, like, a sentence. You might have, like, I took my dog out on a walk, and you might say,

13:05

okay. I'm gonna remove the word walk, and I'm gonna train some model that tries to predict what the word I removed was. This is masked language modeling, which is this idea that you're

⁠¶ Masked Language Modeling Success

13:17

kind of getting at of, like, okay, what are the words and how are they in relation to the other words in that sentence? And it turns out that if you, like, do this with, like, hundreds of thousands or millions of sentences and words, in some corpus that is somewhat representative of how people use human language, you will get really good at this task, number one, because you're training the

13:37

model on that task exactly. But if you are training a neural network on that task, some intermediate layer representation in that model, so somewhere in that set of matrix multiplications where you're turning this input sentence into some fixed size vector representation, is gonna be a good representation of what that word or that token or that sentence is going to be. And the fact that that works is not intuitive. Right?
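
The masking step of masked language modeling, as described above, can be sketched as follows. It only builds a training pair (masked sentence plus the removed word) and does not train a model. The `[MASK]` placeholder string, the function name, and the whitespace tokenization are assumptions for illustration; real pretraining masks subword tokens across a very large corpus.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder; the actual token depends on the model


def make_masked_example(sentence, rng):
    # Pick one word position at random, hide it, and keep the hidden word
    # as the label the model is asked to predict.
    tokens = sentence.split()
    position = rng.randrange(len(tokens))
    target = tokens[position]
    masked = tokens.copy()
    masked[position] = MASK_TOKEN
    return " ".join(masked), target, position


rng = random.Random(0)  # seeded so the sketch is repeatable
sentence = "I took my dog out on a walk"
masked_sentence, target, position = make_masked_example(sentence, rng)
print(masked_sentence, "->", target)
```

A model trained on many such pairs has to use the surrounding words to recover the hidden one, which is why its intermediate representations end up encoding context.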

14:04

The fact that that works has been shown empirically, and it turns out that we can kind of do that and kind of have these models work really well. And nowadays, in addition to kind of doing that, which is what we would consider pretraining on some large corpus, we now fine-tune those embedding models on specific tasks that are important to us for retrieval. Like, okay, we have this query or question we're asking. We have the set of documents that might answer this question or might

14:28

not. We want a model that makes it so that the query's embedding and the relevant documents' embeddings are in the same vector space. So you're on the right track. That's, like, basically how these models are able to learn these things. I don't know if I would call them graph representations, maybe being a little bit pedantic on, like, use of words there, because that can be a little bit different in how you're organizing that information.

14:50

But you can make the argument that the way that these large language models are representing information is a compressed form of, like, the giant dataset that they're trained on. And we don't actually know exactly, like, where that information lies inside that neural network. There's some research that's, like, trying to get at answering that question. But you could, for the sake of

15:08

argument, be like, yeah. There's probably, like, a dog node somewhere in this neural network that knows a ton about dogs, and that's how we're able to kind of learn this information. That is the stuff that we don't exactly know. Interesting. Because, there was a really good video by 3Blue1Brown, which you probably are... I love that channel. Where he gives examples where, you know, famous historical leaders from Britain have the same distance, if you change the country to Italy

15:37

or the United States have the same kind of distance. So you can kind of infer I'm not saying that the AI it almost seems like this knowledge graph is also a byproduct of building this out. Like, there's some type of encoding or semantic, I guess, is really what it is. Right? Like, that you get with it. And, I wanna get your thoughts because yesterday, I caught the first half of the Jensen Huang keynote at CES,

16:07

which, you know, we're recording this on January 8th. Right? And one of the things that the video starts off with is, you know, the idea that tokens are kind of fundamental elements of knowledge. And I did a live stream where I'm like, well, I never really thought about it this way. Right? They're building blocks of knowledge or the pixels, if you will, of knowledge. And I wanted to get your thoughts on that because, like, that kind of blew my mind and maybe I'm simple.

16:32

I don't know. Maybe I'm not. But it seems like we've been kinda dancing around this idea where and now NVIDIA is really fully, you know, going all in on this, the idea that, you know, this isn't an AI system. It's a token factory or a token score. What are your thoughts on that? I'm curious.

⁠¶ Understanding Tokenization in AI Models

16:50

So when I started learning about how, like, tokenization works and how we're able to kind of, like, basically build these models without having massive, massive vocabularies, it is pretty interesting to be, like, okay. Like, maybe there's some abstract notion of information that each token has that is what the model is learning during training time. And then we're just combining these sets of information in order to kind of, like, understand

17:21

what words mean or what documents mean, so on and so forth. Because when you look at how tokenizers work and the number of tokens for, like, maybe the English language or maybe, like, a really multilingual model like RoBERTa or multilingual-e5-large, it's a lot less than you would expect. Like, it's on the order of, like, maybe 100,000, 200,000, 300,000 tokens. So it is kind of odd to think about whether those tokens

17:50

themselves hold information that's readily interpretable for us. But I think that we've gotten so far with using systems that are just combining the operations on top of these tokens in order to retrieve the information that these systems have learned, that there's definitely something important there. And I would love to, like, know exactly, like, what is happening when we're able to do that. The

⁠¶ Understanding Large Language Models

18:12

heuristic that I like to use is that large language models are generally reflections of the training datasets that they've been trained on, and they're basically creating, like, really efficient indexes over that

18:23

information. And sometimes those indices hallucinate. And the reason why is because when we, quote, unquote, ask a question to a large language model or query a large language model, we are kind of conditioning that model on a probability space where every token being generated after is likely to exist given the query or the context or whatever we're passing to

18:46

it. And once you think about it that way, then it just feels like instead of thinking about what each of the tokens are doing, you're kind of just querying what the model has been trained on and what it will tell you based on what it, quote unquote, learned or knows. And then you can kind of run with that metaphor a lot and build systems on top of that. That seems much more actionable than thinking about, like, what each of the tokens are doing individually. Does that kinda make sense? No.

19:11

That makes a lot of sense. I think the whole gestalt of it is what really makes it magical. Right? Like Yeah. You know, obviously, this is not, like, the newest iPhone or whatever. But, you know, if you go through the text autocomplete, you can maybe make a sentence that sounds like something you would write. But much beyond that, it starts getting weird. And early generative AI was very much like that, particularly the images.

19:35

Well, you know... Like, yes. I 100% understand. I started learning about text generation before we had instruction fine-tuned models. So are you familiar with, like, the concept of instruction fine-tuning, Frank? I think I am, but IBM slash Red Hat defines it one way. I would like to get your opinion. Yeah. So, this is the idea that you can train or fine-tune large language models to follow

20:01

instructions to complete tasks. So, before we had, like, models that we could just, like, ask questions of and just, like, receive answers directly, you had to craft text that would increase the probability that the document that you want to generate would happen. So if you wanted a story about, like, unicorns or something, you would have to start your query to the LLM as, there once was, like, a set of unicorns living in the forest. Blah blah blah blah.

20:27

And then it would just, like, complete the sentence, just like a fancy version of autocomplete. Right. And that's kind of, like, what we used to have, and that was pretty hard to work with. And then once researchers kinda cracked, like, wait a second. We can create a dataset of, like, instruction pairs and, like, document sets and fine-tune models on them. And it turns out now we can just, like, ask models to do things, and they will do them. Whether or not

20:48

those are correct is kind of the next part of the story. But getting to that point, it was, like, pretty interesting and pretty significant. Interesting. Interesting. When I think of fine-tuning, I think of primarily InstructLab, where you basically kinda have a LoRA layer on top of the base LLM doing that. Is that the same thing? Or is it kind of slightly, it sounds like it's slightly nuanced. So the nuance there is that, one, the methodology that I'm

21:22

describing is mostly dataset driven. So you have, like, your original LLM, and then you have, like, a new dataset that allows the LLM to learn a specific task. Or in this case, like, a generalized form of tasks, which is you have instruction, answer, user query, give it an instruction. Whereas in your case, you're kind of, like, adding another layer to the LLM and, like, forcing the LLM to learn all the new methodology inside that layer in order to accomplish a specific

21:48

task. So that's kind of like what fine-tuning ends up doing. So there's multiple ways to do this, it seems. Right? Like, there's that way where we add the layer, but there's also kind of I hate the term prompt engineering because it's just so overblown. But, like, giving it more context and samples. And now that the token context window is large enough that you don't have to be well, if you wanna save money, you have to be very mindful of that. But if you're running it

22:12

locally, like, doesn't really matter. Well, you could give it an example of let's just say you had I'm trying to think of a short story or a novel. I don't know. Let's pretend Moby Dick was only 100 pages. Right? I could give it that as part of the prompt. Let's say, write a sequel to this book based on what happens in this one. Is that what you're talking about? Where you're kinda giving an example as part of the prompt? Or is there

22:38

some and not part of the layer? Or some combination thereof? Or some third thing entirely? So this would be like, what

⁠¶ Instruction-Following vs Few-Shot Learning

22:45

you're describing is more like few-shot learning, which is you give kind of an example, and then you're, like, okay. Like, given these examples, can you do this task that I've described on this unseen example? What I'm describing is kind of, like, slightly before that. So, like, before we had the ability to, like, give models examples, we had to create the ability to follow instructions. And then once you have the ability to

23:07

follow instructions, you can be like, okay. Here are the instructions. Here's examples of correctly completing the instruction, now do the instruction. And the reason why that happens in that order is because first, you have, like, just, like, sequence completion, like, autocomplete. Then you have, like, okay, given this set of instructions, just follow the instruction instead of, like, trying to do autocomplete. And then you have, okay, now you know how to

23:32

follow instructions. I'm gonna give you a few data points in order to learn a new task. Now do this new task. So you're kind of, like, moving from a situation where you need tons and tons of data just to get the sequence completion. And then you need a smaller set of data to, like, get the capability to follow instructions. And then you need a very, very, very small amount of data, like, maybe 3 points or 10 examples or 15 examples to complete kind of, like,

23:59

a new task. So there's a lot of kind of nuance in, like, how modern LLMs are being used and how they're kind of trained and fine-tuned, so on and so forth. And I think there's a lot of, like, importance in, like, learning what happened kind of before because the advancements have happened so quickly. It can be really hard to kind of differentiate, or, like, oh, why do models perform like this? Why

24:20

do things kind of happen like that? And even though, prompt engineering has kind of, like, let's say, traveled through the hype cycle where people were, like, really excited about it, and then we're, like, this

24:31

is not actually that interesting. Right. What's interesting is that, building a good RAG system or retrieval augmented generation system, you really need to be good at prompt engineering in a sense because you're assembling the correct context for this model to answer some downstream question, and it's not

24:49

intuitive how to assemble that context. So understanding, like, how these models are trained, like, whether they can follow instructions, how good they are at doing so, how many examples of information they need in order to accomplish some task really affects how you build that knowledge base in order to help the model do some sort of new thing. Interesting. So RAG is obviously all the rage now. Yep. But there's also a relatively new because this

25:17

space changes rapidly. Like, I mean, I took 2 weeks off in December, and I feel completely disconnected from the cutting edge, you know. Because when I was watching the keynote from CES, and I'm like, wow. That's really cool. And I was texting, you know, slacking with a coworker, and he goes, oh, no. This is a retread of their, like, last keynote they did. Like and I'm like, okay. Wow. Blink and you missed

25:39

something. So what you're describing, the fine-tuning, is that really what RAFT is, where the idea that you have kind of retrieval augmented fine-tuning, which I think is what the acronym stands for. Is that not I'm not familiar with how RAFT works. So I don't wanna, like, kind of venture a guess without knowing what it is. But do you remember, like, what context you encountered this in? Basically, it's the idea that

26:06

it's the idea that you can fine-tune the results. Sounds very similar to what you're doing, and I haven't read the paper in a while. Back when I was a Microsoft MVP, like, you know, Microsoft Research had the thing for their calls, and they were all raving about it. The paper had just come out and things like that. It's the idea that you can kind of give it pretrained examples.

26:30

You start with a base LLM, and you give it pretrained examples, and then you add on top of it just the retrieval augmented portion of it. It's very similar, not to plug my you know, for my day job. I work at Red Hat. That's why

⁠¶ RHEL AI: Open Source Data Tool

26:43

there's a fedora there. We have a product called RHEL AI, which is based on an upstream open source project called InstructLab. And it's a similar idea in that you basically give it a set of data. And then there's a little more to it because there's a teacher model. And basically what it'll do is synthetic data generation. So you can start with a modest document set.

27:10

And based on the questions and answers that you form and the taxonomy that you attach to it, it will create a LoRA layer on top of an existing LLM. And it could be that it's not quite exactly the same as RAFT, but it's definitely in the same direction. Same thing as, like, BERT, ELMo, and, you know, RoBERTa, which, I think

27:37

I think I understand. So I think the problem that it might be addressing is kind of just really similar to the problem that traditional RAG tries to address, except in a more kind of deliberate fashion Exactly. Yeah. Where you have some document store internally. Like, let's say we both work at some company, and we have a giant customer support document store. You take some LLM off the shelf. It's not necessarily gonna know the

27:59

contents of your internal kind of documents. So how can you get it to, like, successfully help answer tickets or triage tickets that you're trying to build, so that you can answer, like, most difficult tickets and kind of work toward that. In this situation, maybe you want to inject some of the knowledge of the documents in addition to having the model being able to search over the document store. So maybe, like, what this

28:24

LoRA layer is doing is, like, absorbing Yeah. Some of the knowledge from the document store so that you can kind of more efficiently query the database and so that you don't have to, like, query it all the time. The only issue, quote, unquote, I'd have with that method is that you'd have to, like, keep

28:43

that updated from time to time, and that's, like, nontrivial. Whereas if you just do, like, traditional RAG, you just need to update your vector store, and then you can just have the model query that new information when you need to. But, you know, it's always best to use whatever solution works best for your given use case.
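
The "just update your vector store" point can be sketched with a toy in-memory store. Everything here is an assumption for illustration: `ToyVectorStore`, `embed`, and `build_prompt` are hypothetical names, the word-count `embed` function stands in for a real embedding model, and the support documents are made up. It is not the API of any particular vector database.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in embedding: word counts. A real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self):
        self.records = {}  # doc_id -> (vector, text)

    def upsert(self, doc_id, text):
        # Insert a new document or overwrite an existing one with the same id.
        self.records[doc_id] = (embed(text), text)

    def query(self, question, top_k=1):
        # Return the texts of the top_k documents most similar to the question.
        q = embed(question)
        ranked = sorted(self.records.values(),
                        key=lambda record: cosine(q, record[0]),
                        reverse=True)
        return [text for _, text in ranked[:top_k]]


def build_prompt(question, store):
    # Retrieved text becomes the context the language model is asked to use.
    context = "\n".join(store.query(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


store = ToyVectorStore()
store.upsert("refunds", "refunds are processed within 5 business days")
store.upsert("shipping", "standard shipping takes 7 days")
# The policy changes: only the stored document is replaced, no model is retrained.
store.upsert("refunds", "refunds are processed within 2 business days")

prompt = build_prompt("how long are refunds processed", store)
print(prompt)
```

Because the knowledge lives in the store rather than in model weights, the prompt built after the update carries the new policy text immediately.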

29:01

And experimenting with different use cases is always really important. But I imagine that's, like, kind of what that is trying to address, which is the... That is basically it. You know, I don't wanna go down that rabbit hole. But, basically, the idea is that if you train an LLM or you have a layer on top of an LLM that not only does retrieval from a source document

29:22

store. Right? I think that's a pretty set pattern. But it also has a better understanding of your business, your industry, the jargon. Right. Right. Blah blah blah. Right? The idea is that the retrieval success rate will be higher. Now we're not publishing the numbers yet,

29:37

but the research is still ongoing. But basically, from what I've seen, well, I haven't seen the actual numbers yet, but from what I've been told by the researcher, it is a substantial improvement, and the juice is worth the squeeze in that regard. And computationally, you're not quite training the whole thing again. You're just kinda putting a new Instagram filter, so to

30:03

speak, on top of the base. So it definitely does some things. Now when we get the hard numbers and, you know, I can say them publicly, then I think we'll know what the squeeze-to-juice ratio is. But I can confidently say publicly now, like, there's a there there. Yeah. And, you know, we'll have those numbers soon enough. But it's interesting because you're right. I mean, this paper

30:33

came out in 2019. Right? There was just an explosion of these different mechanisms. You mentioned BERT. You mentioned RoBERTa. Fun fact, my wife's name is Roberta. So that was kind of fun. There was ELMo. There was ERNIE. There was a whole Sesame Street themed zoo of model types. That branching out in those different directions seems to have stalled, and we're going into more of

31:00

these retrieval augmented generation systems. So, because not all of our listeners know exactly what retrieval augmented generation systems are, could you give kind of a level 200 elevator explanation? Sure.

⁠¶ "Retrieval-Augmented Generation Explained"

31:15

So, when you speak to a modern chatbot, what's happening is that they've learned information through their pretraining process, on a large corpus of basically the entire Internet, and are generating information based on the query that you're passing in. The problem that often occurs is that these AI models might err, and the error could be making information up that doesn't

31:42

exist. For example, if a model was trained before a certain period of time, it might not know about that period of time, which happens more often than you think. The information could be false, untruthful, or it could just be incorrect in a way that's not, like, bad, but still not helpful. And the reason for this is the way that these

32:00

models are accessing that information. The idea behind retrieval augmented generation is that instead of having the model try to generate the correct document or the correct response given its pretraining process, you instead add factual content to the query that you're asking the model. You first search for that content, which is where the retrieval part comes, and then you augment the generation of what that model is going to create based on that content, hence

32:29

retrieval augmented generation. There's usually a querying step. So you take in a user query, you run it against some sort of database, usually a vector database. In our case, it could be Pinecone. You find a set of relevant documents. You pass those to the generating LLM. The generating LLM uses those documents to generate a final response. And it turns out that if you do this, you can reduce

32:50

hallucinations. And that makes sense because if the model is given true information and then conditions its generation on that information, it follows that the probability of generating information that is correct could be higher. That's a good explanation. So you're basically giving it a crash course in what documents you care about. Right? Exactly. Interesting. And that's a good segue because you work for Pinecone. So tell me about Pinecone. What is Pinecone?
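The retrieve-then-augment flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not Pinecone's implementation: embed is a toy bag-of-words stand-in for a real embedding model, the DOCS list stands in for a vector database, and build_prompt stops short of calling an actual LLM. All names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=2):
    """Retrieval step: rank documents by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(query, documents, top_k=2):
    """Augmentation step: put the retrieved text in front of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Assumed sample documents, standing in for an internal support knowledge base.
DOCS = [
    "Refunds are processed within five business days.",
    "Password resets are sent by email.",
    "Our office is closed on public holidays.",
]

prompt = build_prompt("How do I reset my password?", DOCS, top_k=1)
```

The resulting prompt carries the password-reset document as context, and it is this prompt, rather than the bare question, that the generating LLM would receive.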

33:20

Yeah. So Pinecone is a knowledge layer for AI. That's kind of the way we like to describe it. The main product that we provide is a vector database. So this is a way of storing information, information that has been vectorized, in a really efficient manner. And it turns out that if you have the ability to store information in this manner, you can search against it really quickly, with low latency, to find the things that you need to find, which is really interesting for

33:46

these types of semantic search and RAG systems. Pinecone has a few other offerings now that help people build these systems a lot more easily. There's Pinecone Inference, which lets you embed data in order to do that querying step. Pinecone Assistant, which lets you just build a RAG

⁠¶ "Pinecone: Efficient Vector Database"

34:01

system immediately just by upserting documents into our vector database, so on and so forth. But the reason why, like, you need a vector database is because of all of these advances in semantic search and embedding models. People have gotten really, really good at representing chunks of information using these dense, fixed-size

34:20

vectors. But once you have thousands, millions, even billions of vectors across tons of different users, you need a way of indexing this information to access it really quickly at scale, especially if your chatbot's gonna be querying this vector database really often. And so having a specialized data store that can handle that type of search becomes really useful. That's why Pinecone is here, and that's

34:42

why we exist. Interesting. Interesting. One of the other interesting things from your bio, aside from the origami. Tell me about this. So do you create the YouTube videos, or do you use your tools, or is it just part of your job as a developer advocate? So it is just part of my job as a developer advocate. Oh, okay. Like, often, you know, I do that because we are interviewing people or because there's a new

35:16

concept we wanna teach people, so on and so forth. Or we do a webinar, and we just upload it to YouTube. Oh, very cool. Very cool. Yeah. I started my career in developer advocacy. Once it was called evangelism. So I was a Microsoft evangelist for a while. So yeah. Yeah. Cool. YouTube is very important. Yep. But it also, I think, speaks to how people learn. YouTube University is very

35:47

real. Right? And Yep. You know, not a knock on traditional schools, not a knock on traditional publishing, but this space is moving so fast that if it weren't for YouTubers like 3Blue1Brown, I think his real name is Grant Sanderson. I think that's his real name. Somebody will send me hate mail if I get it wrong. But he is, like, really good at explaining these really abstract mathematical concepts. And unlike you, I didn't study math undergrad. I mean, I had to. I

36:19

only took the requirements. Right? But I have comp sci degrees. So, like, for me to kind of fall in love with math again or for the first time, depending on how you wanna say that, that was very helpful. And as for having an understanding of this: if you're a data engineer or, you know, wanna get into this space, vector databases, for a traditional kinda SQL, kinda

36:41

RDBMS person will look very awkward at first. But I know a lot of people that have made the transition, and they kinda love it. Right? Because in a lot of ways, it's way more efficient than, I dare say, traditional data stores. But when you're processing large blocks of text, it's really good for kind of parsing through that. But that's really cool. So, we do have the preset questions if you're good for doing those. I'll put them in the chat in case

37:09

you don't have them. Sure. They're not brain teasers or anything like that. They are pretty basic questions, and I will paste them in the chat. So the first question is, how did you find your way into AI? Did you find AI, or did AI find you? So this is a little bit of a

⁠¶ "AI Found Me: Intern to Innovator"

37:33

crazy story, but AI definitely found me. So when I was in college, when I was looking for my first internship, I couldn't find any internships, basically, because I had, like, no previous experience working in tech or anything like that. And the first company I worked for, Speeko, took a chance on me because they were building public speaking tools, to kind of help people learn how to do public speaking better, for an iOS app. And I had some

37:59

public speaking experience. They were, like, close enough. We'll have you come on and kind of help us, like, work things out. And while I was there, it was made very obvious to me how important it was to build very basic deep learning and AI systems to accomplish really specific tasks that could help serve an ultimate goal. Like, what we were trying to do is just, like, see how many filler words people are using or how quickly or slowly you were speaking.
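For a sense of what the two measurements mentioned here involve once a transcript exists, a rough sketch follows. It is not Speeko's method: the FILLERS set is an assumed word list, speech_stats is a hypothetical helper, and the hard part (transcribing audio into text) is taken as already done.

```python
import re

# Assumed filler list for illustration; a real product would be more careful,
# since a word such as "like" is only sometimes a filler.
FILLERS = {"um", "uh", "like", "basically"}


def speech_stats(transcript, duration_seconds):
    """Count filler words and estimate speaking rate from a transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    filler_count = sum(1 for word in words if word in FILLERS)
    words_per_minute = len(words) * 60 / duration_seconds
    return {
        "words": len(words),
        "fillers": filler_count,
        "words_per_minute": words_per_minute,
    }


stats = speech_stats("Um, so I think, like, this is basically fine", 6)
```

For this nine-word sample spoken over six seconds, stats reports three fillers and a rate of 90 words per minute.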

38:24

And that requires a lot of complicated processing because you have to do transcription and because you have to figure out what words are being said, so on and so forth. So kind of experiencing that and seeing that firsthand really opened my eyes to how powerful

38:38

the technology had been even back in, like, 2017. And ever since then, I started learning more and more and more about statistics, AI, natural language processing through my internships, learning more complicated problems, reading research papers, so on and so forth. And I got to where I am now. A lot of where I learned is just out of pure curiosity. Just like, okay. There's this new thing. I wanna learn

39:00

about it. That's where I wanna be. And that's kind of how I fell into large language models and AI, just by wanting to learn about what was going to happen and then eventually being there. So it definitely found me. I was not looking for it. Didn't even know I liked statistics until I started doing statistical modeling. And I was like, wait. This is really fun. I wanna do a

39:17

lot more of this. I wanna learn a lot more of this. And I knew that once I was in college and I bought a statistics book for fun, and I was like, okay. I'm past the point of no return. Like, this is definitely Right. Right. Right. Right. That might be one of the first times in history that that's been said. Right. Because I learned statistics for fun. I took stats in college. I hated it. Hated every minute of it. But when I got into data science,

39:44

the first two weeks were not fun. I'm not gonna lie. Yep. But just like the vi editor, once you stick with it, Stockholm syndrome kicks in, and you start loving it. That's cool. Two, what's your favorite part of your current gig? The favorite part of my current job is being able to learn interesting, fun, even complicated things in data science and AI, and figuring out how to communicate them to a wide

40:14

audience. It's a really fun challenge. It's really similar to, like, what 3Blue1Brown does all the time on the YouTube channel, and it's something that I get to learn and practice and keep doing. That's the best part of the job. I love learning things and, like, teaching other people about them and learning even more things. And the fact that I have an opportunity to do that every single day is, like, the best. That's cool. That's cool. We have three complete-the-sentence questions. When I'm

40:39

not working, I enjoy blank. When I'm not working, I enjoy baking sweet treats and goods. I can't have any dairy. So very often, I had to kind of give up a lot of the cakes and desserts that I loved eating when I was younger. So now I, like, spend my time trying to figure out how I can make them again without dairy so they taste really good. So that's something I really enjoy doing. Very cool. Next, complete the sentence. I think the coolest thing in technology today is blank. I

⁠¶ "Impact of Code Generation Models"

41:10

thought really hard about this question because we're living in a crazy time of technological development. But the thing that really stuck out to me and the thing that was also the moment for me when I started working with, like, chatbots and LLMs was code generation models. The first time I learned how to use GitHub Copilot specifically, I was completing some function, and it completed it before I was done typing it. And I was like, what the heck? This is amazing. Like, this

41:40

actually figured out exactly what I needed. And because I was still, like, a budding developer, it was extremely helpful because I could learn faster rather than already having a huge kind of store of knowledge in my brain and kind of pulling from that. So I could see it benefiting my workflow. So I think the development of those tools and modern tools like Cursor, so on and so forth, is extremely cool. And I can't wait to

42:01

see, like, what the next generation of those technologies will look like. Yeah. I mean, that's a great example. It's almost like you don't need, you know, the classic 10,000 hours to master a skill or something like that. It's almost like you can leverage the AI to take on the lion's share of the 10,000 hours. You're still gonna need to know something. You still have to put in some reps, but not to the degree that you used to.

42:23

No. I think that's gonna be very transformative. I mean, I'm learning JavaScript and Next.js on the side because it's something I have no experience in. Right. And I was able to build my personal website entirely through using Cursor and Progression. Nice. I often check that out. Which is insane. Right? Which is, like, really, really fascinating. And I'm not gonna claim to, like, suddenly be an expert in

42:45

Next.js or anything like that. Right? Right. Right. Right. Right. I still wanna learn, like, exactly what's going on under the hood. But having a project that you can kind of, like, tinker on that's, like, pretty small in scale and that you can kind of afford to make a few mistakes on, and having, like, an expert system kind of help you go through that, expert, quote, unquote, being close enough, is a really cool learning experience. No. That's a great way to put it because, like,

43:06

I don't have any apps on the modern devices. Right? Like, so, it would be nice if I had an Android app that could kick off some automation process that I have. Right? Or do some kind of tie-in with, you know, Copilot into that or things like that. Like, you know, I originally wrote a content automation system in .NET, but I ported it to Python with the help of AI. And I could well, that's just it. Right?

43:36

The true valuable resource in life is time. Right? Yes. It's not Yes. I mean, I could have done it by hand. I could have done it by myself, but it was one of those things where am I gonna do it because it's gonna take x number of hours or whatever? But if I can just kinda, here's the .NET version. This was before there was Copilot, so I pasted it into ChatGPT. And it basically spit out a Python

44:03

version, had some errors. You know, this was a while ago. But I was able to, inside of a day, get it done as opposed to before. Like, I know how my ADD works. Right? Like, I'll start it. First 3 days, working on it, grinding on it, and then I don't touch it again for 2 weeks. And it never gets built. But with this, I'm able to kinda harness the the spark of inspiration and and execute much faster. Now I think I don't think

44:29

people fully realize, like, you know, it's not all doom and gloom. Nobody's gonna have any programming jobs. There's a lot of upside too. And I guess that's just where we are in the hype cycle. As you said. Yeah. Yeah. Yeah. Exactly. That's a good segue into I look forward to the day when I can use technology to blank. I look forward to the day where I can use technology to get a high quality education on any subject for free. So Nice.

44:56

Free education is really important to me. A lot of what I learned about large language models, deep learning, all that stuff was from online courses that I took for free on places like edX, Coursera, so on and so forth. Or people sharing articles and kind of learning from them, or YouTube videos, or all that sort of thing, in addition to my education. But there's a lot of things you kinda have to learn after that. Right? And I think that especially with, like,

45:20

code generation models, it's, like, very easy to be, like, okay. Build me this app and, like, just make it work. And you can sit there for a couple hours,

⁠¶ Personalized Learning Path Technology

45:26

and it'll, like, work. But I think the missing piece is creating a structured kind of learning path that's, like, personalized to whoever you are for the thing that you're really interested in with the context of having, like, these tools that can help you do that thing. And I'm not sure if we have anybody or any offering that can kind of do that technologically, because you need a lot of information about what the

45:51

user knows or doesn't know. You need to be able to create ability, and then you need to be able to kind of create, like, an entire mini course that's personalized to whatever that person needs. But if we can do that, we can solve so many wonderful problems. Absolutely. I'm thinking about special education needs and things like that. I don't think we're that far off from this. No. But

46:12

the biggest issue is going to be just hallucinations. Right? And, hopefully, people can build, like, RAG systems using tools like Pinecone to kind of reduce those hallucinations. But for something like that specific use case, we probably need, like, another breakthrough in indexing information or kind of presenting it, or we need a process that really allows people to create this information quickly

46:34

and verifiably in order to kind of make that happen. But if that is a future that we can live in, where technology can kind of, like, help people learn, like, really important things really well, that would be wonderful. And I think that would be, like, amazing for humanity. Oh, absolutely. Share something different about yourself, but remember, it's a family podcast.

⁠¶ Mathematical Complexity in Origami Design

46:57

One of my favorite hobbies for about a decade is designing and folding origami. And it's really fun. It's very easy, but it's also very hard. There's a lot of complexity inside it as well. One thing people don't know about it is that there's a lot of mathematical complexity.

47:15

So once you get to a point where you wanna design a model with really specific qualities, really specific features, it suddenly becomes a paper optimization problem where you have, like, a fixed size square, and you have different regions of that paper that you're allocating to portions of the model you're designing. And it turns out that there are entire mathematical

47:37

principles and procedures to solve this problem. So much so that one of the leading, like, practitioners in the field is, like, this physicist who wrote a textbook on how to do origami design, and that's, like, the textbook everyone looks at to, like, learn how to solve it. Yeah. I'm not surprised. There's definitely a correlation with the mathematics of that. And I look at origami creations, and I'm just fascinated that could be done from a single sheet. Like, it's

48:03

just how is that I mean, that's just mind bending. And it makes sense that there's a mathematical side, because you have a certain type of constraint, and obviously folds factor into it and things like that. And, yeah, that's interesting. What's the name of that book? I should pick it up. It's called Origami Design Secrets. Got it. Alright. I will check it out. So where can people learn more about you and Pinecone? Of course. You wanna learn more about Pinecone? The

48:32

best place is our website, pinecone.io. You can also find us on LinkedIn and on X and other social media platforms. You wanna learn more about me? You can go to my LinkedIn, which you can find at Arjun Kirti Patel, or you can go to my website, which is also my name, arjunkirtipatel.com. Cool. And we can also check out your Next.js skills there too. Exactly. Hopefully, nothing is broken, but you can see how well I've gotten by

49:01

with the Awesome. Trust me. JavaScript alone is a frustration creation device. Audible sponsors the podcast. Do you do audiobooks? Is there a book that you would recommend? I do do audiobooks, but I've just started recently, so I don't have a huge audiobook library. But I am a huge fan of short story collections, and kind of the one that comes to mind is really anything by Ted

49:30

Chiang, who does a lot of kind of sci-fi short stories. If you've seen the movie Arrival, the short story that's based on is Story of Your Life, and it's wonderfully written. It's one of my favorite short stories ever. Yep. So highly recommend that. I believe the collection is called Story of Your Life and Others, something like that. So Oh, interesting. Careful with audiobooks. They are very addictive. So,

49:55

Audible is a sponsor of the show. So if you go to thedatadrivenbook.com, you'll get routed to Audible and you'll get a free book on us. And if you choose to subscribe, we'll get a little bit of kickback. It helps run the show and helps us bring some good stuff to the masses. So any parting thoughts? No. But thank you so much for having me on, Frank. This was a ton of fun. I learned a lot from you, and I hope I helped you

50:24

learn one small thing as well. Absolutely. It was a great conversation, and we'll let the nice British lady finish the show. And that's a wrap for this episode of Data Driven, where we

⁠¶ "Data, AI, and Origami Insights"

50:35

journeyed from the intricacies of vector databases to the surprising elegance of origami. A huge thank you to Arjun Patel for sharing his insights on retrieval augmented generation and his passion for making AI accessible to all. From turning raw data into actionable knowledge to turning paper into art, Arjun

50:54

proves there's beauty in both precision and creativity. If today's episode left you curious, inspired, or just itching to fold a piece of paper into something meaningful, be sure to check out Arjun's work and Pinecone's innovative tools. Remember, knowledge might be power, but sharing it makes you a force to be reckoned with. As always, I'm Bailey, your semi-sentient guide to all things data, reminding you that while AI might shape our future, it's the human touch or sometimes the paper fold that

51:23

gives it meaning. Until next time, stay curious, stay analytical, and don't forget to back up your data. Cheerio.

Transcript source: Provided by creator in RSS feed