Hi, I'm Matt Turck from FirstMark. For this first episode of 2025, we are starting with a bang, with an awesome conversation with Chip Huyen. Now, Chip is a well-known writer and computer scientist who has taught AI at Stanford, worked as an AI engineer at places like Nvidia, Netflix, and Snorkel, and has become a bit of a superstar in the AI community through her influential writing.
In today's episode, we discuss her brand new book titled AI Engineering, an impressive guide to building production AI applications on top of foundation models. This is a very meaty and educational conversation where we covered a lot of ground, including what is different about AI engineering ("Nowadays, anyone who wants to leverage AI to build applications can leverage one of those amazing available models to do so"), how to evaluate AI systems ("The more intelligent AI becomes, the harder it is to evaluate it. A lot of the failures are silent"), why prompt engineering is underrated ("People don't take prompt engineering seriously because they think there's not much engineering to it. Anyone can do it, but not many people can do so effectively"), why RAG is here to stay, why planning for AI agents is so hard, and so much more.
There's tons to learn in this episode, both for technical and non-technical folks. So please sit back and enjoy this fantastic chat with Chip. Chip, welcome. Hey Matt, it's great seeing you again. I'm a big fan of your work, love the jokes on Twitter, so it's really nice catching up after following you for so long. Appreciate it. So today we are going to talk about your brand new book published by O'Reilly, which is just coming out, entitled AI Engineering: Building AI Applications with Foundation Models — which, I must say, is incredible work. I spent a good portion of last weekend reading it and I thought it was amazing. Absolutely a must-read for anyone who's serious about the AI field. And in particular, what I found really interesting is that there's plenty for technical folks — there's math, there are in-the-weeds kinds of details — but equally, I found it very approachable for non-technical people, which is very hard to do. So again, I really enjoyed it. Congrats. And to jump to the punchline, people should absolutely get the book. What we're going to try to do today is give people a bit of a flavor for what's in it.
So obviously we're not going to cover everything, because it's 500 pages of goodness, but hopefully that will give people some kind of overview. Does that sound good? Yeah, thank you so much. And everyone listen to Matt — he knows what he's talking about. So appreciate it. All right, so let's jump into it. So at the beginning of the book, you make the point that while AI adoption seems new, it's built upon techniques that have been around for a while, like language models, some of which date back to the 1950s, and then retrieval techniques. But at the same time, it feels like a new field. So what is new about AI engineering, and how is it different from more traditional machine learning and MLOps techniques? Yeah, I think that's a great question, and I get asked it a lot: okay, what is AI engineering? Is it another marketing term? How is it different from traditional ML engineering?
So there is a lot of overlap between these two roles. And at a lot of companies, even people with the same title can have very different functions. So any definition is a little bit fuzzy and really depends on where you work and what you're working on. But in general, I think of ML engineering as when you have to build the models yourself. Before the availability of large language models, or foundation models that anyone can access, if you wanted to build ML applications you needed to build the models yourself, and only a few organizations could do that. But nowadays, anyone who wants to leverage AI to build applications can leverage one of those amazing available models to do so. It just makes it so much more accessible. And another thing: before, I had thought that a small improvement in AI capabilities could only lead to a small increase in the number of viable applications. We have known for a long time that if we put in more data and more compute, we get better models, right? But still, when ChatGPT came out, we were shocked — at least I was. I was in a group chat with a bunch of my friends and we were really, really shocked. And the reason is that just a small improvement in capabilities can lead to so many, many new applications. Suddenly people have so many new ideas, and at the same time it has become so easy to build applications. The energy, the number of companies — it's just growing exponentially.
It's a really, really exciting time. So there are a lot of new things that come with that. One is that evaluation has become so much harder. Before, with a lot of traditional ML — say we do spam prediction — we know the expected output: spam or not spam. So if the model outputs "not spam" and the real email is spam, then we know the prediction is incorrect. But now you ask the model to, say, summarize a book, and the summary looks quite coherent — you don't know if it's a good summary or not. You might actually have to read the book to find out yourself. And also, the more intelligent AI becomes, the harder it is to evaluate it. Take math problems: I think most of us can tell whether the solution to a first-grade math question is wrong or not. At least I hope so — with all the complaints about education going on, I'm not sure that's still the case. But for PhD-level math questions, very few of us can actually tell. So now AI can give answers that are not quite correct and we can't tell. A lot of the failures are silent. So it's really, really hard to evaluate. And another aspect is the question of the product. Before, you knew what you were going to build, because you were part of an organization: by leveraging data and ML you would speed something up or make more money. So you knew what you were going to build.
But nowadays you can do anything, so usually you start with the product: you start with an idea, a demo, and if it goes really well, then you start investing in data to make it better. And then if it works really well, it's like, okay, now we are paying too much money to OpenAI and Anthropic or Google, so we need to build our own model — and now we invest in the model. So the process now goes from product to data to model. Is that the reverse of traditional machine learning, where you would start with the data and then build the model and then build the product? Yes, yes, I think the process is kind of reversed. And it also brings people a lot closer together — product people and ML people and data people. On some teams it might even be the same person. So it requires engineers to have a much, much better product sense. A complaint I hear often is: building the application is really easy, but understanding what users want is really, really hard. So does it mean that a traditional machine learning engineer and an AI engineer may be two different people, two different professions? What is the overlap between the two?
So I see it done very differently at different companies. For example, I see a lot of companies that only have an ML team, so when they started adopting foundation models and generative AI, they tasked that team to go explore and build stuff, right? So there's overlap. I also see a lot of teams hiring separately for generative AI. So it really depends on the organization. One thing I do notice is that it's not a question of "or" — do I use a classifier or gen AI — it's more a question of "and". In the vast majority of gen AI systems I've seen, you have traditional or analytical ML components together with generative AI. Take, for example, a customer support chatbot. You get a request from a customer and you want to respond to it. Maybe you employ a bunch of models: a really difficult request, you want to send to the strongest and most expensive model, but an easier request you send to maybe a locally hosted open-source model. And some requests are sensitive — a billing complaint, say — so you might want to route those to a human operator, right? So when you get a request, you need an intent classifier to say, hey, what is this request about, and then you can route it accordingly. That entails a classification model, which can be a traditional ML classifier. Another example: when the model — maybe a fancy model — generates a response, what if it contains some personal PII information? You might have a detector: hey, does this contain PII or not? That can also be a traditional machine learning model. So I see them used together in a lot of cases, and I do think there's a lot of overlap. If you come from a traditional ML engineering background, you can just learn more about foundation models and how to work with them to become an AI engineer. And I also see people coming into AI engineering with absolutely no ML background, because you can just make an API call without really being able to explain gradient descent. There's a big debate on whether people need to know that to become an AI engineer or not, but I definitely see a lot of people building very good applications without a traditional ML background.
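To make the hybrid pattern Chip describes concrete, here is a minimal sketch; the classifier logic, model names, regex, and the call_llm helper are illustrative stand-ins, not anything from the book or the conversation.

```python
# Illustrative sketch: classic ML-style components (an intent classifier and a
# PII detector) wrapped around a generative model in a support chatbot.
import re

def classify_intent(request: str) -> str:
    """Stand-in for a traditional ML intent classifier."""
    if "billing" in request.lower() or "refund" in request.lower():
        return "billing"
    return "complex" if len(request.split()) > 50 else "simple"

def contains_pii(text: str) -> bool:
    """Stand-in for a PII detector (naive email/phone regex)."""
    return bool(re.search(r"[\w.]+@[\w.]+|\d{3}[- ]\d{3}[- ]\d{4}", text))

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[{model}] draft reply to: {prompt}"

def handle(request: str) -> str:
    intent = classify_intent(request)
    if intent == "billing":                      # sensitive: hand off to a human
        return "escalate_to_human_operator"
    model = "large-hosted-model" if intent == "complex" else "small-local-model"
    reply = call_llm(model, request)
    return "hold_for_review" if contains_pii(reply) else reply

print(handle("Where is my package?"))
```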
And just to play it back, what you're seeing in the field is hybrid systems that combine foundation models and generative AI with traditional machine learning models. I'm just repeating this because it seems to be something that just about every practitioner in the field sees and agrees with. However, in the general public there seems to be a narrative that generative AI is completely ripping out and replacing all forms of AI that came before it. But just to confirm, that's not what you're seeing at all. I would love to see people unseat XGBoost; I think it's an uphill battle. So the generative AI stack — the foundation for building generative AI applications — what are the different components that people should know? For building applications, I would think about it in terms of the development process, and the stack should evolve
to address your needs, right? When you start on an application, maybe you don't start by thinking about training models at all — you start by testing out the models. So you might want to do some prompt engineering and see how far you can get with that, and you'll need to curate some evaluation data — you definitely need to design some evaluation metrics. I think of this as the application development layer: prompt engineering, maybe how to enforce structured outputs, security guardrails, and definitely evaluation. Then, after you've worked at the application layer and maxed out the performance there, you might say, hey, maybe we need to change the model — maybe we need to fine-tune it, make it smaller and faster, do inference optimizations. That's the layer where you actually make changes to the model itself: the model development, or fine-tuning, layer. And then after that, I think you go into infrastructure. You deploy the application, you need to scale up, and you have to think about things like what your data store is. So that's the infrastructure layer, where you start building a platform that can make deployment and iteration faster and more reliable. So basically it's three layers: the application development layer on top; then the model development layer, encompassing both developing a model from scratch and fine-tuning or making changes to models — ideally you don't have to build a model from scratch; and then at the bottom, infrastructure powering everything. So scale, which you mentioned a second ago, seems to be — or is very much, I should say — at the heart of the entire generative AI approach. What is it that's so special about language models that makes them so responsive to this scaling approach, the approach that led to the ChatGPT moment?
Yeah, that is a really good question, and I feel like it's one a lot of people take for granted today. But it was not obvious before that language would be the way to scale intelligence, right? In the early days — at least when I got into AI, around 2014 — people were still debating: would it be computer vision, or language, or reinforcement learning? The argument for computer vision was that we developed the ability to see way before we developed language, so maybe seeing is the way we scale intelligence. So it was not obvious at all. And I remember back in 2017, I was at this OpenAI party and somebody told me: hey, guess what, we just keep throwing text at this model and now it's pretty smart. So people were starting to realize: if we just keep feeding in more text, we get much, much better models. And that's why language modeling won out over other text tasks like machine translation. I was actually very bullish on machine translation — it was a really difficult task back in 2014, 2015, and now people say machine translation is pretty much solved for the major languages. We still have a long tail of lower-resource languages, of course. But the thing about language modeling is that it's a very simple task, and it's really elegant. The idea is to capture enough statistical information about the language that you can predict what comes next in a sequence. If I say "my favorite color is", the model should be able to predict that "blue" is more likely than, say, "car". Is that what you call autoregressive models?
Is that right? Yeah, autoregressive is definitely one type of language model. The idea behind all language models is that you encode statistical information about language — and that concept is actually not new. People used it to break codes during World War II, and people have used it for games. So autoregressive models, like you mentioned, predict what comes next, whereas masked language models take context from both before and after and predict what's in the middle. And for both kinds of language modeling task, the data is abundant. If you have a task like machine translation, you have to curate pairs: here is the original sentence, here is the translation — and that can be quite painful. But for language modeling, you can just use any natural text, and there's so much of it online. And not just natural text: you can use programming languages, you can use code bases. And I think
it's that nature of it — you don't need to curate labels, reference data, to train the models — that makes language modeling so much easier to scale than other types of tasks. And maybe to put terms around it, can you quickly define for us supervised versus unsupervised versus self-supervised? Yeah. So supervised learning is the approach where you purposefully curate the labels for the model to train from. Fraud detection, for example: here are the transactions, here are the labels, and the label can be fraud or not fraud. Or spam detection: the data is an email and the label is spam or not spam. That process of manually creating the labels, and having the model learn from those labels — that's supervision. At the other end of the spectrum is unsupervised: you don't tell the model the labels, and the model figures things out by itself — clustering, for example. If you throw a lot of articles at the model and say, hey, try to group these into five groups, you don't need to tell it, okay, this group is technology or whatever — the model can do it. A lot of clustering algorithms are unsupervised. Language modeling is somewhere in the middle: it's self-supervised. It still learns from labels — the next word in a sequence is the label it needs to learn to predict — but those labels come naturally. You don't need to manually curate them; you just take any text and you can generate a bunch of training samples from it. So yeah, that's self-supervision.
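As a tiny illustration of what "the labels come naturally" means, here is a sketch of how next-word training pairs fall out of raw text with no annotation at all (the sentence is just an example):

```python
# Self-supervision in miniature: every prefix of a raw sentence yields a
# (context -> next word) training example for free.
text = "my favorite color is blue".split()

examples = [(text[:i], text[i]) for i in range(1, len(text))]
for context, label in examples:
    print(context, "->", label)
# ['my'] -> favorite
# ['my', 'favorite'] -> color
# ['my', 'favorite', 'color'] -> is
# ['my', 'favorite', 'color', 'is'] -> blue
```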
And still on the topic of scale, why does it matter how big a model is, in terms of, you know, millions or billions of parameters? What difference does it make? That is a very interesting question as well. I love how you're asking these very deep philosophical questions, and I feel like I really need a whiteboard to explain all of this — so if you don't understand me, trust me, my writing is better than my speaking. So why does it matter that a model has a lot of parameters? The number of parameters is, roughly, a proxy for the model's learning capacity. You can think of more parameters as more ways for the model to learn and store information. I'm trying really hard not to reach for the neurons-in-the-brain analogy — more synapses, you can learn more — because it's not that equivalent, but basically, more parameters means more learning capacity, and more capacity allows the model to learn more. Which leads to an interesting question: why do larger models need more data to learn? Because you'd think, if the model has more capacity to learn, shouldn't it need less data? If someone is smarter, they learn faster from less data, right? Yeah — but the idea is that because the model has more capacity, it would be a waste of that capacity if you don't give it more data. So yes, you can train a large model on a very, very small dataset, but that would be a waste of compute and a waste of the model's potential — you might achieve much better performance by training a smaller model on that same small dataset. So larger models can learn more, and giving them more data lets them maximize their learning potential
and handle much more powerful tasks. Could you go into some other approaches to making a smaller model very performant? In particular, I'm thinking of mixture of experts. Can you maybe define for us what that is and what the general goal is? So the goal of making a smaller model better is actually a very important one — it's what everyone is trying to do. One thing I want to point out is that what we consider small or large is very time dependent: what was considered large ten years ago is considered tiny today, and what is considered large today might be considered small in the future. And we have seen, time and time again — take the Llama model families: a smaller model in the Llama 3 family probably performs better than a bigger model from the first Llama generation. Over time we simply learn how to make models perform better at smaller sizes. So how do you make a smaller model better? One, better data — we have seen that higher-quality data can lead to better performance; people have shown that a lot. Then better training techniques, new alignment techniques.
And also different architectures, like the mixture of experts you mentioned. Mixture of experts is an interesting term, because it has been reused with different meanings over the years — before, we had expert systems, and "expert" there means something different from what people call mixture-of-experts models nowadays. The idea is that — these are not human experts — you divide the model into different heads, and each head specializes in certain things. But instead of training each expert entirely from scratch, each with its own full set of parameters, you make the experts share some parameters, and there's a router in the middle that determines which head is the most suitable for a given input. So these different components share parameters, which makes the whole thing parameter-efficient while still able to handle multiple complicated tasks. So I think that's definitely one pretty interesting approach.
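A rough sketch of the routing idea, under heavy simplification (a single input vector, top-1 routing, tiny random weights); real mixture-of-experts layers sit inside a transformer and route every token across many experts:

```python
# Minimal mixture-of-experts routing: a router scores the experts, only the
# top-scoring expert runs, so most parameters stay idle for any given input.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 16, 4

experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router                              # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # softmax over experts
    top = int(np.argmax(probs))                      # top-1 routing
    return probs[top] * np.tanh(x @ experts[top])    # only that expert computes

print(moe_forward(rng.normal(size=d_model)).shape)   # (16,)
```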
But mixture-of-experts models are also harder to train. I usually don't see people wake up and say, hey, I'm going to make a mixture-of-experts model today — once you try to do that, it's pretty hard. For a lot of people, the lower-hanging fruit is something like quantization, which works really well pretty universally, across tasks and across models. Another thing people try is distillation: you have a bigger model teach a smaller model, so the smaller model learns to mimic the behavior of the bigger model. So yeah, it's a very fascinating question you asked — I might write another book about it.
You heard it here first. Okay, so speaking of training, walk us through, in a nutshell, the different phases of how you train those models. There's pre-training, but in particular I was very interested, reading the book, in the post-training phase — there's a lot more to it than I had read about previously. So maybe walk us through the steps, please. Yeah. One caveat before I go into it: I kind of hate the terms pre-training and post-training — they're a little bit confusing. I feel like the AI research community is really great at many things, and naming is really not one of them. So, the process of training — of creating a model like ChatGPT — has a pre-training phase, which is when you train the model on the language modeling task. During this phase the model gets really, really good at predicting what comes next — it's completion. You say "to be or not" and it completes with "to be". It's very good at that. However, people realized that completion is good, but it's not very useful day to day. Because let's say I ask it: how to make pizza?
It might answer "for six?", because it's trying to complete the sentence — "how to make pizza for six", right? So a lot of the time, completion is not the same as solving a task. That's where post-training comes in: you take this model, which has a lot of statistical information about all the knowledge of the world, and teach it how to respond in a way that is helpful to the humans interacting with it. So in this phase, called post-training, people use multiple techniques — what I'm going to describe is not the only way to do it, but it's definitely a common way. In the first phase of post-training, supervised fine-tuning, you curate a bunch of examples: here's an instruction from a human, and here's how to complete that instruction. So if the instruction is "write me an essay about how wonderful something is", the response is that essay. You train the model to mimic that behavior: here are instructions, and here is how to respond. Another phase that's pretty common is when you actually try to get the model to maximize the chance of generating good responses and lower the chance of generating bad responses. For that you can use techniques like reinforcement learning — RLHF, reinforcement learning from human feedback — or DPO, direct preference optimization. So basically there are a lot of techniques around post-training. Unfortunately, a lot of the labs doing it aren't really publishing papers about it, so for a lot of this work you just need to know who to talk to, interview the right people, and try to get them to say things off the record.
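For a sense of what those two phases consume, here is a sketch of the data formats involved — the field names and examples are made up, and different labs structure this differently:

```python
# 1) Supervised fine-tuning (instruction tuning): demonstration pairs that
#    teach the model to respond to instructions rather than just complete text.
sft_example = {
    "instruction": "How do I make pizza?",
    "response": "Start with the dough: mix flour, water, yeast, and salt...",
}

# 2) Preference data for RLHF / DPO: the same prompt with a preferred and a
#    rejected answer, so the model learns which responses humans favor.
preference_example = {
    "prompt": "How do I make pizza?",
    "chosen": "Start with the dough: mix flour, water, yeast, and salt...",
    "rejected": "for six? It depends on how hungry everyone is.",
}
```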
Is that because that's a big part of the secret sauce that those commercial proprietary labs want to preserve? Yeah, I think about it like this: what makes the Claude models so different from ChatGPT or Gemini? For the pre-training phase, a lot of companies have essentially the same data, because everyone is scraping the internet, right? And for the language modeling task, everyone is optimizing for the same thing — entropy, or perplexity. So what makes these models really different is the post-training phase: they curate data differently, and they have different ways of collecting human preferences and training on them. So I do think post-training is what makes these big labs' models different. You mentioned a term in the book, sampling, that
is very interesting — and you say it's very important for understanding how these models behave. Can you maybe define it for us? Sampling is really fascinating. Actually, writing the section on sampling was one of the parts that brought me the most joy, because I really like it and I feel like the topic is really underrated. So sampling is the process by which a language model picks one output out of the many possible outputs. We talked about how a language model encodes statistical information about language. So say the model looks at a question and the possible next outputs: 70% yes, 20% no, 10% maybe. The model looks at all these possibilities and decides what to pick next — maybe 70% of the time it picks "yes", 20% of the time it picks "no", and 10% of the time it picks "maybe". That is the sampling process. And the model doesn't sample just once — a response is not just one token or one word, so it samples over and over and over again. So sampling refers to the different techniques, the different strategies, for nudging the model to pick the output that is most valuable to you. For example, one thing people use is temperature: you can nudge the model to pick
more frequent tokens. In the simplest case, you nudge it to always pick the most likely token — but then you'd notice the model becomes quite boring, because it always picks the most common, most frequently spoken phrases. For example, if you ask, hey, what's your favorite color, it will say "my favorite color is blue" or "my favorite color is red" — very simple. It would be pretty unlikely to generate something like "my favorite color is the color of a blue sky reflected in still water" — something creative. So if you want it to be more creative, you want to nudge the model to sample things that are less frequent. But here's the tricky part: if you push the model to pick something really rare, it might become incoherent as well. So basically, sampling refers to this whole family of strategies for nudging the model to generate the responses that are most suitable for your task.
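Here is a minimal sketch of what temperature does to a next-token distribution — toy numbers over three candidate tokens rather than a real vocabulary:

```python
# Temperature-scaled sampling over a toy next-token distribution.
import numpy as np

tokens = ["blue", "red", "car"]
logits = np.array([2.0, 1.0, -1.0])          # model's raw scores per candidate
rng = np.random.default_rng(0)

def sample(logits, temperature):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                      # softmax -> probabilities
    return tokens[rng.choice(len(tokens), p=probs)], probs

for t in (0.2, 1.0, 2.0):
    picked, probs = sample(logits, t)
    print(f"T={t}: probs={np.round(probs, 2)}, picked={picked}")
# Low temperature concentrates probability on "blue" (boring but safe);
# high temperature flattens the distribution, so rarer tokens show up more.
```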
And I do think it's fascinating — it's interesting and useful because it's a cheap way to improve the application's performance without having to retrain the model. It's also really useful for debugging applications. For example, maybe the model's output looks reasonable, but you can look at the probabilities: if you ask a bunch of yes-or-no questions and the correct answer is yes, but the probability the model assigns to "yes" is really low, then maybe the model is not confident, and you want to look into that. So yeah, sampling is very cool. All right, so let's go into the general topic of evaluation, which you mentioned up front is one of the key things in designing these AI systems. Actually, there's a sentence in the book that I really liked, that summarizes it all: "As teams rush to adopt AI, many quickly realize that the biggest hurdle to bringing AI applications to reality is evaluation. For some applications, figuring out evaluation can take up the majority of the development effort." You mentioned some of the specific challenges — why is it so hard to evaluate? How does one
think about evaluating those models? Yeah, evaluation is hard. So there's a term I use — what I call evaluation-driven development. It comes from software engineering, from the concept of test-driven development. The idea is that you should only develop applications that you can evaluate. And even though I know that can sound like the latest marketing buzzword — one thing I've realized from working with a lot of tech executives is that they're actually really smart. Surprise: you become the SVP of a giant corporation because you're pretty smart. And a lot of business decisions are still made based on return on investment, so it's really hard for people to double down on something if they can't say: this is making real money for us. So it's not a coincidence that some of the most popular AI applications today are those where you can evaluate the output pretty clearly. For example, recommender systems — everyone has a recommender system nowadays — because with a recommender system you can tell how much money it's bringing in, by whether it's increasing, say, click-through rate or purchase rate. You can say: okay, after we launched the recommender system, our purchase rate increased by two or three percent — of course you have to account for all the confounding factors, like campaigns. Or take fraud prediction — it's very common nowadays because you can tell very clearly how many fraudulent transactions you were able to flag and stop. And for generative AI,
one of the most common use cases today is coding. There are many reasons why coding is popular, and one of them is that it's a lot easier to evaluate than other use cases: when you generate code, you can check that it compiles and that it produces the expected output. And testing code is not new in software engineering — people have been doing all different kinds of tests, unit tests, integration tests — so people know how to evaluate generated code. So yeah, coding being easy to evaluate is actually very important. I do think that no matter how exciting a use case seems to be, if an enterprise doesn't see a way to evaluate its outcome, it's very hard for them to adopt it. So I do think evaluation is the biggest bottleneck for AI adoption: unless we can develop reliable ways to evaluate an application, that application is not going to get adopted. Or maybe it can, if some billionaire just keeps funding it — but yeah, it's challenging. So what are the key concepts around evaluation? You mentioned a couple of those
terms, like entropy and perplexity. What are the criteria, what are the methods, what should people know about? Yeah. So you mentioned entropy and perplexity. They're really fascinating concepts, and they actually guide the development of language models. But because most people today are not going to build a language model from scratch — you might do it for fun, but not at a scale where you can compete with OpenAI — entropy and perplexity are useful to know, but they're not what you're going to use day to day to evaluate your applications. Still, I can talk about entropy and perplexity, because they're really, really cool concepts. One thing I want to mention is that we want entropy to be lower — basically, to make things more predictable. For example, if the model is getting really good at predicting the next token, that means the training data, the language, has become more predictable to the model, so the entropy is now lower. And over time people found out: hey, if I can just decrease entropy somehow, users are happier — more users are using the applications and they're happy. And then the question is: how far can it go? How low can the entropy go? Because it absolutely can't go to zero.
People have been talking about whether there's a lower bound — how far can you go with entropy, how much room is left to push the performance of these language models — and there's this concept of irreducible loss. Language has a certain amount of inherent unpredictability, right? I don't think we'll ever reach the point where we can predict the next token perfectly, because there's always some variation in the way we speak. So I do think there is an irreducible loss. And I'm not sure if you saw, a bunch of people have been talking recently about the end of pre-training. I think there are multiple possible reasons: one is that we run out of data for pre-training; the second is that our perplexity — the entropy of these language models — is already pretty, pretty low, and it might be very, very close to what is theoretically possible. When Claude Shannon introduced the concept of entropy, he did some pretty fun exercises — he asked the question: what is the entropy of the English language? So that could be the lower bound.
But he did that in 1950, based on very, very short sequences — maybe ten characters, ten words. And the interesting thing about entropy is that the longer the preceding sequence — the longer the context — the easier it is to predict the next token. If you just tell me one word, it's very hard for me to predict the next word. If you just say "I", it could be "I am", "I want", "I love", "I hate". But if you give me a pretty long sequence — say, "today I would like to welcome my..." — I can guess that the next word will be "guest". So the longer the sequence, the more predictable the next token, and the lower the entropy. In the 1950s, that exercise used very short preceding sequences, so I would really love to see somebody redo that study today, but for really, really long sequences.
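For intuition, here is a tiny worked example of entropy and perplexity over a made-up next-token distribution (the numbers are illustrative only):

```python
# Entropy measures how unpredictable the next token is; perplexity is
# exp(entropy), roughly "how many equally likely tokens the model is choosing
# between". A perfectly predictable next token gives entropy 0, perplexity 1.
import math

probs = {"blue": 0.5, "red": 0.3, "green": 0.15, "car": 0.05}

entropy = -sum(p * math.log(p) for p in probs.values())   # in nats
perplexity = math.exp(entropy)
print(f"entropy = {entropy:.2f} nats, perplexity = {perplexity:.2f}")
# Because language always has some irreducible unpredictability, the floor
# (entropy 0) is never reached.
```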
And so if a concept like entropy is not something that the people who build these AI systems in real life — not the model developers, but the AI engineers who deploy AI systems — need to worry about or use in their evaluations, what should they use? What are some of the key techniques and key concepts for evaluating AI systems in production, in real life? Yeah, so as the saying about corporations goes, the goal is to make money — I think the ultimate metric is whether it's making you money or not. But exactly as you brought up, many things influence whether a company makes money or not. So for an application, it's really important to understand the use case, so you can design the right set of metrics, and then you walk backward from that and map it to the model metrics you care about. So let's say, for example, you're building a text-to-SQL model. I feel like back in 2023, every week some engineer came to me saying, hey, check out my new text-to-SQL model. Yep, pretty much. I was like, wow —
people would do anything to avoid reading SQL queries. Yeah. So let's say you're a data company, and users usually interact with your data using SQL, and writing SQL is painful, so you let them write natural language instead. So you have a text-to-SQL model — how do you know the model is good? Maybe you start from your own perspective: why do you want to develop this model in the first place? Maybe you want to improve users' productivity, so you can use a metric like time to completion. Maybe before, without the tool, users would take on average three minutes to write a single query, but now, with the tool, it takes only one minute — having that kind of metric is very useful. Or take customer support — it's similar. Maybe before, it took two hours for an agent to get to a user, but now you can respond instantly. But that's not always the right measure, because by default, if it responds automatically, it will always be fast, right? So you need to think more about whether users are actually happy. That is the ultimate evaluation metric. It really depends on the use case, on your company, and on what you care about. But then you work backward from that: okay, I don't want to deploy the application yet, because I want some validation offline, to be able to know whether it's good or not. So you need to build a sort of offline evaluation system to validate that.
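As one concrete example of such an offline check for the text-to-SQL case, here is a sketch of execution-based evaluation — the schema, queries, and the idea of comparing result sets are illustrative, not a prescribed method from the book:

```python
# Execution-based evaluation: run the generated SQL and a reference SQL
# against a test database and compare the results, not the query text.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

def execution_match(generated_sql: str, reference_sql: str) -> bool:
    """Same result set counts as correct, even if the SQL text differs."""
    got = conn.execute(generated_sql).fetchall()
    want = conn.execute(reference_sql).fetchall()
    return sorted(got) == sorted(want)

print(execution_match(
    "SELECT SUM(amount) FROM orders",
    "SELECT TOTAL(amount) FROM orders",   # different text, same answer
))  # True
```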
Still on the topic of evaluation, an interesting concept you talk about is AI as a judge: somebody or something needs to evaluate the system, and it could be a human or it could be AI. What are your thoughts on the pros and cons of AI as a judge? Yeah, I do think AI as a judge is a very promising approach. When ChatGPT first came out, the assumption was that AI is not reliable enough to be entrusted with a crucial task like evaluation. But nowadays, when we talk to teams, I think most teams have some variation of AI as a judge going on. So AI as a judge is pretty interesting. The idea is that you have an AI evaluating the outputs of another AI, and it's especially useful in production. Let's say you use a model to generate a response, and a lot of people are worried: oh my god, what if the response is not safe? What if the response is crazy? What if it says something that gets me sued? So maybe you can have another model double-check that response, give it a score, and send it back. AI as a judge has been shown to correlate pretty strongly with human judgment. The tricky thing about AI as a judge is that it's not as objective as other metrics, like an F1 score.
What that means is: when somebody says F1 score, you know what it means, you know how it's defined, and if I compute the F1 score again — even with my own F1 score code — I should get the same number. But for AI as a judge, the result really depends on what the judge is: which model is the judge, and what the prompt is. And one thing I've noticed is that a lot of these judges evolve over time. With evaluation, you ideally want the evaluation method to be stationary, so you can benchmark the application over time. Say yesterday the evaluation metric was 90% and today it's 92% — then you know your application is getting better. But with AI as a judge, what can happen is that the judge itself changed, so the 90% and the 92% are not comparable. I once talked with a team at a pretty big company — and this is a pretty common scenario at a lot of companies, especially bigger ones — where one team develops the AI judges and writes the prompts for them — maybe the judge produces a faithfulness score or a relevance score — and a downstream team just uses the judge. One engineer came to me and said, hey, we have a faithfulness score of 90%. I asked, okay, that's great — what's the prompt you use for the judge? And he said, I actually don't know, I just use it off the shelf. So I do think it's really, really tricky when you don't have control over, or visibility into, the judge.
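To picture what such a judge looks like in practice, here is a sketch of a faithfulness judge; the prompt wording, the 1-5 scale, and the call_llm placeholder are assumptions, and, as Chip points out, changing any of them changes what the score means:

```python
# An "AI as a judge" sketch: one model scores another model's answer for
# faithfulness to the retrieved context.
JUDGE_PROMPT = """You are grading a customer-support answer.

Question: {question}
Retrieved context: {context}
Answer: {answer}

On a scale of 1-5, how faithful is the answer to the context?
Reply with only the number."""

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real LLM API call, so the sketch runs end to end."""
    return "4"

def judge_faithfulness(question: str, context: str, answer: str,
                       judge_model: str = "some-judge-llm") -> int:
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    return int(call_llm(judge_model, prompt).strip())

print(judge_faithfulness("How big is the lot?",
                         "The lot size is 5,200 square feet.",
                         "The lot is about 5,200 sq ft."))   # 4
```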
So we were just talking about prompts — let's turn to prompt engineering. You wrote that prompt engineering's ease of use can mislead people into thinking that there's not much to it. So maybe a quick reminder on what prompt engineering is in the first place, and how should AI engineers think about it or approach it? Yeah — when I mentioned that I wrote a section on prompt engineering, I did have a few people roll their eyes: oh my god, prompt engineering. A lot of people don't take prompt engineering seriously because they think there's not much engineering to it. So maybe let's go back to what a prompt is. A prompt is how you communicate with a model. I think of writing prompts the way I think of writing in general — does that make sense? You can think of writing prompts as human-to-computer communication, and just like human-to-human communication, anyone can do it, but not many people can do so effectively. Because anyone can write a prompt, people assume
it's so easy, there's nothing to it. And it was especially misleading early on, when there was a lot of hackiness in how people wrote prompts. I think one of the funniest tips I saw on prompt engineering was: tell the model, "answer correctly and I will give you two dollars" — basically bribing the model. So people see tricks like that and conclude that prompting is just tricks. But even though there is a lot of tweaking — tweaking the instructions to get what you want — I do think it can be very systematic, and you need to make it systematic. If you consider each prompt an experiment, you should be versioning your prompts, and you should be able to systematically track your progress across different prompts. You don't want a situation where somebody makes random changes to a prompt, nobody has any idea what's going on, and downstream people have no idea what changed in the application or why the output is different. So prompting is very easy to get started with, but doing it effectively does require a lot of practice, and a lot of discipline to do it systematically. What is in-context learning, when it comes to prompt engineering?
That is very — by the way, do you think most of the audience would know what in-context learning is? No, but they're going to learn, thanks to you. Okay. So in-context learning is — nowadays it's one of those things people take for granted, but it was a pretty novel idea when it came out. When we talk to a model, we give it instructions, we give it some information. The entire thing you feed into the model to get it to do what you want is the context. So you give the model a bunch of information to get it to do what you want — that's the context. And if you give it some examples, like: if I say car, you say vehicle; if I say banana, you say fruit; if I say house, you say building — then if you say "train", maybe it will know to output "vehicle". When you put the examples into the prompt, the model is able to learn from those examples and output the correct category, the correct output. And that idea was novel when it came out, because before, if you wanted
the model to produce the correct output, you had to train the model specifically for the task. If you wanted to predict the category of an object, you needed to curate training data mapping names to categories and train the model on it. But now, with a language model, you don't need to train the model from scratch — you just put some examples in the prompt, and you get it to exhibit the behavior you want. So that's in-context learning: the model learns from the context given to it. And there's this whole terminology of few-shot learning and zero-shot learning. Zero-shot is when the model can do what you want without any examples at all — for example, you say "tell me whether this email is spam or not spam", give it just the email and no other examples: that's zero-shot learning. But if you give it, say, five emails, each with a label — spam or not spam — and then the email you want classified, now you have five examples for the model to learn from, and that's five-shot learning. So it's a really big deal, right? Because, as you said, you don't have to go back and retrain a whole model with brand-new data — you can extend an existing model with knowledge without any kind of coding or training. Yeah, it's a big deal because it makes language models so versatile. It's what makes them general-purpose: before, if you wanted a model for a task, you needed to train a model for that task, but now you have a model trained generally for language modeling, and you can just adapt it to any task with in-context learning.
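Written out literally, the few-shot prompt Chip describes is nothing more than this (the examples are hers from a moment ago; a capable model completes the last line with "vehicle"):

```python
# In-context learning: the task is defined entirely by examples in the prompt.
few_shot_prompt = """If I say car, you say vehicle.
If I say banana, you say fruit.
If I say house, you say building.
If I say train, you say"""
# Zero-shot would be the bare question ("Is this email spam? <email>");
# five labeled emails in the prompt before the question makes it five-shot.
```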
And there's a part in the prompt engineering section that I thought was particularly interesting, around defensive prompt engineering — defending against jailbreaking, information extraction, that kind of thing. Can you talk about that, and maybe what we have learned about making these AI systems resistant to those kinds of attacks? So I do think the topic is getting increasingly important, for two reasons. First, AI is being used for more high-stakes tasks and more complex tasks; second, AI now has increasing access to tools, and it can make changes in the world. So we want users to be safe — and not just users, but also the developers of those applications — so that nobody gets hurt. It's also one reason a lot of people go to proprietary models instead of open-source models: if you use a model developed by a company through their API, those companies are responsible for putting guardrails in place to make sure the model behaves safely and doesn't say anything racist or sexist. If somebody asks it to praise Hitler, it can say, no, I'm not going to do that. There are a lot of guardrails around this. But if you use an open-source model
and you deploy it yourself, you're responsible for it yourself. Of course, open-source model developers try their best to make their models safe as well, but they also have less visibility into how the open-source models are being used, which gives them less information to make the models safe. So defensive prompt engineering is very important: it refers to writing your prompts in a way that makes the model behave safely. For example, nobody outside those companies actually knows what the ChatGPT system prompt or the Claude system prompt is, but I bet they contain a lot of language telling the model how to respond: if the user asks this kind of request, say this; do not do this, do not do that. And maybe just to make sure everyone follows: there's the user prompt, but there's also the concept of a system prompt, which is the prompt behind the scenes, right? The prompt behind the prompt: once you, as a ChatGPT user, have typed something in, there's a second layer that tells the model to do and not do certain things. Is that fair? Yes, Matt — I think you're going to be a great teacher, because you clearly read the book. Thank you. Yeah, so you do have the user prompt and the system prompt. For the system prompt, say I'm an application developer and I build an app where, given the disclosures for a house, users can ask questions about the disclosures. As the application developer, I write a system prompt like: hey model, act like a real estate agent; given the disclosure, answer the user's questions; act professionally, be nice, be kind, be brief. And then the user prompt is: here's the disclosure, tell me how big the lot is, are there any noise complaints. So the system prompt is created by the application developer, and the user prompt is what users type when they interact with the application, in whatever language they use. So yeah, the system prompt is where you put in all this language to make sure the application acts the way you want it to act.
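In an API call, that split typically looks something like the following; the role/content message format follows the common chat-completion convention, and the exact fields vary by provider:

```python
# System prompt (written by the application developer) vs. user prompt
# (written by the end user at runtime), for the real-estate example above.
messages = [
    {
        "role": "system",
        "content": "Act like a real estate agent. Answer questions using only "
                   "the attached disclosure. Be professional, kind, and brief.",
    },
    {
        "role": "user",
        "content": "Here's the disclosure: <...>. How big is the lot, and are "
                   "there any noise complaints?",
    },
]
```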
All right, let's spend a couple of minutes on another very important part of AI application architecture, which is RAG. Maybe a quick reminder for folks about what RAG is. And then one specific point that caught my attention: you wrote that many people think a sufficiently long context window will be the end of RAG, but you don't think so. What do you mean by that? Yeah. So RAG was originally developed to get around short contexts. I think the original paper came out around 2017 or so, and I reference it in the book. The idea is that for knowledge-intensive applications or questions, where you cannot fit all the knowledge needed into the context, you need to retrieve it from somewhere. For example, if you need to rely on Wikipedia, you cannot fit all of Wikipedia in your context, so maybe you find the articles most relevant to the question, retrieve them, and put those into the context instead. So that's the original idea — it was designed to get around that limitation. And then some people say: okay, as models get really long contexts, maybe you don't need RAG anymore, because you can just dump your entire enterprise database into the context and let the model retrieve from it. I think the question of whether long context can make RAG obsolete is similar to asking whether larger laptop memory will make data centers obsolete. No matter how big my phone's memory or my laptop's memory is, I will always run out of memory. We only ever have more and more information, so I think we will always expand our usage to fit whatever context length
is available. And the second thing — and I think this reason may actually be more important, at least for now — is that just because a model can fit a million tokens in its context doesn't mean it can process that million tokens efficiently. For example, I was actually using Claude and ChatGPT for some novel writing — I really want to write a math novel — and what I found is that, because I'm writing a story, the model needs to keep track of events in the past. And I found that with anything over roughly 10,000 tokens of input, the model would just forget things or confuse the timeline — it would think one character had already met another character when they hadn't. It's really not efficient at handling large contexts. And some model developers, in their RAG guidelines, say something like: for anything beyond X, try to use RAG; if it's shorter than that, you can dump everything into the context. But it really depends on the use case, and I think people really need to test the context efficiency for their application.
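Stripped to its core, the RAG loop Chip describes is just retrieve-then-prompt. The sketch below uses naive keyword overlap for scoring purely to keep it self-contained; real systems use embeddings and a vector index:

```python
# Minimal RAG: fetch the most relevant chunks, put only those in the context.
documents = [
    "The roof was replaced in 2019 and passed inspection.",
    "The lot size is 5,200 square feet.",
    "A noise complaint about a neighbor's dog was filed in 2021.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q_words = set(query.lower().replace("?", "").split())
    def score(doc: str) -> int:
        return len(q_words & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:k]

query = "How big is the lot?"
context = "\n".join(retrieve(query, documents))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# The prompt stays small even when the full document set (all of Wikipedia,
# an entire enterprise wiki, ...) would never fit in the context window.
print(prompt)
```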
All right, so as promised, let's close with everyone's favorite topic of the day: AI agents. I guess let's define the term first of all, because everyone seems to be using different definitions. What is an AI agent for you? That feels like a trap question — you're like, okay, everyone disagrees on the term, now tell me what the term means. By the way, Matt, do you invest in any agent companies? Of course — I'm a VC, I have to. The two things I need to do as a VC are: one, have a podcast; two, invest in AI agents. Those are the rules. So, a lot of the early reviewers of the book had very, very extreme, polarized opinions on agents. So I decided: okay, "agent" is not a new term — it's not something no one has ever heard of; it's been used in AI for a very, very long time — so let's just go back to the basics. I took some textbooks from the 80s and 90s and looked at how people defined agents, and that actually makes things a lot easier. I based the definition in my book on Peter Norvig and Stuart Russell's book from the 90s — it's a really good book. Basically, they define an agent as anything that can perceive its environment and act on that environment.
So it has two components, like the sensors to get information from the environment, and then it can perform actions in that environment. So what does it mean in the context of AI power agent? Power agents already can interact with the environment. For example, if you ask, hey, search the internet, that means it's retrieving the information from the internet. And if you say, hey, send an email, then that is mean like, acting on the environment by sending out information.
It's defined first by the environment it's operating. So let's say that ChatGPT operates the internet, then it's like the environment is the internet. If you have an agent, in Gmail. then gmail is that environment if you have a coding assistant agent then maybe like whatever the coding editors that you use like vs code or whatever or like terminal is is the environment like uh so so so like it's characterized by the end
environment it operates in. And then given the user's task, this agent leveraged the tools. it has access you.
to perform the task and how does it do it it will need some kind of like a brain that determine like hey what task is this what tool should i use how do i evoke the executions and how do you to me that this task has been accomplished so in terms of the ai powered agent that brand is LM like it's a model so we have like the GPT-4 power agent or like cloud powered agent so if you think about like AI power agents you basically think about like
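To make that structure concrete, here is a minimal sketch of the loop Chip describes: tools that perceive and act on the environment, and an LLM "brain" that decides what to do next. All names here (`llm_decide`, the two tools) are illustrative placeholders, not any particular framework's API.

```python
def search_web(query):
    # Tool that perceives the environment: pulls information in (placeholder).
    return f"search results for {query!r}"

def send_email(to, body):
    # Tool that acts on the environment: pushes information out (placeholder).
    return f"email sent to {to}"

TOOLS = {"search_web": search_web, "send_email": send_email}

def run_agent(task, llm_decide, max_steps=5):
    # llm_decide stands in for a model call that returns either
    # {"tool": "...", "args": {...}} or {"done": "final answer"}.
    observations = []
    for _ in range(max_steps):
        decision = llm_decide(task, observations)
        if "done" in decision:
            return decision["done"]
        result = TOOLS[decision["tool"]](**decision["args"])
        observations.append(result)
    return "stopped: step budget exhausted"
```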
And then there's a central concept to agents, which is planning. How do you think about it, and what is specific about planning from a model perspective?
Planning is very, very, very hard. So we just talked about how an agent has two big components we need to work on: tool use, the set of tools the agent has access to, and planning, how to use that set of tools efficiently to solve the given task.
For tool use, we actually already had function calling in the early days. Tool use means invoking a tool by calling a function and supplying the parameters to run that function. So function calling we understand pretty well by now.
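In its simplest form, function calling looks something like the sketch below: the developer describes the available functions, the model replies with which function to call and with what parameters, and application code parses and executes the call. The schema layout and the JSON reply format are generic stand-ins here, not a specific provider's API.

```python
import json

def get_weather(city):
    # Example tool implementation (stand-in for a real lookup).
    return f"Sunny in {city}"

# Descriptions shown to the model so it knows what it can call.
TOOL_SCHEMAS = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {"city": {"type": "string", "description": "City name"}},
}]

TOOL_IMPLS = {"get_weather": get_weather}

def execute_tool_call(model_output):
    # Assume the model was prompted to reply with JSON such as
    # {"name": "get_weather", "arguments": {"city": "Paris"}}.
    call = json.loads(model_output)
    return TOOL_IMPLS[call["name"]](**call["arguments"])

# execute_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}')
# -> "Sunny in Paris"
```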
But planning, how to use those tools effectively, how to come up with a roadmap, an outline of a plan to solve the task, is really, really hard. Even something very simple can require a lot of reasoning steps and a sequence of actions. Planning is also not a new problem; maybe what is new is using LLMs to do the planning, but planning itself is not new. At its core, you can think of planning as a search problem. What that means is: given a task, you have a goal over there, and there are many, many different paths toward that goal. Maybe you need to turn left first, maybe turn right first; there are many different ways. So planning is basically searching through all the possible paths and choosing the best one. Or, if there is no path, you need to determine that there's no possible way to solve the task.
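The "planning as search" framing can be made concrete with a classical, LLM-free search over action sequences: expand states, stop when the goal is reached, and report failure if no path exists. A minimal breadth-first sketch:

```python
from collections import deque

def plan(start, goal, actions):
    # Breadth-first search for a sequence of actions from start to goal.
    # `actions` maps a state to a list of (action_name, next_state) pairs.
    # Returns the list of action names, or None if no path exists.
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action_name, next_state in actions.get(state, []):
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, path + [action_name]))
    return None  # no possible way to reach the goal

# plan("home", "office",
#      {"home": [("turn_left", "park"), ("turn_right", "street")],
#       "street": [("go_straight", "office")]})
# -> ["turn_right", "go_straight"]
```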
So planning is a search problem, and many AI textbooks, including the book I mentioned from Norvig and Russell, have a huge section on planning. So what is challenging about using LLMs for planning? I could go on for many minutes here, but I think the key point is that to plan, you need to be able to understand, not just generate: not just say what you would do next, but predict what the outcome of that action would be. Does that make sense? For me, it's really hard to decide whether to do A or B if I don't know what the expected outcome of each is. So I do think there are certain techniques that can make planning better. For example, before the agent takes an action, have it predict: if I do this action, what would happen? If turning left takes you off a cliff, you probably don't want to do it, right? Planning is incredibly hard.
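That look-before-you-leap idea can be sketched as a simple lookahead loop: before committing to an action, ask a predictor (which in an LLM agent would itself be a model call) what the outcome would be, and skip actions predicted to end badly. The `predict_outcome`, `score`, and `is_bad` callables here are illustrative placeholders:

```python
def choose_action(state, candidate_actions, predict_outcome, score, is_bad):
    # Pick the candidate action whose predicted outcome scores best,
    # skipping any action whose predicted outcome looks bad.
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        outcome = predict_outcome(state, action)  # in an LLM agent, a model call
        if is_bad(outcome):
            continue  # e.g. "turn left" predicted to go off a cliff
        s = score(outcome)
        if s > best_score:
            best_action, best_score = action, s
    return best_action  # None if every candidate looks bad
```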
And in my book I have a very long section on planning. I actually made that section public on my website, so if what I'm saying is a little bit too abstract, go read the blog post; it's free and it's online. Basically, the question is: can LLMs plan? Is there any fundamental reason why LLMs cannot plan? Because we have people on one side of the spectrum, like Yann LeCun from Meta, who is an incredible scientist, saying that autoregressive LLMs cannot plan. And I kind of disagree. I think maybe we just don't give LLMs enough tools to plan effectively, or maybe the models just aren't good enough yet, and stronger models will become better at planning. So that section discusses that, and it also discusses different tips for getting a model to plan more effectively. For example, to plan well, you need to understand the set of tools you have access to pretty well. If the tools are confusing to use, it's very hard for the model to plan with them. So you need to look at the tool set the model has, and maybe rename tools to make them more understandable, write better documentation, or break a complex tool into simpler ones.
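In practice this often comes down to how tools are named and documented when they are shown to the model. A small before-and-after sketch; the schema format is illustrative, not tied to any particular framework:

```python
# Hard for a model to plan with: vague name, no description, ambiguous parameter.
confusing_tool = {
    "name": "proc1",
    "parameters": {"x": {"type": "string"}},
}

# Easier to plan with: descriptive names, documentation, clearly named parameters,
# and the catch-all tool split into simpler single-purpose ones.
clear_tools = [
    {
        "name": "lookup_order_status",
        "description": "Return the shipping status for an existing order.",
        "parameters": {"order_id": {"type": "string", "description": "Order ID, e.g. 'A-1234'"}},
    },
    {
        "name": "cancel_order",
        "description": "Cancel an order that has not shipped yet.",
        "parameters": {"order_id": {"type": "string", "description": "Order ID, e.g. 'A-1234'"}},
    },
]
```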
One thing I want to say about planning that is not obvious, and that I find quite fascinating, is that it's actually hard for us to create data to train models to become better at planning. The reason is that when we ask humans to generate what they consider the best plan of actions for a task, it's often not the best plan for AI, because what is easy or efficient for humans is not the same as what is easy or efficient for AI. For example, browsing a thousand websites and summarizing them would be really boring, slow, and tedious for a human; as a human, I probably couldn't do it. Actually, I did do it at one point, when I tracked about a thousand repos, but it was a very tedious and painful task. For AI, it's actually really easy: it can browse a thousand websites and generate summaries at the same time. So one challenge in generating data to train models for better planning is that we can't quite rely on human labelers or annotators to do it. So there's a whole school of thought on how to generate good plans for AI to learn from, so models can get better at planning.
All right, it's been wonderful. There are more chapters, more topics. We could talk about dataset engineering, we could talk about inference optimization and all the rest. But hopefully that gave listeners a good flavor for what you discuss in this book, which, again, is amazing, and which I truly enjoyed and fully recommend to anyone interested in the topic of AI in general and AI engineering in particular.
So the book is available in electronic format and is starting to ship in physical copies; my physical copy is going to arrive soon, I'm told. There is also a GitHub repo associated with the book. Is that right?
Yeah. In the process of writing the book, I went through so many resources; I think the book itself references over a thousand links, and I personally went through a lot more. A lot of them were very helpful, so the GitHub repo has about a hundred of the resources I found most helpful while writing the book. If you go through those resources directly, I think there's a lot of great learning there, from everyone. The repo also has the table of contents, summaries of each chapter, some prompt examples, and things like that. This is not a tutorial book, so there are no coding examples. I hope that's a good thing, because I felt like all the frameworks today change so fast that any coding example using any of them would go out of date pretty quickly.
Wonderful. Thank you so much. I think this book is going to be a major hit.
Really appreciate you coming on the pod, telling us all about it, and sharing some of the key insights. Really enjoyed it.
Thank you so much. Thank you.
Hi, it's Matt Turk again. Thanks for listening to this episode of The Mad Podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing, if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode on. It helps us build the podcast and get great guests. Thanks, and see you at the next episode.