HockeyStick #2 - LLMs in production - Chris Brousseau & Matt Sharp

Apr 08, 2024 · 1 hr 39 min · Season 1 · Ep. 2

Episode description

Decoding the Past, Present, and Future of Language Models

Delve into the realm of language models with a comprehensive exploration spanning from the foundational Bag of Words approach to the revolutionary technologies of Transformers and GPT. This script not only unpacks the technical evolution and mathematical underpinnings of natural language processing but also projects the future trajectory of these models. It highlights expert insights on the societal impacts, the convergence of artificial intelligence with human cognition, and the ethical considerations of AI progression. Moreover, the discussion extends to the significance of open-source efforts in shaping this dynamic field. Aiming to provide a profound understanding, this guide navigates through the complex landscape of AI, language models, and their implications on future technology and society.

0:00 Welcome to HockeyStick: Unveiling the Power of LLMs

01:08 Meet the Experts: From Meetups to Authorship

03:16 The Hockey Stick Moment for LLMs: Breakthroughs and Realizations

07:48 Coding with LLMs: The New Frontier for Developers

15:39 The Pitfalls and Limitations of LLMs in Practice

21:43 Building vs. Buying LLMs: Navigating the Trade-offs

32:43 The Cost of Crafting Your Own LLM: Insights and Advice

42:48 Deciphering LLMs: A Crash Course in Language Features

50:44 Defining Language: A Philosophical Dive

51:33 Exploring the Essence of Language and Communication

54:31 Diving into Language Models and Their Evolution

55:08 From Bag of Words to N-Grams: The Evolution of Language Understanding

58:35 The Leap to Bayesian Techniques and Markov Chains

01:01:24 The Breakthrough of Continuous Bag of Words and Embeddings

01:09:43 Unveiling the Power of Multilayer Perceptrons

01:15:08 The Revolution of Attention Mechanisms and Transformers

01:26:37 The Hall of Fame: Landmark Models in the LLM Landscape

01:35:06 Predicting the Future of Language Models and OpenAI's Position

01:38:48 Concluding Thoughts and the Future of AI Research

Transcript

Welcome to HockeyStick: Unveiling the Power of LLMs

I'm Miko Pawlikowski, and this is HockeyStick. LLMs, or Large Language Models, are taking the world by storm. This breakthrough artificial intelligence technology promises to fundamentally reshape the way we work with computers. Over the last year, we've witnessed its Hockey Stick moment, and as of early 2024, we're firmly in the Cambrian explosion phase. Today, we're taking a deep dive into how these models came from humble beginnings to making people scared of imminent Skynet.

I'm joined by two experts, Chris Brousseau, staff machine learning engineer at JP Morgan, and Matthew Sharp, MLOps engineer at LTK, the authors of "Production LLMs" currently available in early access at manning.com. In this conversation, we'll cover the intricacies of human language and how machines can understand it.

We'll give you the vocab to sound smart at the next family gathering and discuss the various mathematical ideas and models ultimately leading to LLMs, as well as some noteworthy examples beyond ChatGPT. Welcome to this episode and please enjoy.

Meet the Experts: From Meetups to Authorship

Where should we start? How did you guys meet?

We happen to both live in Utah, and we actually met at a meetup. It was an MLOps meetup, that was the primary one where we met. It happens once a month and we'd get together, and so that's our origin story. We became friends through there and started helping each other with content creation. Chris was starting a YouTube channel, I write on LinkedIn, so we were just giving each other feedback and helping each other out.

It was especially helpful because I was trying to figure out how best to present a lot of the material that's in our book now. How do you explain a transformer model? And Matt was fantastic about helping me find my voice on YouTube.

Okay, so going from meeting someone at a meetup to committing to spending a couple of years working on a book with someone, that's a little bit of a difference. Was there any particular moment where it just clicked? "Oh, we need to write a book."

How did you come up with the idea?

I was approached, and I would love to write a book, but I don't know a lot about that process. And obviously, I didn't really have an authorship voice. I am not experienced in content creation. And while I was going through the process of talking with some different publishers, Matt approached me and said, "Hey, I was a technical reviewer on Fundamentals of Data Engineering by Joe Reis and Matt Housley."

And so he had experience and he had subject matter expertise, and he was giving me some advice, and I said, "You know what, why don't you just come on as a coauthor? You obviously could help a lot here, and I need it, so let's just do it together."

Yeah, I think that it worked out really well because Chris has that background in linguistics. He understands the natural language processing side better than anyone else I've met in person, and I was coming more from the MLOps side: how do we actually deploy these things? And so I think it's really rounded out our book better than anything else I'm seeing out there that you could buy and read. Getting that diverse perspective, I think, really helps our book out.

The Hockey Stick Moment for LLMs: Breakthroughs and Realizations

I was very excited when you said 'yes' to coming onto this, because since last year, I think in most people's minds sometime early last year with ChatGPT, all of a sudden everybody started talking about large language models, and some people started worrying about impending doom and robot apocalypse and all of that. But from the perspective of someone who's worked with that for the best part of a decade now, I'm wondering: what was the point when you realized that these LLMs are really onto something, and they're moving from a demo to an actual legitimate technology that's going to change things? What was the hockey stick moment for LLMs?

Oh, boy. For me, without a doubt, that was the release of T5. Looking at Google's paper about the text-to-text transformer, that really set the groundwork for prompting, right?

They had a whole bunch of different tasks where you didn't have to change anything other than some statement for the model to do that task, and then a colon, and then whatever your input was going to be anyway. That was groundbreaking to me. I had been messing around with GPT-2. I'd been playing with that and trying to shoehorn it into a product where I was working.

T5 did everything that we were trying to do with GPT-2, and it was incredibly flexible, it was easy to fine-tune, and for me, that was the hockey stick moment of "oh wow, no, they're really cooking".

When is that, for anybody who hasn't heard of T5?

I think it was 2019. Yeah, "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" was October 2019. It came out in October. I think I picked it up in November-December of 2019.
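The "statement, then a colon, then your input" format described above can be sketched in a few lines. This is only an illustration of the input format: the task statements are ones shown as examples in the T5 paper, while the helper name `to_text_to_text` and the sample sentences are assumptions made up for this sketch, and no model is called.

```python
# Hypothetical helper: formats an input the text-to-text way.
def to_text_to_text(task_statement: str, text: str) -> str:
    """Return '<task statement>: <input>' as a single string."""
    return f"{task_statement}: {text}"


# The same input pipeline serves different tasks; only the statement changes.
examples = [
    to_text_to_text("translate English to German", "That is good."),
    to_text_to_text("summarize", "The meetup happens once a month in Utah."),
    to_text_to_text("cola sentence", "The course is jumping well."),
]

for line in examples:
    print(line)
```

Switching tasks means changing the leading statement, not the model or the code around it.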

Yeah, I think for my hockey stick moment, I was in the industry and had been paying attention, obviously, with GPT-2 coming around, T5, etc. But I wasn't really seeing the adoption that someone who's working in MLOps cares more about. I was seeing that these models can do really cool things, but people weren't caring about them. Sam Altman even said something like, "We didn't think GPT-3 would be that big of a success. We thought that would happen once GPT-4 came out."

But I just remember January 2023. ChatGPT's been out a month. It's still essentially in beta. They just released it to get feedback and to start collecting data to start improving their model. But it blew up, right? I just remember being at a church function and this guy sitting across the table from me, who has no idea about anything in AI, right? I was stuck at this table for an hour and all he could talk about was GPT-3. He was obsessed with it. I'm like, oh, wow.

Even people who don't know anything about machine learning or AI or the industry were really going gung ho. And his wife was an English teacher. She was really scared of it and was like, "How are we gonna help kids learn how to write and read when they can just go online and now cheat and write these things and stuff?"

That was the very beginning of what everyone's had conversations about now. But he talked about how his brother-in-law owned a website that made fake articles, you can think like The Onion. And so once it came out, in that month (like I said, ChatGPT still wasn't a product yet, and anyone who's been following it knows a lot of those demos just shut down and then never came back up), his brother-in-law ended up firing like a hundred writers because he's like, "Oh, ChatGPT can make these funny fake articles and we're good, right?" That was my hockey stick moment of "okay, we really are changing when some random guy at church is talking about it all the time".

Yeah, I love that example. But even for people who are in tech who weren't directly following that very closely, that was a scary moment. I remember when I first used Copilot, I was like, what, it just does that? And three out of four times, it would actually work. That was a scary moment.

It reverberated through a lot of levels of society, including, our own.

Coding with LLMs: The New Frontier for Developers

And I think in many ways, technology and writing code might be the easiest use case for these kinds of models, right? Do you agree with that?

I don't know if I completely agree with it, because code is incredibly syntactically dependent, right? Every developer who's worked with JavaScript or C++ and then moves to Python, they feel it, right? That's one of the biggest complaints: "I hate Python syntax", "I hate that white space matters". It's a little bit more complex than just repeating whatever natural language happened, but you're absolutely right that it is one of the best use cases so far.

Is that because it's better structured than just spoken language, or are there any other reasons that make it so well suited for that particular application?

Programming languages are not real languages, right?

One of the things that makes it simultaneously very well- and ill-suited for it is how much gets repeated. You use the exact same words, the exact same tokens, to define every function that you make, but then the function's name can be whatever you want. And so using the exact same tokens is awesome. That provides landmarks for the probability as it's going through all of this.

But then that input to just say whatever you want and put it in camel case or snake case or whatever, tons of different formatting for functions, it makes it a little bit more difficult, especially while you're trying to tokenize that.

One of the big benefits with code is the amount of data we have around code. Lots of people are writing code. They all have very similar ideas of what they're trying to do, of what they're trying to architect, of what they're trying to design.
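The remark about camel case and snake case can be made concrete with a small sketch. It assumes a simple hand-written rule for splitting identifiers into word pieces; real LLM tokenizers use learned sub-word vocabularies rather than a rule like this, and the function name `split_identifier` is made up for illustration.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split a snake_case or camelCase identifier into lowercase word pieces."""
    pieces = []
    for chunk in name.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        pieces.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [piece.lower() for piece in pieces]


# Two surface forms of the same name reduce to the same pieces.
print(split_identifier("loadUserData"))
print(split_identifier("load_user_data"))
print(split_identifier("parseHTTPResponse"))
```

The keyword that defines a function always looks the same, while the name after it may arrive in several spellings that a tokenizer has to cope with.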

And so we're not necessarily worrying about hallucinations or fake news or people disagreeing or other things like that. There's just a lot of data that all agrees with each other and pushes in the same direction. It makes it good. There are obviously some negatives of just assuming some of these LLMs writing code are going to do things well, but I think Chris highlighted that already.

It's actually really similar to how regular languages work.

If we have more Python data, like Matt's saying, it's going to do better at Python.

And that can create a little bit of a positive feedback loop with LLMs, where a lot of people want to get into Python, and they're very good at it. But then when you look at emerging languages like Mojo, for example, it's really difficult to find that data, and so LLMs are worse at it, similar to natural languages that have a lower number of speakers, a lower presence on the internet.

So is the solution to use an LLM to generate a lot of Mojo and make it a significant percentage of GitHub?

That'd be fun, dude. I think there are some problems with synthetic data that can lead to stuff like model collapse. I don't know if we're going to see that in the code space, though. I think we could see that in natural language. So that might be a valid solution.

Okay. The date is 13 February, the day before Valentine's Day 2024. I'm going to ask you for a wild prediction. Where do you see that going?

Should all kinds of programmers, or maybe any subset of programmers who produce code as a job, at least start worrying? Is that something that's going to decrease the pool of available jobs?

No, I don't think it's really going to impact the amount of work.

I just think about my job, and even when I'm in very technical roles, and I'm spending 50% of my time on the keyboard, still, it feels like a majority of the work is still just communicating with stakeholders, understanding exactly what the problems are, technical writing, design docs, really understanding at a high level, what you want to build. To be fair, programmers have been automating the 'writing the code' portion forever, right? From the beginning.

Yeah, with massive amounts of scripts and configs that they use. And that's why they love Vim or Emacs still, right? It's because they have it configured just right. And they can move really quickly, because it provides a lot of that automation for them already. But this is just helping junior engineers already have all that configuration and setup really quickly, right?

It mostly will just make our jobs a little bit easier. It doesn't remove the need to really understand the engineering aspect, the architecture aspect, the design aspect that still is involved with coding.

Oh, yeah. This is why we love comparing LLMs to the printing press, Johannes Gutenberg's. Because did that destroy the writing industry? All it did was destroy the monopoly that certain organizations had on publishing books.

Before, you had to get a scribe, and you had to pay the scribe, and you had to have access to scribes. But you couldn't just walk up to a printing press and hit it and then boom, you have a book. You have to have knowledge. You have to have an idea. The printing press just gives you a lower barrier to entry, which is what we love, right? For coding, I think Matt is exactly right that it's a lower barrier to entry for junior engineers to be able to produce significantly better work.

And in some ways it actually accelerates it, because when you copy and paste what an LLM gave you and it doesn't work, you have to go figure it out, right? Along with the junior engineers, it also helps speed up senior engineers and staff engineers and principal engineers. It's good, and it lowers the barrier for the entire industry. We like that.

Yeah. I've lately been spending lots of time writing chapter 10 of our book, and in chapter 10 we actually go through a project where we help you build your own copilot, and we build the VS Code extension to get it in. If you want to be running your own LLM on your own computer with your own data, so that way you can get your own things, we walk through all the steps to do that. And in some aspects it's interesting, because sometimes adding an extra feature made the model work, right?

There's still just so much to learn about it. Ultimately, it comes down to your data, right? How good your coding data is, is really how well the copilot works, right?

SQL is one of the most repetitive of all of the programming languages, but true skill with SQL does not involve being good at SQL. It involves knowing the data, right?

It's knowing which tables to query, how to merge them, how window functions work, all of that stuff. Knowing exactly what you need to be looking at is the true skill in SQL. And we're hopefully getting to a point where we can help the model know the data, right? We can give it some sort of context for the data that it's going to be looking at, so that it can generate good SQL.

That's a really good point. I've actually had lots of mentees who are trying to learn SQL for the first time.

I said, "just use ChatGPT". Generating SQL is actually something that it's really good at. You don't need GPT-4; even GPT-3, even GPT-2, it's not hard to generate really good SQL syntax, 'cause it's so simple, it follows a very similar structure. But ultimately, you can have it write the SQL, but you're going to have to go back and figure out how to connect all the pieces and understand your database and understand your data.

That's a perfect example. Understanding how to write the code is only half the problem.
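The idea of giving the model "some sort of context for the data" can be sketched by reading the schema out of a database and placing it in the prompt. This is a minimal sketch under stated assumptions: it uses an in-memory SQLite database, the table names and the prompt wording are invented for illustration, and it only builds the prompt text without calling any model.

```python
import sqlite3


def schema_context(conn: sqlite3.Connection) -> str:
    """Collect the CREATE TABLE statements so a prompt can describe the data."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def build_sql_prompt(conn: sqlite3.Connection, question: str) -> str:
    """Hypothetical prompt layout: schema first, then the question."""
    return (
        "Given this schema:\n"
        f"{schema_context(conn)}\n"
        f"Write a SQL query that answers: {question}"
    )


# Illustrative tables only; a real database would supply its own schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

prompt = build_sql_prompt(conn, "What is the total spend per customer?")
print(prompt)
```

Even with the schema in the prompt, someone still has to check that the generated query joins the right tables, which is the "knowing the data" part.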

The Pitfalls and Limitations of LLMs in Practice

Understanding how to integrate it is really the bigger problem.

What's the most terrible use case that people are currently trying to use LLMs for? What do LLMs, in general, suck at the most?

I'm going to say they suck at sequence prediction, which sounds so off.

Because that's what they're made for. But one of the things that I'm seeing people do is try to automate entire workflows with LLMs. They're trying to get the LLM to just do the whole workflow, and they suck at that. They need all of this stuff to help them.

They need tools, they need RAG, they need specific fine-tuning landmarks, they need few-shot prompting, they need all sorts of stuff to make it work, and then it's still up in the air whether or not it will do the right task in the right order.

Yeah, I was thinking, I don't know how much I'm seeing this.

But three months, six months ago, I was hearing a hundred horror stories about, essentially, CEOs being like, "we need LLMs", like they're magic, they can do anything. And so it didn't matter what the problem was: "oh, we need to do outlier detection using LLMs". No, use stats for that. Yeah, outlier detection is really a statistical problem. It's really a data and math problem. LLMs are good at natural language.

And so when we can solve a problem using words and communication, that's when LLMs can get in. But for problems like outlier detection or weather prediction or these other things, we have algorithms. Stock market prediction, Super Bowl prediction, all these things, we have better ways to make predictions. And it's called math, right? Fourier transforms, other machine learning algorithms, other things like that.
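As an illustration of "use stats for that", here is a minimal outlier check that needs no language model. The speakers do not name a method; this sketch assumes the modified z-score based on the median absolute deviation, with the conventional 0.6745 scaling and 3.5 cutoff, and the readings are made-up sample data.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 95]  # made-up sample data
print(mad_outliers(readings))
```

A median-based score is used here because a single extreme value inflates the mean and standard deviation enough to hide itself in a small sample.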

LLMs are not good at doing those things, 'cause we don't talk about them in natural language. We've invented other languages like math just to describe them, and that's why they're not good.

We can make tools. You can build functions for an LLM to use to do Fourier transforms and whatever else, right? But getting the LLM to know that it needs to do that is really difficult.

Probably just as difficult as explaining what the Fourier transform is to an LLM within your training data to get it to be able to replicate it. This is one thing that makes it almost miraculous when stuff does work, and that's that feeling that we're chasing right now, and that's the replicability that we're trying to help people get to in a book.
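The "build functions for an LLM to use" idea can be sketched as a tool registry plus a dispatcher. Everything here is an assumption for illustration: the registry, the request format, and the function names are hypothetical, the request is hard-coded rather than produced by a model, and the transform is a naive discrete Fourier transform rather than an optimized FFT.

```python
import cmath


def dft_magnitudes(samples):
    """Naive discrete Fourier transform; returns each bin's magnitude."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n)
    ]


# Hypothetical registry of functions the model is allowed to request.
TOOLS = {"dft_magnitudes": dft_magnitudes}


def run_tool_call(call: dict):
    """Dispatch a structured request of the form {'tool': ..., 'args': {...}}."""
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise ValueError(f"unknown tool: {call['tool']}")
    return tool(**call["args"])


# Stand-in for model output; no real LLM is involved in this sketch.
request = {"tool": "dft_magnitudes", "args": {"samples": [0.0, 1.0, 0.0, -1.0]}}
magnitudes = run_tool_call(request)
print([round(m, 6) for m in magnitudes])
```

The dispatcher is the easy part. The hard part described above is getting the model to emit the right request at the right moment.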

How do you actually do it, and how do you make sure that your scope is small enough that it will work repeatedly and you can build a product off of it? That's difficult.

I'm a big fan of chess. And since ChatGPT came out, lots of people have been making memes, or just like, "Hey, I'll play ChatGPT in chess", and ChatGPT can play chess because we can talk about it in language, right? Like e4, move the pawn, or knight to g6, whatever it is. We have language for it, but ChatGPT has no idea. It has no idea of the model behind those letter-number combinations. All it knows is that there are certain things it can do, right? It writes words, and so when they do this in these videos or memes, they just let ChatGPT do whatever it says, right?

It just magically creates a knight out of nowhere, and magically will take its own pieces as it moves its pieces around. It's always pretty funny. And even though it's cheating the entire way, it almost always loses, right? 'Cause it doesn't have an understanding of chess, it doesn't have that model underneath it. Sure, we can talk about it in language, but not really, right? So we still have better ways to play chess: AlphaZero, et cetera.

Stockfish, there are engines out there that play chess really well. And we don't need to make LLMs good at chess, but that's a very good example of one of the things it's not good at.

I've seen someone on Twitter who said, "I'm gonna give an LLM $1000 or whatever initial amount, and I'm gonna ask it how to best invest it." I didn't follow where it went. But I think a lot of people had the same idea: this is some kind of genius system.

I'm just gonna be its flesh and bones agent in the real world and hope for the best. So I think that kind of goes back to your chess thing. So excuse me for that, but I have to ask you about AGI, Artificial General Intelligence. Any chance of that happening anytime soon? What's your prediction?

Not with our current systems. No, I don't think AGI is ever going to come out of quadratic equations, like not a single chance.

Maybe if there are better drop-in sub-quadratic replacements, stuff like Hyena. I've tested that out. I think it's really cool. But the fact that attention, the query-key-value attention, ultimately generates complex numbers, I think that is a little too much for AGI at the moment.

So you're not one of those people who secretly hope that OpenAI has something they're gonna release soon?

I don't think they have it, right? I'll be hopeful, sure. If it comes out, that's great.
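The "quadratic" cost mentioned above is usually understood as the number of query-key scores growing with the square of the sequence length. The sketch below assumes standard scaled dot-product attention written with plain Python lists; the function name and the toy inputs are made up for illustration, and it counts scores instead of measuring time.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention; also returns how many scores were computed."""
    d = len(keys[0])
    outputs, score_count = [], 0
    for q in queries:
        # One score per (query, key) pair: this is the quadratic part.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        score_count += len(scores)
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs, score_count


for n in (4, 8, 16):
    tokens = [[float(i), 1.0] for i in range(n)]  # toy vectors
    _, count = attention(tokens, tokens, tokens)
    print(n, count)
```

Doubling the sequence length quadruples the number of scores, which is the cost that sub-quadratic replacements aim to avoid.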

Yeah, I'm of the same mind as Chris. I hope they keep pursuing it. We've gotten major breakthroughs from what they pursued. It's very possible AGI will happen in my lifetime, I'm still pretty young. We keep on making advances really quickly, but are we relatively close to it? Probably not. No.

Oh, the thing about progress, though, is that it's very rarely linear. It tends to have a very weird curve. So that's why all the predictions are so funny, but hey, I had to ask you anyway.

Building vs. Buying LLMs: Navigating the Trade-offs

No, I think it's a great question.

Okay, let's delve a little bit into a portion of your book. It's basically describing the two options that you have today. You can either go and pay some money to OpenAI, maybe Google, or somebody else, or you can build. So you've got buy versus build. Could you talk to me a little bit about how someone would decide about this as of February 13, 2024? What are the things to consider, and what are the weights that you would put in, and biases?

The basic consideration is just your use case, right? If you just want to test something out, you're a student and you don't have a lot of budget, and you want something up and running so that you have LLM experience, I would say just shell out for that. ChatGPT Plus, or buy Anthropic, or Google Bard has a fantastic API, or I guess Gemini now. Just do it. It's not that big of a thing.

If your product that you're trying to ship is inconsequential and you don't need it to be right every time, you just want to sprinkle the AI pixie dust on it, just buy it.

If your use case goes deeper than that, though, if you want to be able to build your own, if you need to make sure that it says the right things all the time, if you need it to behave a little bit more deterministically: there have been probably a thousand case studies in the last year of people building products on top of ChatGPT and then OpenAI rolling out an update that changes how ChatGPT behaves, and they don't have any way to measure all of the different ways that it will change it, right? There are 176 billion parameters in GPT-3 alone. They don't know it's going to break your program down the line. They're just going to update it for what they consider to be better. And those programs break constantly. That doesn't mean you can't fix them. It's just a much bigger problem of maintenance than I think a lot of people are expecting going into it. So if you want to have to maintain it less, build your own.

Yeah, I think the other aspect is you want that control, right? There are lots of examples of companies who essentially built a small shell around ChatGPT that did something unique. And then, months down the line, now ChatGPT just does that out of the gate, right? Their value proposition just completely disappeared. And that's because they didn't have control over the model. They didn't have control over what it did. It's just interesting, right?

Because I say these things and things have changed over time. But when ChatGPT first came out, it was free, it was a demo, and they were specifically doing it to collect data. And that's what they did: they used the collected data to improve their models. And that's what they continued to do for a while, right? Oh no, they're back. It's in the terms of service, right? If you want them to save your chat, so that you can return to it and ask more questions, they get to train off of your data.

So if you want to put anything private or sensitive in there, like it's over, you've just leaked it.

They're back and forth about what data they're collecting and what data they're not collecting, and if you're an enterprise customer, maybe you can make certain rules and things like that, and oftentimes they won't. It's a minefield for how people are using it, and so it's just something important to take into consideration. If your LLM is doing something magical that's really core to your business, that is really driving customers, you want to control that.

You want to make sure that the model is working exactly as intended. You're not getting updates randomly, that break your application. You're also controlling the data flow, you're making sure that you're not accidentally training your competitor's model, and other things like that. And there's just lots of aspects where it's just important to make sure that you own it. And, no, that's not necessarily everyone's concern, right?

If you're a student or you're just doing some side project or anything, there are lots of APIs out there that are very cheap that can get you up and running. There are literally hundreds of Hugging Face Spaces that are free APIs that have LLMs running behind them, and you can just hit them whenever you want, right?

Unless you're queuing behind a thousand other people.

Yeah, exactly.

I liked the example you gave in the book. I think people at Latitude, the Dungeons & Dragons people, would agree with a lot of what you're saying now, but can you tell the story of what happened with them?

Latitude is a local company that was here in Utah. It was put together by two guys from BYU. GPT-2 came out several years ago. They're like, "Oh, this is mind-boggling. Let's build a game off of it!" And what they came up with was like a dungeon crawler, a text-based game. It was really neat, because it would just generate an infinite amount of opportunities. And so it created this 'choose your own adventure'. It got relatively big in the space, and lots of people enjoyed playing it. Things were going really well, and then OpenAI's GPT-3 came out. They offered it to them: "Hey, we have this new model, it's a lot better, why don't you try it?"

They played around with it, and "oh yeah, this is much more descriptive, it's much more interesting, it's really great". There was a lot of excitement around it. However, it turned out that the model itself had a propensity to generate smut, and it got really concerning. People would write like, "I'm an eight year old girl", and then the model would complete it saying "...and I'm wearing a skimpy outfit". And oh, whoa, the player didn't want that, but the model generated it.

There became this big feud between OpenAI and Latitude about creating filters. "Hey, we don't want your players doing that. We don't like that." And Latitude's like, "okay, we'll create some filters" and things like that. And it devolved really quickly. Latitude, being very much a startup and not necessarily knowing everything they were doing, built a very shaky filtering system, and then OpenAI was like, "that's not good enough".

So then they started banning players, and so eventually we got to this territory where players, paying customers, would be playing a game, the model would randomly generate something that the filtering system didn't like, and then they would get banned. 'Cause it's like the game just did it itself. It was a very complicated time, and there was lots of back and forth between Latitude, who's a small company, and OpenAI.

There's lots of 'he said, they said' going on, but ultimately, it's just this position where Latitude had this game that was completely dependent on OpenAI's model to generate good output, and it really caused a lot of drama between the players and Latitude, with OpenAI in the background. And that is a critical example: the LLM was very critical to their business. If they owned it, then they could have controlled it. From the model aspect, they could have trained the model to make sure it didn't do any of those things. And then they would never need to play the little blame game, right? Nobody likes to play that game. That's: whose fault is it that the model is generating bad stuff? Is it the player who's prompting it? Is it Latitude, who has some systems for tokenizing and preparing player output before it goes to OpenAI? Is it OpenAI, because their model is generating that?

Is it Latitude, for post-processing the content from OpenAI before they serve it to the player? I don't even know if it really matters who's to blame. It's just a sucky game to play.

And that's like the ultimate example of why you might want to consider build versus buy. If you buy from any provider (we're picking on OpenAI here, because they're a big player, but you buy from Anthropic, you buy from the guys down the street, the startup that just barely came up and they're offering for half the price of whatever), buy from anybody, and you will eventually have to play that blame game.

We had another example in there of some lawyers who generated cases that didn't exist. They asked ChatGPT about cases and it came up with a perfect response. A little too perfect. It hallucinated stuff that didn't exist. And is it ChatGPT's fault? Is it OpenAI's fault for allowing their model to make stuff up and behave dishonestly? Or is it the lawyer's fault for not checking it? Who cares? The problem is that it's not locked down. It's non-deterministic.

Yeah, in a way, as I was reading the chapter on that, it made me think of using a machine to maybe do some farm work. Let's say that you're plowing a field and you're using a horse versus a machine, right? A machine might break, but in a predictable way. And if you've got a mechanic around, they'll come and fix it. A horse can get scared, or it has a bad day, or it can be moody. And it can come up with something new. So you always have to be careful with that.

Is that an accurate feeling for someone who's working with these LLMs day-to-day? You work with some kind of animal?

One of the most annoying things is even if you set the seed of it, so the random generator is going to be the same every single time, you can still give it the same prompt and get something different out. The truly awesome thing about LLMs is the number of non-linear activations that are going through the model, right?

It's creating incredible non-linear jumps throughout that dimensional space that the embeddings are in. You just can't really predict it. It is a little bit like an animal.

The fact that we can prompt engineer at all, it's a little bit telling of where we are, right? 'Cause with prompt engineering, you can change the spaces, the white space inside of your prompt, and it can end up giving you a completely different result.

We're still in a very interesting area, where we're trying to create better ways to communicate with the LLM and get predictable outputs. But the fact that we can do that at all is a bit of a miracle, right? You can't do that with a human. A human isn't going to be tricked into saying something different. Humans are tricked all the time, but not necessarily in the same way that we do with LLMs. It's a very interesting world we are in, and a lot of people are having

The Cost of Crafting Your Own LLM: Insights and Advice

that horse versus machine experience. Let's talk about the cost a little bit. You mentioned that it's super cheap to pay some big company to use their thing. Let's focus for a minute on the cost of actually building your own LLM. If I wanted to build one of these foundational models, let's say that I take one of those 75TB corpora from the internet and I'm feeling particularly GPU poor that day. How much money do I need to have in my little piggy bank to get something useful?

That's difficult, man. because you're either paying for a GPU, right? Or a suite of GPUs in order to parallelize it so that you can ingest that over a short period of time. Or technically with a lot of this stuff, you can load it onto a [Geforce] 3090, I've done this personally, you can train in FP16, you can train up to, about, 13 billion parameters pretty effectively, and pretty cheaply, on a 3090.

You have to be a little bit smart about your data loading. You have to make sure you're streaming stuff, and you have to pay for the data storage anyway. It's incredibly slow: you have to do gradient checkpointing, and you have to do gradient accumulation steps, which slow down the training even more. I trained a little bit bigger than that, it was about a 20 billion parameter model on my 3090, but what I don't generally talk about is that it took a year of just running to do that.
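Gradient accumulation is easiest to see on a toy problem. The sketch below is a pure-Python illustration, not the training code described in the episode: it uses a one-parameter linear model with made-up data and hypothetical function names, and shows that adding up gradients over small micro-batches gives the same averaged gradient as one large batch. That is why it lets you fit training into less memory, at the price of more sequential steps.

```python
# Toy illustration of gradient accumulation for the model y = w * x
# with squared error. All names and data are hypothetical.

def grad_sum(w, batch):
    """Sum of the squared-error gradients with respect to w over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated with w = 2
w = 0.0

# One large batch: all four examples are needed in memory at once.
full_grad = grad_sum(w, data) / len(data)

# Accumulation: process micro-batches of two examples, add their gradients,
# and only average (and update the weights) after all of them are seen.
micro_batch_size = 2
accumulated = 0.0
for start in range(0, len(data), micro_batch_size):
    micro_batch = data[start:start + micro_batch_size]
    accumulated += grad_sum(w, micro_batch)
accumulated_grad = accumulated / len(data)

print(full_grad, accumulated_grad)  # both values are the same gradient
```

In a real framework the same idea usually appears as several backward passes followed by a single optimizer step.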

It was horrendous, and that all culminated in a company giving me a cease and desist, so I couldn't even release it. So you're either paying a lot of money, hundreds of thousands of dollars, in order to get something quick. Especially with 75TB of text or more, grab your own data, get more data, and you're paying to store and to process all of that. And that costs tons of money.

Or you are not paying the money, but it takes a really long time and makes all of your shareholders really frustrated because you're ruining go to market. You're taking too long. You're not going to be the first in the space, It's a huge trade off as with many things, you can trade time or money, and training an LLM is very similar. I think they estimated, huge models that we see, like ChatGPT things. You're probably paying somewhere like what was it like a half million?

I think they say, and that's just for the training. We're not even talking about all the experts you have to pay for, the data curation, man. That's on the very far end, on the expensive side. It gets really expensive really quickly to train these models, just because of buying enough GPUs in order to parallelize this to do it within a reasonable time, and just the sheer volume of data you have to run through to train all the parameters.

It gets really expensive, but on the other end there are lots of good open source models that have done that main pre-training already. And so you can grab one of those and train it with something like LoRA, where you only need a handful of samples and maybe 10 minutes, if that, and you can train it on a very simple GPU and you have something fine-tuned for what you need. Getting it done for under $200 is very reasonable. $150, $20.
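Part of why LoRA (low-rank adaptation) is so cheap is that it leaves the original weight matrix frozen and trains only two small low-rank matrices that sit alongside it. A rough sketch of that arithmetic follows; the layer size and rank are illustrative assumptions, not figures from the episode.

```python
# Trainable-parameter arithmetic for LoRA on a single weight matrix.
# The dimensions and the rank below are illustrative assumptions.

def full_finetune_params(d_out, d_in):
    """Parameters updated when the whole d_out x d_in matrix is trained."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return d_out * rank + rank * d_in

d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (rank {rank}): {lora:,} parameters")
print(f"fraction trained: {lora / full:.4%}")
```

With these assumed sizes the adapter is 1/256 of the matrix it adapts, which is the kind of reduction that makes a small GPU and a short run plausible.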

It's very possible to train, these models with certain methods to get what you need. So does it mean that in a kind of natural, almost biological like evolution we're going to end up with few primary models that a lot of the different models branch off of, instead of, reinventing the wheel? That's where we're at currently. I hope that it doesn't stay that way, because I really enjoy seeing new people create new models for new use cases and all this stuff.

So I hope it doesn't stay that way, but I do see a lot of value in creating industry standards, at least around how you are actually writing the binary files. How are the weights actually being stored? What do the different layers look like? I think that standardizing what the model looks like so that you can load it as flexibly as possible is awesome.

I would like to see more open source models, which is funny considering there are thousands of open source fine tuned versions and hundreds of open source foundational models on the Hugging Face Hub right now. I want more, right? I'm greedy, man.

To me, it sounds like basically every week there is another one that's better at something, and if you look at the Hugging Face LLM leaderboard, it's changing by the hour, literally. It looks like a gold rush in many ways, but I like this gold rush much better than the crypto one a couple of years ago. Yeah, man, there's a lot higher chance that you'll come out of this gold rush with a great product than with the crypto one.

yeah, there's a lot there, and just to summarize that into one sentence, you can probably fine tune even a gigantic model for around $200 to $500. And you can go lower than that. Even if you are smart about how you're doing it, versus training from scratch, which either is going to take an inordinate amount of time or will cost thousands and thousands of dollars. So I'm willing to bet money that a lot of our listeners are going to pause this now and start Googling furiously.

How do I fine-tune a model? Where would you point them as a good starting point? Any particular paper, any particular company, anything that's a good place to start with that? A bit selfishly, I would say you should buy our book. We talk about probably the main ways to train in chapter 5 of our book. I was going to say that, but I was going to say it last, right? Because we do go over it.

The book is primarily about production environments, but you can't really put a model in production if you don't know how to work with it. So we have stuff on fine-tuning. We have stuff on parameter-efficient fine-tuning, on low-rank adaptation, the whole deal. YouTube is actually probably one of your best resources right now, because it has amazing content creators that show you how to do it in whatever format you're comfortable in.

So if you're a C++ developer, there are YouTube videos on how to fine-tune a model and create a LoRA using llama.cpp, right? It's not even all that difficult. You just have to convert a model into GGUF format and boom, you're there. You can do it on a CPU. It'll take a long time, but you can do it in whatever quantization you want and everything.

YouTube will meet you where you're at. If you want to learn something a little bit more industry-standard, so that you could potentially get employment in this area, PyTorch has amazing documentation and fantastic tutorials, and they're one of the best at really making it feel like you're playing with, let's say, "big boy Legos". You're building the model using their little Lego pieces. Pretty cool. If you need something a bit more high level than that,

Hugging Face, I think, is the industry standard for working in between a whole bunch of different frameworks, whether that's PyTorch or TensorFlow or whatever other framework you're working with, ONNX. Hugging Face has abstracted away a lot of the difficulty of setting up models for fine-tuning, because in PyTorch you have to build out the exact model architecture just to load the weights and then fine-tune it. Hugging Face already has the class built for you.

I would point to those. If you need more explanation, Coursera is a fantastic place. DeepLearning.AI on Coursera and on their own site, that's Andrew Ng's education stuff. That's where I got my start with machine learning: Andrew Ng's machine learning course on Coursera. It was awesome. Fantastic. Jeremy Howard is also amazing in that area of creating content for people starting out and learning from beginner to advanced level. He's at fast.ai.

Yeah, I strongly recommend all of those and your book. Yeah, we ingested a lot of those in order to write the book. Our book is a very nice high-level overview of the key things you want to be looking at and the different methodologies, from training from scratch to basic fine-tuning to model distillation to LoRA and PEFT and things like that. We definitely give a high-level overview, we give code samples and show you that.

But ultimately, if you really wanted to get into it, yeah, there are other resources out there. I know Manning has another book coming out specifically about training LLMs. There are definitely other places you can go, but if you're looking for the quick, summarized version of all of these things, our book is actually a really good resource for it.

One other thing that I like about your book is, the part where you build up the different, breakthrough moments, throughout the world of mathematics, that ultimately led to 'attention is all you need', and

Deciphering LLMs: A Crash Course in Language Features

what is it, seven years later now? The gold rush that we're observing. But just before we jump into that, there is a little bit of vocabulary that one needs to have in order to basically talk about or even read a lot of these papers. Could you talk us through that vocabulary briefly? I'm talking about phonetics, syntax, semantics, pragmatics, morphology, which until I read your book actually made me think mostly of blood tests, and semiotics.

Could you give us like the MVP version of what you need to know about these things to be able to read papers? Oh, absolutely. Matt has been learning a lot of this too, he might be better at it than me. I will throw other jargon into it. writing this book with Chris over the last year has been, mind-opening for me.

Until you can understand these words, like you were saying, it's really hard to dive into the deep end. But we go over them in our book just because we do find it so valuable. It really helped me understand very quickly: "Oh, this is what my LLMs are good at. This is what LLMs are not". And that was one of the first things we started with. But the first one, semantics, that is just like the structure of words, how things go, whether or not it sounds correct. That is what LLMs are really good at.

They're really good at making sure the semantics of words align really well. But after that, you've got pragmatics, which is what LLMs have no idea about. That is all the information around it that isn't said, right? So when you say "I'm going to find the eggs the Easter Bunny left", you have to understand what Easter is, what the Easter Bunny is, why a bunny has eggs. There's a lot of context around it that you have to understand, and that's all pragmatics.

It's information that isn't said. And that's what LLMs generally lack. Actually, I'm gonna jump in here real quick. Miko, did you like the Veľká noc example that I gave in there? Yeah, I thought it was... Yeah. Was that pretty good? I just wanted to ask because I remember experiencing that in Slovakia.

Like I lived there for years and that was a hugely beneficial portion to me to help figure out that 'no, tons of people have tons of ways of looking at things', and LLMs don't know about it. you would have to explain every bit of it to them in order to get them to understand the same things as you.

Anyway, sorry, Matt. I find that those two words in general, semantics and pragmatics, understanding those is going to get you significantly farther in just understanding how LLMs work and what they're doing. There are obviously a lot of other words that we talk about, like morphology and stuff. And I'll hand it off to Chris to talk about what he wants to add there. I would agree with Matt.

Just understanding semantics and pragmatics would get you probably 60% of the way there, and you could read new papers that come out and immediately see where they are amazing and where they are failing. I end up using the relationship between those two a lot. Take just the literal encoded meaning of your words: if I say, "I'm married to my ex-wife", there's immediately, boom, a semantic problem there. How can I be married to my ex-wife? The words don't agree with each other.

Versus, exactly as Matt was saying, if we talk about Easter, if we talk about traditions, if we talk about rituals that people have, just like the stuff that you say, if you ask someone in Slovakia, they're going to respond to you. That's normal. it's a question, they respond. LLMs don't have that, and you have to have them ingest tons and tons of data in order to even get as far as giving a response. the other ones that we can think about, syntax, I would say that syntax is largely solved.

At this point, syntax is your structure around the words, like what order do the words go in for them to be correct? Is it 'I go to the store' or is it 'I to the store go' or all of that stuff. That's syntax. It's the structure that holds your sentences, your utterances together. Morphology is delving into something that I consider to be very important in LLMs. I'm not going to say the most important, cause I think that's still semantics. There's a lot of work there.

But morphology would be how words are built. What are the fundamental units of meaning, the morphemes? Do those even exist? That sort of stuff. And we don't have to delve really deep into that. That's largely solved by tokenization, but we can see with newer models that come out that it really matters. You have much smaller models that have more novel tokenization, more novel morphology, that end up outperforming larger models on tasks that they didn't even train on all that much.

if we can put it all together really quick. The model solves syntax. Embeddings try to solve semantics, but semantics is difficult, and so they're not perfect. Pragmatics is stuff like RAG, your Retrieval Augmented Generation, and having repeated sequences within your training data, it gives it landmarks, it's context around the syntax and semantics.

Morphology is your tokenization. If I were to give an example, your tokenization provides your model with the stuff that it sees: it changes text into what the model actually sees. And your embedding strategy is moot if you don't have it. Your morphology gives your model glasses, if you want to call it that. And then phonetics is the one that we haven't even talked about.

Phonetics is the reason why we are doing a podcast and we're talking instead of just texting each other or emailing each other. Can you imagine trying to ingest a podcast that's just emails? It's horrendous. And it's because there's so much richness and depth in meaning in the language that is just lost when you strip it of its phonetic, I'm going to call it a medium.

And that can lead people to think that it has to do with sound. That's the most common modality for people, but sign language has phonetics: they have particular places where they make signs. They have particular ways that they do them to inflect and express more emotion. Their phonetics exists even outside of the verbal modality. That's important, because where I see the most improvements coming to LLMs in the future is in being able to process

phonetic information without having to convert it into text, or process phonetic information and compare it against the text. That can be incredibly helpful for your model's understanding. Those are the five features of language that we break things down into in the book. And they're largely agreed upon. There are some other linguistic features that are incredibly important, stuff like dialogue, that we haven't even covered. Beyond that, yeah, we can talk about semiotics too.

That's Charles Sanders Peirce, a smart dude from the 1800s who created a lot of structure and organization. We dive into that very lightly in the book. I don't think that you need a grounding in semiotics in order to improve your ability to interact with LLMs. But it is helpful for organizing all of these other concepts. How do we create a mental map for how stuff needs to be processed within a machine learning pipeline?

How do we make sure that we're not mixing things up and inadvertently destroying our model's ability to see things, right? If we put embeddings before tokenization, it breaks your process. it's helpful for organizing things and it's also helpful for understanding how conversation happens and how I say something and it moves through

Defining Language: A Philosophical Dive

your mind to create an interpretation. that's by far like the most theoretical out there concept that we get into in the whole book. And together you came up with this language definition as being, as a concept, "an abstraction of feelings and thoughts that occur to us in our heads". And I'll be honest, I initially thought it sucked. because it's a little bit, it's a little bit wishy washy. I wanted something a bit more concrete.

But then, as I looked up all the other definitions in different contexts, I was like, Okay, I can clearly not come up with anything better than that. So I think I'm ready to yield now and say that this is actually capturing it pretty well. Putting abstraction in it, sounds also vaguely techie, so that helps.

Exploring the Essence of Language and Communication

How did you come up with that definition? I didn't. I would love to take credit for that. No, that definition has been around for a long time within the linguistic community, and one of the best examples of why it really works is babies, right? Babies have no idea how to express their thoughts, but somehow they get it across.

when a baby is happy, we can tell when a baby is crying, we can infer that it needs something, babies are able to communicate without language, meaning that language is something that we created to shorten the conversation. The reason I called it an abstraction is we have abstract ideas. You probably come up to a situation where you're feeling something, and you don't know the words to really express it.

I think that's a pretty universal human adult thing that has happened at least once in your life. That's happened to me a bunch of times, and it really illustrates that "Oh man, the language that we use is actually describing "what's in here", it isn't "what isn't here". it's a hard concept. Once you get there though, it really helps with LLMs, because you realize that the language that we're using is a crutch. And that's all that the LLMs have in the first place.

And so this is another thing that goes towards the miraculous nature of them working at all: they're dealing with an abstraction of an abstraction, at least, in order to communicate with us. So let's say that I buy that. My first question would be, going back to your baby example, isn't what the baby's doing some form of a language? What's the line between what is and what isn't? What's the line between a language and communication? I like that.

That's a question that I bet a lot of people have, and it'll probably go in the appendix. We'll probably talk about this in an appendix for curious readers. So the line between just straight-up communication and a language... there are a lot of criteria, but one of my favorite ones is the ability to talk about something that is not physically present. Bees have communication. Gibbons have communication. Babies have communication.

Babies, though, are unable to express any ideas about stuff that is not physically present, you can't talk to a baby about theoretical physics. I mean you can, but what are you gonna get back? You can talk to a baby about my Star Wars posters, right? I can point at them because they're right there, but if I'm in a different room, baby's not gonna be able to talk to me about them And that's the difference, It's one of them.

That's the one that I'd like to highlight though is that the fact that we can speak about things that are not physically right here with us, that we can point at, that's the distinction between communication and language, because babies are communicating.

Diving into Language Models and Their Evolution

But once they get to that point, it really deepens the interaction that you're able to have with them. So now, equipped, with all that knowledge, I'm gonna try to prompt engineer you and give you this prompt. I'm a five year old baby, that has language now, and who's very curious about understanding how we got from bag of words, counting frequencies all the way to LLMs and ChatGPT and people worrying about the Terminator actually coming into life.

From Bag of Words to N-Grams: The Evolution of Language Understanding

Could you walk me through the high level ideas that were important, build up to what we're seeing today. The bag of words is really easy to think about, especially if you keep your tokenization incredibly easy. Sorry, this is, I'm already out of five year old territory. You just count words. If I take that sentence, "you ; just ; count ; words". Each of those has a count of one. If I add another sentence, "I like Star Wars". All of those still have a count of just one word.

And then if I add another, "do you like Star Wars?" You and star and wars all go up to two. That's it. That's a bag of words model. Why is it important? What can it do? I think that bag of words is the first model that we really have to explain being data-driven. It's just keeping track of things. If you look at a bag of words model for your workouts, it's just how often do you do certain things? How often are you doing a bicep workout versus doing a pectoral workout?

How often are you doing which thing? It's just being data-driven. It's the first step, right? You're not looking at any features. You're not really caring about how these things interact with each other. You're just keeping track. So I guess with that information from your example, I can guess whether you are skipping leg days, and I can see what's important to you.

Or, if I'm counting words in U.S. presidents' speeches, I can say, like you described in your book, whether it's a wartime or a peacetime president, and what they really try to get across. This is something that you can use for anything you count. In soccer, which players make goals and how often: that is a bag of words model. You're not tracking words. It's a bag of goals, or it's a bag of whatever else. So what's the next step from there?
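The counting described above fits in a few lines of Python. This is a minimal sketch that assumes the simplest possible tokenization (lowercase the text and pull out runs of letters); the function name is made up for illustration.

```python
from collections import Counter
import re

def bag_of_words(sentences):
    """Count how often each lowercased word appears across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z']+", sentence.lower()))
    return counts

# Two sentences: every word has been seen exactly once.
bag = bag_of_words(["You just count words", "I like Star Wars"])
print(bag)

# Add the third sentence and the repeated words go up to two.
bag_after = bag_of_words(
    ["You just count words", "I like Star Wars", "Do you like Star Wars?"]
)
print(bag_after.most_common(4))
```

Run on these three sentences, "like" reaches a count of two as well, alongside "you", "star" and "wars". Swap the words for goals or workouts and the same counter is a bag of goals or a bag of exercises.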

Bag of words was really monumental just because it's so simple, but it's so powerful, because, you know, the words you use when you're describing sports are very different from the words you use describing politics. And so just picking up on certain words and their counts helps us understand the overall subject of what it is. But it really lacked any sort of structure, because the order of words also matters, right?

So the cat in the hat versus the cat's hat, they both have the word 'cat', they both have 'hat', but mean different things because of the order of the words, and so that kind of led to, n-gram models. instead of just simple words, we would also take n-grams, which are, n number of words in a certain order, and we would start cataloging those. And so, more than just words, we're getting n-grams.
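To make the point about word order concrete, here is a small sketch of extracting n-grams. The second phrase is an illustrative reordering (an assumption added here, not the example from the conversation), chosen so that both phrases contain exactly the same words and only the order differs.

```python
def ngrams(text, n):
    """Return the n-word sequences in text, in the order they occur."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

first = "the cat in the hat"
second = "the hat in the cat"  # same words, different order

# A bag of words cannot tell these two phrases apart...
same_bag = sorted(first.split()) == sorted(second.split())

# ...but their bigrams (n = 2) differ, so some ordering is captured.
first_bigrams = ngrams(first, 2)
second_bigrams = ngrams(second, 2)

print(same_bag)
print(first_bigrams)
print(second_bigrams)
```

Counting these tuples instead of single words gives a bag of n-grams.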

And that is improving our understanding of the language because now we have embedded some syntax in it. We understand some ordering of words and that's able to improve our categorization. however, from there though, we're not really able to make any predictions

The Leap to Bayesian Techniques and Markov Chains

of what next word is about to come up or anything like that. When it comes to bag of words or n-grams, they're really more for categorization. And so that kind of led to Bayesian techniques, and so not to really go deeply into Bayesian statistics, but... Yeah. I'm sorry. Sorry to all Bayesian fanboys. We're going to go about as deep into this as we did into pragmatics.

It's just, you know, based off of the priors of the words that came before, we can then predict the next word to come up. And so if every single time in the text after 'I am a' we saw 'man', then it's going to predict that the next word is 'man' instead of other words that easily could have come up, like woman or girl or boy or cook or professional athlete.

Certain things that could come up are gonna be a lot rarer. Like 'I am an astronaut': a lot fewer people have been astronauts in order to say that, so it's gonna have a very low probability of being the next word predicted. But it gives us this opportunity to look at what the next predicted word is. From there, we move on to what's called Markov chains. We're swinging back towards the n-gram model, but it gives us a bit of prediction of what's next.

I actually really love Markov chains because they provide very fast predictive text. Markov chains are essentially what's been fueling the predictive text for Google search and things like that; it has been the technology that's really been leading that charge for a really long time. And it's just a very basic way that we're using n-grams now to make predictions of the future. You can think about it like this, and obviously I'm reducing it.

That's not exactly how it works, but it's a bag of n-grams where you take a state at each point in a sequence, and look at all the times that previous n-grams have occurred in that sequence, and then from that you can model the probability of what comes next. Instead of just looking at each n-gram by itself, you give it state. And it's a bag of n-grams. It's really fun. It's a probabilistic bag of n-grams. That's how the chains work.
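A "probabilistic bag of n-grams" can be sketched as a table that maps each state (the last few words) to counts of what followed it. The tiny corpus below is invented to echo the 'I am a man' and astronaut examples, and the function names are hypothetical; real predictive-text systems are far more elaborate.

```python
from collections import Counter, defaultdict

def build_chain(sentences, n=2):
    """Map each n-word state to a Counter of the words that followed it."""
    chain = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n):
            state = tuple(words[i:i + n])
            chain[state][words[i + n]] += 1
    return chain

def next_word_probs(chain, state):
    """Turn the follower counts for a state into probabilities."""
    followers = chain[tuple(state)]
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

# Invented toy corpus: 'man' is common, 'astronaut' is rare.
corpus = [
    "I am a man",
    "I am a man",
    "I am a man",
    "I am a woman",
    "I am an astronaut",
]
chain = build_chain(corpus, n=2)

probs_after_am_a = next_word_probs(chain, ("am", "a"))
probs_after_i_am = next_word_probs(chain, ("i", "am"))
print(probs_after_am_a)
print(probs_after_i_am)
```

Here the state ("am", "a") predicts "man" three times out of four, while the path that leads towards "astronaut" starts from a follower seen only once in five sentences.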

One of my favorite parts, and I like that you kept track of this quote here, is that Markov models represent the first comprehensive attempt to actually model language, which is funny, because Markov was not trying to model language initially, he was just trying to win an argument. And he eventually used it: he looked at distributions in particular Russian authors. He looked at distributions in Russian government officials' speeches.

The Breakthrough of Continuous Bag of Words and Embeddings

He knew what he had and he believed in it, and I love that. What a great piece of history. Anyway, continuous bag of words is where we start essentially taking the logic of a Markov chain, where, "oh, if we keep track of where things appear and how often they appear there, then it helps us be able to model what could appear next", right?

And this is the first moment where we're really coming full circle all together and going right back to bag of words and just adding context for position and adding context. from the context of the bag of words, the literal counting of things, we're able to create embeddings, right? I don't know if a lot of people are aware, but bag of words is how Word2vec came to be.

Word2vec was huge in, I think, 2015, 2016, and it stayed huge, Gensim is still one of the most downloaded natural language processing libraries in Python for Word2vec and for GloVe. Continuous bag of words, just adding that one little thing. adds all this context so that we can create embeds. We can create vectors that we can compare between words. this all comes from the logic of I forgot that dude's name. Tell me the company that a word keeps, and I'll tell you what that word means.

Just what's around the word influences its meaning, which goes directly against a lot of previous linguists' thought that syntax and semantics are absolutely not related at all. That's one of the big things from Chomsky: "colorless green ideas sleep furiously" is nonsense, but there's some semblance to it. There's some sense to it, and taking advantage of that with continuous bag of words, we can create, like I said, these vectors that we can then compare, and that's really interesting.
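The "company a word keeps" idea can be illustrated without any neural network. The sketch below is a simplification and an assumption, not how Word2vec's continuous bag of words is actually trained: it just counts which words appear within a small window of each word, and compares the resulting count vectors with cosine similarity. The corpus and function names are invented.

```python
from collections import Counter, defaultdict
import math

def context_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "stocks fell on monday",
]
vectors = context_vectors(corpus)

cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_stocks = cosine(vectors["cat"], vectors["stocks"])
print(cat_dog, cat_stocks)
```

"cat" and "dog" share neighbours such as "the" and "drinks", so their vectors point in similar directions, while "cat" and "stocks" share none.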

That is what fuels LLMs now: this exact same continuous bag of words modeling technique. It's been built upon a little bit, but that bag of words is still fundamental to how embeddings are created. Bag of words and positionality and, like we can get into, RoPE scaling, all of these rotational plugins that you can use to get longer sequences embedded correctly, or at least better. That's one of the hard things when we're talking about language modeling: what is good and what is better.

A lot of people like to appeal to "this is how humans do it". I don't know if humans are incredibly efficient when we do it, but, like, it's fine. Then we get into the 1960s, the very first perceptrons... Before we go there, can we spend a little longer on what the embeddings actually are? You mentioned Word2vec, you mentioned word vectors and embeddings, but for somebody listening to us from the start, it's probably not clear what that is. Can we delve a little bit? Yeah, absolutely.

So embeddings are the vectors that come out of models like continuous bag of words. When you look at a modern machine learning pipeline, there are multiple models that you go through, and we just abstract all of it and call it a model, just one model. When you look at GPT-3, ChatGPT, it has a model, they call it a byte pair encoding model, to do its tokenization. And then it has a model to do embeddings. That model is fundamentally a continuous bag of words.

It's built on top of it a little bit with, like I said, keeping track not just of how many times a word occurs, but how many times a word occurs in particular positions. And then on top of that, it keeps track of the flip: it's either an odd or an even position within a sentence, and it assigns it cosine or sine based on whether it's an odd or an even position.

in order to try to insert some of that meaning back into it, that was taken out from the tokenization, cause tokenization is just assign each token a number in a dictionary, and you have a way to get all words into that dictionary, and then come back out of that dictionary. So it takes all of the meaning out of it. It's just one number. The embeddings attempt to put some of the meaning back into it using positionality, using continuous language modeling techniques.
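The "assign each token a number in a dictionary" step, and the way back out again, can be sketched as below. Splitting on whitespace is a deliberate simplification and an assumption made for this example; byte pair encoding, as mentioned above, works on subword pieces instead. All function names are hypothetical.

```python
def build_vocab(texts):
    """Assign each distinct token an integer ID, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Text in, list of token IDs out."""
    return [vocab[token] for token in text.lower().split()]

def decode(ids, vocab):
    """List of token IDs in, text back out."""
    id_to_token = {i: token for token, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

vocab = build_vocab(["I like Star Wars", "do you like Star Wars"])
ids = encode("you like star wars", vocab)
round_trip = decode(ids, vocab)

print(vocab)
print(ids)
print(round_trip)
```

The IDs carry no meaning by themselves: nothing about 5 says it is closer to 1 than to 3. Putting some of that meaning back is the job of the embedding step that follows.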

embeddings really simply, they're not perfect, they're just an approximation of that meaning, and because we are able to put it into a vectorized space, we're able to take these words, put them in a vectorized space. We can start doing things that start to make sense and start to make us feel like we're headed in the right direction. the classic example is, when we first discovered embeddings, we took the embedding of 'king', we subtracted 'man' from it.

We then added the embedding of 'woman', and the closest embedding to that was 'queen'. With that, we start to get this vectorized space that starts to make sense. These words start to have connections to each other, and they start to make semantic sense to us as humans. However, embeddings are still an approximation, right?
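The king, man, woman, queen arithmetic can be reproduced with hand-made vectors. The three-dimensional embeddings below are invented purely for illustration (real embeddings are learned and have hundreds of dimensions), and the helper names are hypothetical. As is common for this kind of analogy test, the input words themselves are excluded from the candidates.

```python
import math

# Invented toy embeddings; the axes loosely mean (royalty, gender, fruit).
embeddings = {
    "king":  [0.9, 0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1, 0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c, excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, embeddings[word]))

result = analogy("king", "man", "woman")
print(result)
```

Because these vectors were built by hand to make the analogy work, the answer comes out clean; with learned embeddings the nearest neighbour is only an approximation.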

So if you were to do that with kind of every combination, it's interesting what you get when you start taking words that don't necessarily make any sense and adding or subtracting them together. A good quintessential example of that is: you take the vector for 'king', you subtract the vector for 'wolf', and you add the vector for 'prince', and you get the vector for 'village'. Or at least pretty close to it.

That doesn't make any sense, there's still lots of, okay, these are starting to add meaning, not always, but sometimes, like it's an approximation and embeddings ultimately.

it's something we're constantly trying to learn and improve If your listeners are wondering how to keep up in space, like embeddings are probably the number one thing to keep track of OpenAI recently released, logic for being able to change the size of embeddings, to me, like being pretty deep into this, it feels groundbreaking.

Because normally you have to structure these vectors so that they're all the same size, and each point within that vector represents meaning, negative or positive. It's very structured and not malleable. And so the idea that you could take all of your embedding space and change the size of it at your whim is just amazing. That's one of the things that I see as a huge groundbreaking piece of technology that OpenAI is continuing to lead in.
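
Editor's sketch of one way an embedding can be resized. The speakers don't describe the mechanism, so this is an assumption: keep the leading dimensions and rescale the result back to unit length. That only gives useful vectors when the model was trained so that the leading dimensions carry the most information. The function name `shorten_embedding` and the input vector are hypothetical.

```python
import math

def shorten_embedding(vector, new_dim):
    """Keep the first new_dim values and rescale the result to unit length."""
    truncated = vector[:new_dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return truncated
    return [x / norm for x in truncated]

full = [0.5, 0.5, 0.5, 0.5]  # hypothetical 4-dimensional unit vector
short = shorten_embedding(full, 2)
```

The rescaling step matters because similarity search usually assumes unit-length vectors, and a truncated vector is shorter than the original.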

Yeah, and if you're ever in doubt, like, oh man, is this paper important? If it's about embeddings and doing really cool things with embeddings, probably. I think the one question for anybody trying to picture that: what's the dimension of all these vectors? Is that the entire vocabulary? Are there different techniques? Yeah, currently the number one dimensionality, the unspoken industry standard, is 768. That's a number that pretty much every NLP practitioner knows.

The reason OpenAI's embeddings initially were really cool, and they thought they were super dense, is they were, what, 1536, which is 768 doubled, right? You're going to see multiples of 768 all over the place here. And that's not because that number is super significant. That's just the first embedding space that we found that tended to work better than the others. So that's the more-art-than-science part of this. It's the brute-force testing.

Yeah, people went through and tested 767, 766, 765, landed on that one, and it worked. That's the best one that we've found so far. Even the doubled embeddings from OpenAI offer a marginal improvement

Unveiling the Power of Multilayer Perceptrons

in that understanding space. I think we can move on to the multilayer perceptrons. Okay. A perceptron is essentially just a linear transformation of data. If you look at it from a statistical standpoint, if you have three things about something, you can just add those things together and you get a description of that thing, right? Just summing them. And that's abstracting it a little bit much, especially if machine learning practitioners are listening to this.

We can do linear transformations. The easiest way to think about it, for me, is that you perform one action on a group of features and you get something out of it. That, by itself, is not really helpful.

Once you get into having multiple layers, and this is the MLP, the multilayer perceptron, where you are chaining these transformations together, and in between those layers you have nonlinear activation functions so that you can create nonlinear relationships between sets of linear transformations, you can get into really cool spaces.
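
Editor's sketch of a multilayer perceptron forward pass: linear layers with a nonlinear activation (ReLU, chosen here as an assumption) in between. The weights are hand-set, hypothetical values that compute XOR, a relationship no single linear layer can represent. The function names are illustrative.

```python
def relu(values):
    """Nonlinear activation: negative values become zero."""
    return [max(0.0, v) for v in values]

def linear(inputs, weights, biases):
    """One perceptron layer: a weighted sum of the inputs plus a bias per output."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def mlp_forward(inputs, layers):
    """Apply each (weights, biases) layer, with ReLU between layers but not after the last."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = linear(activations, weights, biases)
        if index < len(layers) - 1:
            activations = relu(activations)
    return activations

# Hypothetical hand-set weights that compute XOR.
layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer: 2 inputs -> 2 units
    ([[1.0, -2.0]], [0.0]),                    # output layer: 2 units -> 1 output
]
outputs = [mlp_forward([a, b], layers)[0] for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Removing the `relu` call collapses the two layers into a single linear transformation, which is why the activation between layers is the part that matters.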

And one of the first things that any machine learning practitioner learns, at least among a lot of the people I've talked to, is that just adding more layers does not make it better. In fact, the cool part is finding the minimum number of layers that you need in order to model the relationship between two points. That's a little bit abstract. I think the quintessential example is detecting which type of iris flower it is from an image.

We don't necessarily know how many features there are, but we can vectorize the entire picture of an iris flower. And then we can discover that the minimum number of layers is, I think, like five in order to go through and actually get really good accuracy on detecting which iris flower it is. Yeah, multilayer perceptrons are the feed-forward networks.

Those are the basis of everything that comes after, whether it's recurrent networks or even Transformers, which have feed-forward networks inside them. That's the basis of it right there. How do you choose the sizes? Is it all just trial and error as well for the number of layers and the sizes of the hidden layers? Are there any rules that always work?

Yeah, so going through a feed-forward network, and this comes from trial and error, from a lot of people trying different stuff, but generally your initial dimensionality could be something like 768, right? Your initial hidden layer. That's a good number for it. That's an embedding dimension that we're familiar with. But then we want the next hidden layer to be double that. And then we want to go smaller and smaller until we hit our final output classification layer.

So we want to have a big jump and then go small. The way to think about that theoretically is you want to model the number of features that you are looking for, and then modeling double that is just a good way of capturing all the features that we might not know about, that we might not even be keeping track of. Let's see if the model can figure them out mathematically. And then we want to narrow it down. Narrow it down.

Narrow it down until we get to our actual classification, which in language modeling is, what is the next word, right? Got it. So double it, and then boil it down to the size that you're actually looking for across a bunch of layers, and hope for the best. Okay. And that's why when OpenAI doubled the embedding layers, it was a marginal improvement, but it's predictable, because that's normal. People do that.
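
Editor's sketch of the "double it, then narrow it down" heuristic as a list of layer widths. The function name `layer_widths`, the `shrink` factor, and the ten-class output are hypothetical choices for illustration; the speakers describe the shape of the heuristic, not specific numbers beyond 768.

```python
def layer_widths(input_dim, num_classes, shrink=2):
    """Widen to double the input once, then repeatedly shrink toward the output size."""
    widths = [input_dim, input_dim * 2]
    next_width = widths[-1] // shrink
    # Keep narrowing while the next layer would still be wider than the output.
    while next_width > num_classes:
        widths.append(next_width)
        next_width //= shrink
    widths.append(num_classes)
    return widths

# Hypothetical setup: 768-dimensional input, 10 output classes, shrinking 4x per layer.
widths = layer_widths(768, 10, shrink=4)
```

Here `widths` comes out as 768, 1536, 384, 96, 24, 10: one big jump up, then a steady narrowing to the classification layer.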

Are there any particular well-known configurations of these neural networks that just work for a bunch of problems, something that you keep seeing over and over? Or is it more custom for every problem, and you just follow the heuristics that you just described?

As far as model architecture, no, it's basically the heuristics that I described, and then people will experiment and tune them and find that, oh man, statistically, if this layer of the model is bigger, then it works better. But it follows that general structure. I think one of the papers that I would point to for this is ULMFiT, which is basically a methodology for fine-tuning.

But it experiments with gradual unfreezing of layers, where when you're training, you start with only the very last classification layer and everything else stays exactly the same. You only train that one. And then you unfreeze, unfreeze, and test each layer as you're training. And that tends to help things. Even now, that is abstracted within the Hugging Face Trainer class, and that's abstracted within pretty much every model.fit methodology, because it works.
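
Editor's sketch of a gradual unfreezing schedule: which layers are trainable at each epoch, starting from the last layer and working backwards. It deliberately avoids any real training library; the layer names and the one-layer-per-epoch pace are hypothetical assumptions.

```python
def unfreezing_schedule(layer_names, num_epochs):
    """For each epoch, list the trainable layers, starting from the last layer only."""
    schedule = []
    for epoch in range(num_epochs):
        # Unfreeze one more layer per epoch, working backwards from the output.
        num_trainable = min(epoch + 1, len(layer_names))
        schedule.append(layer_names[-num_trainable:])
    return schedule

layers = ["embedding", "hidden_1", "hidden_2", "classifier"]  # hypothetical names
schedule = unfreezing_schedule(layers, num_epochs=5)
```

In a real framework, each entry of `schedule` would decide which parameters receive gradient updates during that epoch; everything not listed stays frozen.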

The Revolution of Attention Mechanisms and Transformers

Awesome. What's next in our journey? Probably just the fact that multilayer perceptrons struggle with sequences, right? Even if you try to embed things and keep some of that positional encoding within your embeddings, they struggle to model multiple things where the order of them matters. And in language, the order matters sometimes, right?

Sometimes it's normal to say gibberish, and knowing when is extremely difficult. To solve that, I don't know if we need to necessarily go into recurrent neural networks, but we definitely need to talk about LSTMs, the long short-term memory networks, which are recurrent neural networks to start with, but they added some really important things. For example, when I'm talking, you are kind of consciously predicting what I might be saying.

You can hear what I'm saying and you're trying to figure it out as it goes on, to understand it. We call that active listening. That's what happens. LSTMs model that a little bit, in that they take the sequences and allow the model to try to predict both going forwards and backwards, instead of just doing the one way. That bidirectionality is computationally expensive.

It takes a lot longer, which is why I think these are not used as much anymore, but it was really novel and it did help a lot in predicting sequences. It was phenomenal for language modeling. Beyond that, there's attention.

Within LSTMs, when attention came out, adding attention to whatever you were doing was phenomenal. It added an extra layer of nonlinearity: when it was going through and trying to search for what word might come next, it not only had all the modeling that we've already talked about, it also had the ability to search, and to search for not that exact thing, but something similar. And that just exploded in popularity because it works. It was phenomenal.

However, the difficulty with LSTMs is they're computationally expensive and slow. It's a lot of math that you have to do in order to get through every single layer, let alone trying to predict and stream those predictions in a sequence. You're going at one token per 30 seconds. And that's difficult for models that are the same size as transformers, for example.

So yeah, it was a lot of really cool stuff that helped us solve basically how to get to the next step. It was just computationally expensive and slow. Basically, not very practical in use, but important. Talking about practicality, I think it's great that it's accurate, right? I think accuracy is incredibly practical. I don't think that, from a customer experience, that's practical, right?

Customers don't like waiting a long time for the right answer, because they might be able to find the right answer in that amount of time anyway. And then from there, do we jump to attention? At this point, we've gone through the history of the field modeling language, building up, and we've finally reached attention, right? And attention is the backbone of transformers, which is what LLMs are built off of. And attention just adds a nonlinearity.

And it was just a breakthrough in how we're able to connect the words. So attention, really quickly, is just creating these dictionaries, key-values, of every word to every other word in the token space. And then it's able to query it. For each word, we're able to build up the importance of the other words that matter to it.

And it's in a quadratic space, so it's much more than a linear space, but it's a reasonable amount of time to compute these dictionaries, the key-values, and then query them and understand the importance of other words. It's the backbone of what all these different models are doing.

And even as Chris mentioned, we could inject attention into these previous RNNs, LSTMs, et cetera, but it was the backbone of building the transformer model, which came out in the catchy paper "Attention Is All You Need". It's a meme, right? We've seen a whole bunch of other papers afterwards that are like, "no, this is all you need",

or "no, this is all you need", or "no, you don't need that". But the reason it's a meme is because they took out everything that was supposedly novel about the LSTM, the long short-term memory. They used only attention and feed-forward networks. Could you give us an example of what that would look like on a very stripped-down thing? What does that dictionary look like? For visualization and decode? No, just for the attention itself, right? You mentioned a key-value from basically every combination.

You have to precompute every combination within the vocabulary? You can take a sentence that you're feeding into the attention algorithm: "the cat in the hat", since I used that earlier. And so essentially you would have a dictionary where "the" is compared to every other word, "cat in the hat", and it's coming up with similarity metrics of the importance of all the other words.

And then you would do that for "cat", it's going to do it for "in", for "the", for "hat", and it's going to come up with a dictionary, essentially, of key-value pairs for all the other words, helping you understand the importance of the other words that are in there.

And then the query algorithm that runs essentially helps us predict the next word that's coming afterwards, based off of how important all of those dictionaries are, and adding them. And all of this happens in quadratic time. One of the nice novel things about this is the query and key vectors. Your query vector is the word that you're looking at in the utterance, and your key vector is the key in the dictionary.

Those two vectors are not one-hot encoded. We haven't even mentioned this, but a one-hot vector is 0, 0, 0, 0, 0, 0, 1, 0, 0, 0. That's how a lot of these things had been represented previously, coming off of the bag of words: the idea that, hey, we can model these things. We can create vectors that are just, did this word appear or did it not? And where did it appear?

That was positionality. And with "Attention Is All You Need", you can immediately see a problem with one-hot encoding in that it's very sparse, especially as you're getting into 768 dimensions, right? You have just one 1 and a whole bunch of zeros, and those zeros don't really matter. And so one of the breakthroughs here was using dense vectors for queries and keys in order to get values that are also dense. I think one of my favorite visualizations of it is from Jesse Vig.

It's called BertViz, on GitHub. I've used this in production environments in order to show that, hey, our model is not understanding this, because look at the attention: all of the queries are related to the key of the wrong word. If you look at words with semantic ambiguity, I think the quintessential one is "time flies like an arrow", where "flies" is also a word that could mean multiple small bugs buzzing around. How do we know that it's not that word?

It's because of the position in the sentence that we know that it is a verb, and it's referring to "time" and it's referring to "arrow". And we can see that predictably within attention, because that word is determined to be important. That query is determined to be important as it relates to the keys of "time" and "arrow" within query-key-value attention. That's what that dictionary looks like. That's why it's useful.

And I guess the representation of the importance, how do we actually come up with that? I think it's dot product. We're comparing the vectors between the query and the key. Dot-product attention is, I'm pretty sure, not where it started, but I think that's where we're at right now. That's the industry standard that everybody uses. It's just multiplying the vectors together.
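
Editor's sketch of dot-product attention as described here: each query is compared with every key by dot product, the scores are turned into weights with a softmax, and the weights mix the value vectors. The scaling by the square root of the key dimension follows the "Attention Is All You Need" formulation. Two simplifying assumptions: the token vectors are hypothetical hand-picked numbers, and the same vectors serve as queries, keys, and values, whereas real models derive each through learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Compare this query against every key: this is the quadratic part.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Blend the value vectors according to how important each key was.
        mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for three tokens; real models learn these.
tokens = ["time", "flies", "arrow"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
```

Each row of `weights` is the "dictionary" for one token: how much importance it assigns to every token in the sentence, including itself. Tools like BertViz draw these rows as lines between words.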

Essentially you take the dot product of the two vectors, and that's where we get the comparison and the relative importance values. It's not magic, it's math. Kind of the same thing from time to time? Okay. And then with that, we've got the GPT, the generative pre-trained transformer model. What's so groundbreaking about that? As opposed to the original transformer, they only use a decoder.

So the original transformer had attention-based encoders, which changed your embeddings into essentially another embedding that was then taken by your decoder and used to predict the next word. So it had two networks linked together in the middle in order to produce your next word. And the reason this is important is it goes back to that original idea that we talked about, of language as an abstraction, right?

The authors of "Attention Is All You Need" looked at that abstraction and were like, hey, can we model that? And that's what an encoder is. When you look at models like BERT, it's taking your input and putting it into a new abstract space with lots of nonlinear transformations. Incredibly useful.

And so the GPT models were groundbreaking because they were like, we don't need that. We just need the decoder, and we're just going to use syntax, basically. And the thought process there is that syntax is related to semantics more deeply than linguists are able to really conceptualize in an easy-to-understand way. We know that it's true, and we know that it's predictive, especially looking at how good GPT-3 and GPT-4 are.

And even looking at the open source stuff, Llama is a decoder-only network and it rocks, right? I have a suspicion that we're going to hit a point later where Google is going to blow everybody out of the water with another T5, another version that puts the encoder back in. I don't know how we're going to get to that point, though, because the decoder-only models work so well.

And they're faster, they're less computationally expensive, because you're taking probably a third of the model and just throwing it away. So you mentioned Llama, and I think that might be a good segue from what essentially is about a third of your book. So for everybody else who wants to go and jump into more details and see actual Python implementations of a lot of what we just covered,

The Hall of Fame: Landmark Models in the LLM Landscape

the book is called Production LLMs. It's available on manning.com, and I'm pretty sure you're going to love it. So going back to Llama, let's do a little hall of fame rundown of the landmark models from the last few years. Where should we start? I would probably start with the original transformer. They deserve credit. Vaswani et al., a lot of the people who wrote that paper, have gone on to found or co-found companies that are now competing in this space.

Whether that's Anthropic or Character.AI, those are the people that created that Transformer, and they're still building on it. I think that's the first one that I'd say for the Hall of Fame. What would you say, Matt? I think part of this question is what is the first LLM versus what is the first Hall of Fame model. And yeah, Transformers, BERT. BERT is incredibly powerful. I think, because it's so small and it's not in the LLM space, it's often overlooked.

And I think many companies are still looking at these massive LLMs for problems they could solve with a simple BERT model. But because they're only getting into this space now, they think immediately, hey, we have to use an LLM, right? And they didn't care in 2017, and they pass over what was there. And I go back, I said it before, I love Markov chains. They're amazing, and they're really powerful for what they do really well.

And even then, a lot of people could just use Markov chains for a lot of the problems that they're trying to solve with LLMs. But LLMs do give that flexibility, with just their massive levels of computation. I think if I was to point to a model that I thought was just really powerful, it would be BLOOM, actually. BLOOM was essentially the first massive LLM that was built, and it was built completely transparently. It was a research project, funded in large part by the French government.

It was built completely transparently and completely in the open. And even though the BLOOM model today isn't seen as a very competitive model, a lot of the open source learnings, a lot of what we have nowadays, is because of what those researchers figured out while they were working on BLOOM. We got amazing libraries out of it, like DeepSpeed and other things like that.

It really boosted the open source community, which has been one of the major driving factors of LLMs today, and probably a large part of why we could even write our book. Because if the open source community wasn't where it is today, there wouldn't be much we could really tell people other than, oh, you've got to go work for Google or Microsoft. How would we know any of it, right?

Yeah. We know about it largely because we've been involved in open source and we built off of what those scientists at BLOOM did. BigScience. So that's 2022, right? That's a couple of years now. Yeah. And then we had Llama, which became important, and Llama 2. Yeah, even more important.

Yeah, and it's largely just because, I don't remember the username of who did it, but whoever put that PR on the original Llama GitHub that had the torrent link to leak the weights, that's the hockey stick moment for LLMs, right? That's what made them available to everybody.

That's what enabled Stanford to create Alpaca and show that, oh man, you can make the model better with only 50K responses. You don't need tons and tons of data in order to fine-tune and get very good results and improve in every metric. Yeah, everything since then has just been building off of that exact same momentum from whoever leaked that first Llama, and Meta has benefited greatly from it too.

They now have a very open, I wouldn't say completely, but a very open attitude towards the space, because they recognize how advantageous it is to have other people building on top of their model and be considered an industry standard. Yeah, they've really leaned into it recently, right? And how big was their stock jump, right? All of the underlying architecture, right?

These open source programmers, or even just the NVIDIA programmers, are able to go in and, because they know everything about Llama, they're able to optimize CUDA kernels and everything. And so Llama has gotten faster and more proficient. With llama.cpp, we're able to run it on just a CPU. There are lots of benefits because they gave us the architecture. It was leaked, but now they've leaned into it. They've essentially given it to us.

And so, yeah, we just need them to release the data that they used to train it, and it's completely open, right? But even the data, they've told us a lot about what the data is, right? We don't have the exact data, but we know essentially, with RedPajama, what those datasets were built off of, what they were. And so we're able to replicate it really closely in the open source community.

Llama, I don't know if we have a really good list of Hall of Famers, because it's difficult to see what's going to stick around, partially because it's so difficult to evaluate these models as opposed to BERT, right? Large BERT had 300 million parameters. You can run stuff to see how well those parameters are doing, you can tune the hyperparameters, you can run evaluations to see how each one is performing and still go relatively fast.

When we're getting into the 7 billion parameter range and the 13 billion parameter range and the 70 billion parameter range, it's much more difficult and computationally expensive to evaluate on that level. And we don't even have the ability to describe what all the parameters are doing. And so our evaluation metrics are difficult to gauge. You look at MMLU, you look at a lot of the benchmarks that people are running, and they're useful.

But ultimately at this stage, we still have to go download those models and test them against our own use cases to see if they perform better. And that's incredibly time consuming. We could talk about a lot of the models that have come out, like Capybara, we can talk about Nous Hermes, we can talk about WizardCoder, and they're all great. I don't know which ones are going to be the Hall of Fame, the next industry standard, though.

There are definitely some other models that we love and talk about in our book, like Falcon, which came out of TII in Abu Dhabi, right? Amazing model. It's, Micu. The latest Falcon is one of the largest open source models, and it's come under the Apache 2 license. So it's completely open source, the very first model that's fully open source. There's definitely amazing progress being made and lots of different models to be paying attention to.

But yeah, one of the biggest ones to pay attention to right now, I think, is OLMo, not because it's competitive and performant, but because, like Falcon, it is 100% open source. You can see the data they trained on. You can replicate exactly their experiments. That's going to be one of the biggest drivers in this field, where you look at a lot of the innovation that's happening, and it's happening over files that people are passing around on torrents.

It's happening on Reddit, where random users are coming up with NTK-aware scaling, and RoPE scaling after that. And they're coming up with more stuff because they have time, and they want to help.

A lot of these people are experts and they're just anonymous, and that's incredibly important for the space, because we're finding that people who deal with these models and use them 24/7 have skills that the researchers don't necessarily have. And that's difficult to admit, being on the research side of it, but it's true. So that's the one coming from the Allen Institute for AI, right? The one it has, yeah, I think they're also open source in the

Predicting the Future of Language Models and OpenAI's Position

actual training code as well. They are the whole thing. That's pretty awesome. So with that caveat out of the way, hedging your predictions, we don't know what's going to happen tomorrow. Do you see any one company getting ahead of the others? GPT-4 is still holding up well against a lot of these models, which makes me think personally that they have a few tweaks and hacks they haven't shared, which helps with their multi-billion valuation.

Do you see anybody running away from the crowd, or is it too late now? The cat's out of the bag and the progress is going to come from the mass of people. I don't know. I know that I was texting with a couple of people the other day, talking about GPT-4 and how it is still relevant. People talk about the performance decrease, but it's still relevant, and every week, every model that's coming out is getting compared against GPT-4.

And they're finding that most models are more performant than GPT-4 on certain things, right? It's like comparing Rain Man to an average human and asking what tasks they're good at, right? If it's going to McDonald's and ordering your own food, Rain Man is not great. And you've just got to find the model that's better. A good example of that with GPT-4 is math. If you need a model to perform calculations for you, that's not it.

You have Alpha Wolf, you have Goat, and even just vanilla Llama 2 is better at math than GPT-4, even though they weren't explicitly training on it. And I think that they currently have that first-to-market advantage more than anything. That's not to say that it's bad. That's not to reduce the work that OpenAI has done, because it is phenomenal. But what's really keeping them afloat is being first to market and the ease of use.

One other question I was holding as you were speaking: you mentioned Mixtral and, what is it called, mixture of experts. What's... Yeah, Mixtral. Yeah, it's routing. It's being smart and saying, hey, we don't need a dense feed-forward network for every single thing. Let's have a whole bunch of sparse networks and, just based on the input, route it and tell it which expert is actually going to be the best. It results in much larger models that are smaller on disk and faster to run.
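
Editor's sketch of the routing idea: a small router scores every expert for the current input, only the best-scoring experts run, and their outputs are blended by gate weight. Everything concrete here is a hypothetical stand-in: the experts are simple functions rather than feed-forward networks, the router weights are hand-set, and picking the top two experts is an assumption made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, router_weights, experts, top_k=2):
    """Send the token to its top_k experts and blend their outputs by gate weight."""
    # One router score per expert: a dot product between the token and that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalize over chosen experts only
    output = [0.0] * len(token)
    for gate, index in zip(gates, chosen):
        expert_out = experts[index](token)  # experts that were not chosen never run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

# Hypothetical experts: simple functions standing in for feed-forward networks.
experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [x + 1.0 for x in v],
    lambda v: [-x for x in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per expert
output, chosen = route([3.0, 1.0], router_weights, experts, top_k=2)
```

The saving at run time comes from the comment inside the loop: the total parameter count grows with the number of experts, but each token only pays for the few experts it is routed to.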

Is that more similar to how the human brain works? Because it's obviously not fully connected. It's got different regions and stuff like that. I would love to appeal to that authority. That didn't rock. I don't know, though, because you look at MRIs and you can see, oh man, this portion is lighting up when you're experiencing that emotion or seeing that input. But we don't really have a great mapping of every person's brain.

I think the connection between a neural net and actual neurons was lost a long time ago, right? How does the human brain work, and how does it really compare to modern-day models? It's hard to really make that argument. We're still learning about how we learn. And as we do, and as the neuroscience field advances, that ultimately leads to advances in the AI space and vice versa.

Concluding Thoughts and the Future of AI Research

There are definitely connections there. But yeah, as far as your question goes, I think it's anybody's guess. I think this is a perfect note to end on. A little bit of suspense. We're going to have to get you back at some point when you've finished your book and talk a little bit more about the actual technical problems and challenges. We haven't really touched upon any of that yet, but today I certainly learned a lot from you and I hope a lot of our listeners will as well.

It was an absolute pleasure to meet you both. Thank you so much and see you next time.

Transcript source: Provided by creator in RSS feed