Measuring LLMs with Jodie Burchell

Speaker 1

00:01

How'd you like to listen to .NET Rocks with no ads?

Speaker 2

00:04

Easy?

Speaker 1

00:05

Become a patron for just five dollars a month. You get access to a private RSS feed where all the shows have no ads. Twenty dollars a month, we'll get you that and a special .NET Rocks patron mug. Sign up now at patreon.dotnetrocks.com. Hey, guess what, it's .NET Rocks episode nineteen forty-four. I'm Carl Franklin.

Speaker 2

00:39

And I'm Richard Campbell. Nineteen forty-four. Richard, I'm looking forward to the end of World War Two. Yeah, it's the beginning of the end. So in nineteen forty-four, the Allies launched D-Day, the largest amphibious invasion in history, landing troops on the beaches of Normandy, France on June sixth, marking a turning point. In August, the Allied forces liberated Paris from Nazi occupation. You're welcome. In December... Here's an anecdote to go with D-Day, if you like. Yeah.

01:10

Out of concern for the soldiers on D-Day, they mass-produced penicillin for the very first time. There were two and a half million doses of penicillin made for the D-Day invasion. That is so awesome. So post World War Two, the reason we have antibiotics was that preparation. Yeah.

Speaker 1

01:27

In December, the Battle of the Bulge, the Germans launched a major counter offensive in the Ardennes region of Belgium.

Speaker 2

01:34

Did I say that right? Ardennes? The Ardennes? Yeah? Yep, the Ardennes.

Speaker 1

01:38

But the Allied forces eventually repelled the attack. And in Rome, three hundred and thirty-five Italians were killed in the... Here's another thing I never had to pronounce correctly in high school. A-R-D-E-A-T-I-N-E. Ardeatine, all right?

Speaker 3

01:56

Right?

Speaker 1

01:56

Ardeatine. We're going with that. A-R-D-E-A-T-I-N-E. The Ardeatine massacre, including seventy-five Jews and over two hundred members of the Italian resistance from various groups.

Speaker 2

02:07

So yeah, it's sort.

Speaker 1

02:08

Of the beginning of the end, the unwinding and leading up to the following year, nineteen forty five, which ended it.

Speaker 2

02:17

Right. Yeah. It's also the year that the first plutonium was ever made at the Hanford site in Washington, which will eventually lead to the bomb at Nagasaki. Yeah. And the Harvard Mark One, built by IBM based on a design from a professor at Harvard: thirty-five hundred relays and a fifty-foot-long camshaft, because computers were different back then. Yeah, they were. And famously, because it's a relay-based computer.

02:42

The next version of this, which they call, cleverly, the Mark Two, will have a moth get trapped in one of the relays, which Grace Hopper will find and remove and call the bug, and that will be the first bug, the first bug in the machine. Yeah. We don't use a lot of relays in computers anymore.

Speaker 1

02:59

Yeah. And before we get started with Dr. Burchell, I wanted to just have you comment on the amazing recovery of the astronauts from the space station that happened this past week.

Speaker 2

03:11

Really not that amazing. It was so perfectly, you know, it was an unexpected thing. Butch and Suni are both very experienced astronauts. When there were concerns about Starliner, they sent up the next crew on a Crew Dragon with only two passengers, so they had the two additional seats for them to come back at any time. Yeah. But since they had two extremely qualified astronauts already up, why pay to send them back down when you can put them to work? And in fact,

03:42

they put Suni in charge of the mission. She took over as mission commander for the station for the duration.

Speaker 1

03:48

And she and Butch were happy to stay there. They were like, no, we don't want to come home.

Speaker 2

03:52

Come on. Totally. They were never going to get to fly again. Those are retired astronauts, right? Yeah, so they got a great gig. Now, it's going to take them more than a year to recover, which is also normal for a six-month stay, and they had a nine-month stay. Mark Kelly did a year, and you can read his book on this. Like, recovery is not a trivial thing. Yeah, I was watching him being interviewed. You know, you haven't walked on your feet for nine months, your vestibular

04:17

system's messed up, your eyes have been bent out of shape. Like, it's not a small problem, right, to recover from this.

Speaker 1

04:23

Yeah, I watched them being interviewed on the news when it was happening. It's just still amazing to see that Falcon booster land.

Speaker 2

04:31

Land on its tail perfectly.

Speaker 1

04:33

Always it always is just going to be amazing to me.

Speaker 2

04:36

Yeah, no, it's a miracle. The crazier thing, really, is that Starship booster being caught out of the air. It's literally a twenty-story, two-hundred-ton building that flies, yeah, and they catch it out of the air. So yeah, we are in amazing times. The space industry has been fundamentally changed by this, right? The cost of flight is so much lower. It's hard to even get your head around what's actually going on up there right now. So it's very cool with the proliferation.

05:04

That's a very good experience for me. This week is very I felt very good about it, all right.

Speaker 1

05:08

So yeah, so that's a cue for me to roll the music for Better Know a Framework.

Speaker 2

05:12

So that's awesome. All right, man, what do you got? Our good buddy Simon Cropp, the genius. Simon Cropp, the genius. This guy is just, he's so brilliant. He's brilliant, and he comes up with solutions for things that you didn't even know you needed. Yeah.

Speaker 1

05:33

But this one is called Cymbal. It's a NuGet package and it's an MSBuild task that enables bundling .NET symbols for references with a deployed app.

Speaker 2

05:44

Nice.

Speaker 1

05:44

The goal being to enable line numbers for exceptions in production.

Speaker 2

05:50

Oh okay, that's interesting.

Speaker 1

05:52

Yeah, because I guess you don't get that. Yeah, yeah, and this is this is what it does. So if you're in production you have an exception and yeah, I guess you log it, you're gonna see line numbers, all right, Yeah.

Speaker 2

06:06

That's cool. You got to know he had that problem, right, like, yeah, this is clearly a guy who built the thing to fix a thing that he had, and now we all get to benefit.

Speaker 1

06:14

Another alternative, I guess, is just deploying the debug symbols with it, and now you're slowing things down in production.

Speaker 2

06:20

So yeah, it's a lot more weight than just yeah, you know, use this library.

Speaker 1

06:25

So thank you, Simon. And it's SimonCropp/Cymbal on

Speaker 2

06:29

GitHub continues to be awesome.

Speaker 1

06:31

C-Y-M-B-A-L. Yeah, the musical thing, the musical thing, all right?

Speaker 2

06:35

Who's talking to us, Richard? Grabbed a comment off of show eighteen thirty-five, the one we did with our friend Mads Torgersen talking about the next C#, because we've got a great comment, LLM related. This is from Murray, who said: Mads mentioned making sure language features work with the tooling, such as ordering in LINQ syntax. Increasingly, with Copilot and other LLMs, this is part of the tooling. Yes. True. Obviously this is a year ago,

06:56

this comment, so, you know, so much change has happened. It's challenging. So given a piece of code using a new C# language feature, which is what Mads was talking about, have you tried asking ChatGPT or Copilot or some other LLM to describe how that code works? If it gets it right, does it mean it's intuitive? Is an LLM's intuition, and at least he put that in quotes,

07:19

because there is no intuition in software, a good approximation for the one that human programmers have, or a bad approximation? And if programmers are using Copilot, does it matter whether it's the human's intuition or the LLM's? Let's complicate this further with next year's LLM, that would be now, which will probably be profoundly different. Yes. So, having said all that, it's probably best to just aim for the human and let the LLM catch up. Yeah, no intuition

07:45

in software. The reality is, of course you would expect it to not understand a new language feature. There has to be some time for that language feature to be documented properly. The good news being that they keep regenerating these LLMs on a regular basis, and Microsoft builds these features in public view on GitHub even before it ships.

08:07

It's likely in the knowledge base that is the LLM. Yeah, curiously, you know, in my last trip to Microsoft, talking to folks to see what they're using, they've been using Claude Sonnet 3.7. That's their favorite for working in .NET. Isn't that funny? Fascinating. But, you know, that's where it's at.

08:27

So Murray, you're right, let's focus on the human understanding the language the most, because the software is only going to generate what it's got in its model, and it's up to you to assess it, although admittedly the compiler has a say also. Yes, and a copy of Music to Code By is on its way to you. If you'd like a copy of Music to Code By, write a comment on the website at dotnetrocks.com or on the Facebooks.

08:46

We publish every show there, and if you comment there and we read it on the show, we'll send you a copy of Music to Code By.

Speaker 1

08:50

And if you don't want to wait for that, or you have other ideas and you just want to buy Music to Code By, you can go to musictocodeby.net. Track twenty-two is new-ish, and you can get the entire collection in MP3, FLAC, or WAV for a very good deal. It's a very good price. So happy coding. All right, well, let's bring on Dr. Burchell. Dr. Jodie Burchell is the developer advocate in data science at JetBrains and was previously a lead data scientist

09:22

at Verve Group Europe. She completed a PhD in clinical psychology and a postdoc in biostatistics before leaving academia for a data science career. She has worked for seven years as a data scientist in both Australia and Germany, developing a range of products including recommendation systems, analysis platforms, search engine improvements and audience profiling. She's held a broad range of responsibilities in her career, doing everything from data analytics

09:51

to maintaining machine learning solutions in production. She's a longtime content creator in data science across conference and user group presentations, books, webinars, and posts on both her own and JetBrains blogs. In other words, a slacker.

Speaker 2

10:09

It occurs to me, Jodie, that you and I hang out several times a year at various conferences, but I don't know that Carl's had time with you since we did that show at Techorama. Techorama was the last time I saw you, no? A couple of years ago.

Speaker 3

10:20

Yeah, yeah, exactly, So it's been a long time actually, Yeah.

Speaker 2

10:25

Things have changed. You're at JetBrains now.

Speaker 3

10:27

I have, certainly, I think, changed a lot. Yeah, yes, yeah, yeah, I was at JetBrains when we first met as well, but I think I had only been there just over a year, and so I was still, like, I don't know, a little bit more shy, I think, a little bit less opinionated.

Speaker 2

10:45

You've been hanging around with the troublemakers for a while.

Speaker 3

10:47

Now, yeah, you talking about you?

Speaker 2

10:49

Yeah?

Speaker 3

10:49

Actually, well, and we're going to be hanging out in my hometown of Melbourne next month.

Speaker 2

10:59

Yeah, we're excited about that, yeah, NDC. Yes, so, And of course I've got family in New Zealand, so I've got to do a little time in Sydney to see some folks there, and then I'll be in Melbourne for the show with you, and then a week on the farm hanging with the cows and the cousins and the sheep and the sheep, No sheep, the sheep, what sheeps? The South Island thing? No sheep on the farm. No, no, it's it's it's a dairy farm. Dairy farm. Yeah. And

11:26

by the way, cows are awesome. Sheep are dumb, dumb, dumb, dumb. Holy cow, dumb. But they're tasty. Like how Jodie says they're cute, I say they're tasty. Tasty. That's where my mind is at. The cows are smart enough that if they're actually having distress, you know, in birthing or anything, they will come for help. Wow. Right? Like, they're bright, and they follow the gates and the paths where you want them to go. But it doesn't mean they don't know how

11:53

to open them themselves if they really wanted to. I've seen them do it. Yeah, damn, they're just playing along. Cows are great, they really are. And LLMs are great.

Speaker 3

12:01

Right in the right settings. Yeah, they are great.

Speaker 2

12:06

Yes. But even at that show we did in twenty-three, you know, you were the grown-up in the room there, just tired, like, listen, there are limits. We were so hype-ish in twenty-three, not that it's all calm and rational in twenty-five, but it's so...

Speaker 3

12:21

Funny, actually, because I remember this was the first talk I did on LLMs, so that one at Techorama actually was the first one I ever did.

Speaker 2

12:29

No free lunch.

Speaker 3

12:30

Yeah yeah, yeah yeah, And I was I was actually really scared of getting up and giving my opinion, like being a contrarian. Obviously, I'm feeling so vindicated right now.

Speaker 2

12:39

But it's right, isn't it.

Speaker 3

12:41

It's great being right, but I will say, like, the hype has died slower than I thought it would. So I think DeepSeek finally has spelled the beginning of the

Speaker 2

12:51

End, but not the end of the business, but the end of the hype cycle.

Speaker 3

12:57

The end of the hype cycle.

Speaker 2

12:58

Okay, I appreciate that the approach.

Speaker 3

13:00

To how we're going to be, I guess, manufacturing these models, deploying these models, and thinking about these models fundamentally changed with DeepSeek. So it sort of showed that this hyperinvestment in data centers, which was kicking off with the Stargate project in the US... To explain context to anyone in the audience who doesn't know.

Speaker 2

13:21

It, five hundred billion dollars.

Speaker 3

13:23

An intended five-hundred-billion-dollar investment between OpenAI, the US government, and I think Microsoft was involved, so I...

Speaker 2

13:31

Think Microsoft pulled out of it.

Speaker 3

13:33

It was Oracle. Okay, gotcha. Yeah, yeah, that just got announced.

Speaker 2

13:39

Yeah, there was a little political game here. That was also around the time they sort of announced this: hey, you know, I know we had this deal with OpenAI where everything was going to run on Azure, but we're ready to let that go. I think it was because of Stargate. Yeah, you know, there was sort of this pressure on Microsoft. You have to keep growing, growing, growing, and they're like, this is getting irrational. So if you want to go play with someone else, you knock yourself out.

13:59

So, bringing it back to DeepSeek for a minute. From what I understand, you know, OpenAI and all these other model makers are looking at that and learning from it and figuring out how to make their own models more efficient. And at one point I heard that the Chinese model is, you know, hey, let's spend a lot less money on these things so that they're less expensive. We don't have to use as many processors and all

14:26

that stuff. And I think I heard that, you know, the response from the American companies was, oh no, we're just going to make it ten times, one hundred times more powerful. So, a different kind of mindset. But that was originally. Now I think that there's more of a desire to get smaller LLMs, right, yeah, that are more specialized.

Speaker 3

14:54

The nuance of the story is that basically we've known that there are ways to make neural nets more efficient, right? Like, there are ways of making the models smaller, or, after you've trained them, actually trimming them down and getting the same performance or almost the same performance for

15:13

a much smaller number of parameters. We've also known for quite a long time, and this is true with any machine learning model, that the higher the quality of data, the better the model can perform for a much smaller number of parameters. So this was proven last year, or the year before, with the Falcon models. They were sort of the first big open source ones that were trained on higher quality data sets and got a lot more performance for fewer parameters.

15:38

But the most reliable way to get better performance was to scale. And I think what happened, the story I've heard, is that in China they just couldn't get access to the same size of GPUs because of sanctions. Not sanctions. Basically they weren't being sold in China, and so they had to make do with older and much less efficient processors, and they had to do all these tricks to basically share the training across a bunch of smaller machines.

16:11

So this meant that they just couldn't create absolutely massive models. And essentially this meant that, yeah, they were forced to create a smaller model. But, you know, the thing is the quality of AI researchers and AI engineers that are being employed at companies like OpenAI and Anthropic and companies like this. I'm sure that they knew it was possible. It was just, as I understood it, a

16:38

less reliable path to performance. And you know, the American companies had they had the money and they had the servers to train it, so why not go big?

Speaker 2

16:49

And they understand that race right like they understand build bigger, keep going like it's a very American approach to things. Yes, you can always tune later, right, do your land grab now, but.

Speaker 1

16:59

Also, there's a difference between having one huge model like, you know, ChatGPT that knows everything, has bazillions of nodes or whatever it is, and then can, you know, cross-reference things, right, and connect the dots very much in ways that humans do, but even more broadly. Whereas if you have smaller, less expensive models that are just LLMs trained on specific data, right, you'll probably get more accurate things out of them

17:36

for that particular set you know, that particular context maybe and then be able to have many of those with that have different expertise, but you won't necessarily be able to it won't necessarily be able to connect the dots like a large, huge model can.

Speaker 3

17:53

Right. This can actually lead into a further discussion about measurement if we want. But basically, looking at the current benchmarks that they're using to assess performance of LLMs, DeepSeek and smaller models coming out of China are actually rivaling the performance of larger models. So basically the understanding seems to be that a lot of the parameters that these big models have are not actually being used every single time you try to do, like, inference for a

18:23

particular task. It's only a subset of the parameters. So the way to think about parameters is think about neural nets as like you have inputs and then you have a bunch of neurons that are connected by what are called weights. They're basically multipliers, and you can kind of think about inference as a path that you take through the neural net, where like, you know, the whole thing's going to be used, but only certain weights will actually

18:49

have an impact for particular types of tasks. And it sort of seems that what's happened with scaling down these models is that, because they learned on so much data, and so much of the data seems to have not been high quality, a lot of the parameters were not really being used in the majority of cases. They were just, I see, dead weight.
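
A minimal sketch of the "weights as multipliers" and "dead weight" ideas described above, assuming a toy one-layer network. The numbers, the threshold, and the helper names `forward` and `prune` are all made up for illustration; real models and real pruning methods are far more involved.

```python
def forward(inputs, weights):
    """One dense layer: each output neuron is a weighted sum of the inputs."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weights]


def prune(weights, threshold):
    """Zero out every weight whose magnitude is below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


# Hypothetical weights: two output neurons, four inputs each.
# Half of them are tiny and contribute almost nothing.
weights = [
    [0.9, 0.001, -0.8, 0.002],
    [0.003, 0.7, -0.001, 0.6],
]
inputs = [1.0, 2.0, 3.0, 4.0]

full = forward(inputs, weights)
pruned_weights = prune(weights, threshold=0.01)
pruned = forward(inputs, pruned_weights)

kept = sum(1 for row in pruned_weights for w in row if w != 0.0)
total = sum(len(row) for row in weights)
max_diff = max(abs(a - b) for a, b in zip(full, pruned))

print(f"kept {kept} of {total} weights, largest output change: {max_diff:.3f}")
```

In this toy case half the weights are dropped and the outputs move by about 0.01, which is the intuition behind getting almost the same performance from far fewer parameters.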

Speaker 1

19:14

And so so if you wanted to translate parameters and neurons to language, we're talking about the probability of the next word exactly right that it spits out. Yeah, and what you're saying is that they're only choosing from parameters with higher weights.

Speaker 3

19:31

Yeah, it's it's like or words.

Speaker 2

19:34

With higher weights.

Speaker 3

19:35

Yeah. So basically the way it works is, like, you think about the last layer of the neural net as basically all the words in the vocabulary. So it's obviously really, really huge, and the whole neural net is trying to predict the probability of which of these words is the most likely to come next. So it's basically saying that for a particular input, only a subset of, you know, the paths that go through the neural net are actually going to give good information

20:04

about what the next word is. And so, yeah, it's kind of fascinating because the models are such black boxes. No one fully understands how the decisions are being made. I'm putting decisions in air quotes, I want to make this clear. But interpretability is actually becoming a really hot area in twenty twenty-five. So, actually understanding how LLMs come to the conclusions they come to, or sorry, how

20:34

the predictions are being made. Let's put it in more clinical terms. And that's going to help firstly make the models more efficient, but also demystify a lot of the assumptions we make about the predictions they make. Like, we look at the prediction and we're like, oh, it's solving problems, because if a person did that, it would be showing problem solving. Or the model's more intelligent, because if a person did that, it would be showing more intelligence. But that's just us projecting.
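
The "last layer is all the words in the vocabulary" description above can be sketched as a softmax over one score per word. This is a toy illustration, assuming a four-word vocabulary and made-up scores (`vocab` and `logits` are hypothetical; a real model has tens of thousands of entries and computes the scores from the preceding layers).

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and final-layer scores for the next word.
vocab = ["cat", "sat", "mat", "the"]
logits = [1.0, 3.0, 0.5, 2.0]

probs = softmax(logits)
next_word = vocab[probs.index(max(probs))]

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
print("most likely next word:", next_word)
```

Picking the single highest-probability word is only one decoding choice; real systems usually sample from the distribution instead.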

Speaker 2

20:58

Sure, yeah, anthropomorphization. Now, you know, maybe I'm thinking about this the wrong way, but as soon as you say that, I'm like, hey, there's, like, what, six hundred thousand words in the Oxford Dictionary, that's just English, and most people use fifteen hundred of them. So, oh yeah, yeah, yeah. You know, here you've built this model that has this huge potential range of comprehension and you're using a tiny subset of it depending on

21:21

what you're doing. Especially when we're coming at this from the Copilot point of view, which is like, I'm working on code.

Speaker 1

21:27

Yeah, every symbol in the language is a is a word essentially right.

Speaker 2

21:33

So, but you also talked about performance. And my immediate reaction was, so, what do we mean when we say performance?

Speaker 3

21:42

Yes?

Speaker 2

21:42

Is that speed? Is that a speed measurement or is that an accuracy measurement?

Speaker 3

21:47

Yeah. So to kind of put this in context, I gave a keynote at NDC Porto about all the hairy things that go along with assessing LLMs. So I didn't get into speed. We can come back to that if we get time. But it's more about, like, how do

22:05

people judge if these models are good? And last time we talked and you gave the episode this name, we talked about the concept of there's no free lunch in machine learning, and what this means is there is no there's no one model that will be best for every possible task you can do.

Speaker 2

22:24

Right.

Speaker 3

22:25

But what we've seen with the way people talk about LLMs is they're advertised exactly like this. Like, it's like, oh, OpenAI just came out with the one model, and it is the best model on the market, right? Right. And even if we put, like, engineering considerations aside, let's put cost aside, let's put speed aside, that's still not going to be true.

Speaker 1

22:47

It's like who's the best guitar player in the world?

Speaker 3

22:51

Yes, how do you measure this?

Speaker 2

22:53

That's an impossible question to answer. Well, I think when they were saying best at that time, they were talking about the largest number of parameters, weren't they?

Speaker 3

23:00

Well, what they're talking about is there's this suite of benchmarks that are designed to assess LLM performance. And we talked about this last time. But LLMs were originally designed to be natural language processing task generalists. So they're good at doing a range of natural language tasks, often without further training, out of the box. So they can do things like classification, summarization, they can do translation, things like this.

23:30

So generally, when these models were first designed, they were benchmarked against how well they could do these natural language tasks, like specific things like question and answering, translation, blah blah blah. But as as the capabilities of the models have grown, or maybe they seem to have grown, we don't know. What we started doing is getting them to do things like grade school math problems, or we've gotten them to do suites of questions that are designed to assess problem

24:03

solving or blah blah blah. And then what we do is we collate a bunch of these gold standard measures together and we combine them in such a way, and we create leaderboards and we rank these models and we say, oh, okay, this model is the best because it did the best at the MMLU, which is like a reasoning benchmark, or this one's the best because it did the best at, like, a collated collection of all

24:28

of these benchmarks. So it's doing well on reasoning, and it's doing well on problem solving, and it's doing well on math, and it's doing well on coding. But this is the thing. Like, firstly, a lot of these measures have been found to have serious problems. They've been found to really not measure what they claim to measure in a variety of ways. And the second is, okay, I am an application developer. I want to design an application that uses an LLM. Say I want to make

25:01

a chatbot that can help people plan their holidays. What does it matter to me that an LLM is really good at solving science problems, grade school math problems? Like, is that going to be good for my application?
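
To make the leaderboard point concrete, here is a minimal sketch, assuming two hypothetical models and invented scores. The model names, benchmark names, numbers, and the `rank` helper are all made up for illustration and do not come from any real leaderboard.

```python
def rank(scores, benchmarks):
    """Rank models by their mean score over the given benchmarks, best first."""
    means = {
        model: sum(results[b] for b in benchmarks) / len(benchmarks)
        for model, results in scores.items()
    }
    return sorted(means, key=means.get, reverse=True)


# Invented scores between 0 and 1 for two hypothetical models.
scores = {
    "model_a": {"reasoning": 0.80, "math": 0.90, "coding": 0.70, "holiday_planning": 0.50},
    "model_b": {"reasoning": 0.70, "math": 0.60, "coding": 0.65, "holiday_planning": 0.85},
}

# A collated leaderboard over the general-purpose benchmarks.
leaderboard = rank(scores, ["reasoning", "math", "coding"])

# The ranking on the one task the application actually needs.
for_my_app = rank(scores, ["holiday_planning"])

print("leaderboard order:", leaderboard)
print("order for my app: ", for_my_app)
```

With these made-up numbers the model that tops the collated ranking comes last on the task the application cares about, which is the no-free-lunch point in miniature.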

Speaker 2

25:18

So, got a calculator do you have, like, okay, gotta coverage and.

Speaker 3

25:24

It's probably going to do the math wrong anyway, because they're not symbolically simulating. Exactly. They do that, but...

Speaker 2

25:33

Then you also have it return the response in the form of a limerick.

Speaker 3

25:39

That's fantastic. It's what our customers needed.

Speaker 2

25:41

That's it.

Speaker 3

25:44

So yeah, this is part of the problem. The way we talk about LLMs is we talk about them like they are a thing

25:53

independent of machine learning, but they are absolutely not. And part of the problem with that is it means that the way that we use them is we tend to trust their outputs too much, and we also tend to, you know, not have scrutiny about, like, whether a model is the best fit for our use case. So we don't design assessments to see, like, is this actually doing what it's supposed to do, which we would absolutely do with traditional machine learning.
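
One possible shape for such an assessment is a small set of task-specific cases, each with its own pass/fail check. This is only a sketch under stated assumptions: `stub_model` is a hypothetical stand-in that returns canned text rather than calling a real LLM, and the prompts and checks are invented for the holiday-planning example.

```python
def evaluate(model, cases):
    """Return the fraction of test cases whose output passes its check."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def stub_model(prompt):
    """Hypothetical stand-in for a real LLM call; returns canned text."""
    canned = {
        "Plan three days in Lisbon.": "Day 1: Alfama. Day 2: Belem. Day 3: Sintra.",
        "Which month is cheapest to fly to Rome?": "It varies.",
    }
    return canned.get(prompt, "")


MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Each case pairs a prompt with a check on the output for that prompt.
cases = [
    ("Plan three days in Lisbon.",
     lambda out: all(f"Day {n}" in out for n in (1, 2, 3))),
    ("Which month is cheapest to fly to Rome?",
     lambda out: any(month in out for month in MONTHS)),
]

pass_rate = evaluate(stub_model, cases)
print(f"pass rate: {pass_rate:.0%}")
```

Swapping `stub_model` for a function that calls an actual model would let the same cases compare candidates on the task that matters, in the same spirit as a test set in traditional machine learning.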

Speaker 1

26:23

I have had the experience of using LLMs, you know, both ChatGPT and Copilot, to help with coding things, and I found a situation where I asked it to do something, you know, to write something, and instead of pointing to something in the framework that already did that and saying, why don't you just use this, it just went ahead creating the thing, you know, reinventing the wheel. And then, you know, an hour later, I've got something

26:55

that works. But I'm like, hey, there's something in the framework that works just like this.

Speaker 2

27:00

Yeah.

Speaker 1

27:02

So it's that's why you need a human in the in the equation.

Speaker 2

27:06

Although although where's there a prompt there to say is there a class that does X?

Speaker 1

27:11

Well, that would have been yeah, that's the human error that because that should be the first question. It's like when somebody says, you know, I have an idea for an app, and my first question is, well, first of all, I don't I'm not going to write it for you unless you pay.

Speaker 2

27:25

Me. And second of all, does it already exist? And the answer is usually yeah. If it's really that good of an idea, somebody else has done it. Okay, well, why don't we take a break now. I want to dig into some of these evaluation strategies.

Speaker 1

27:39

All right, we'll be right back after these very important messages. Stay tuned. Do you have a complex .NET monolith you'd like to refactor to a microservices architecture? The Microservice Extractor for .NET tool visualizes your app and helps progressively extract code into microservices. Learn more at aws.amazon.com/modernize.

Speaker 2

28:05

And we're back. It's .NET Rocks. I'm Richard Campbell, that's Carl Franklin, talking to Dr. Jodie Burchell. Hi. And if you don't enjoy those ads and you'd like an alternative, we do have a Patreon that provides an ad-free feed. Go to Patreon and check it out: patreon.dotnetrocks.com. Yeah, so I found the DeepEval site that talks about MMLU. But nearest I can tell, this is just a set of questions in different topic areas.

Speaker 3

28:32

Yes, yes, so let's talk about benchmarks. So there is a very famous leaderboard called the Hugging Face Open LLM Leaderboard, something like that. Okay. So Hugging Face is a company. They're based in France, and basically what they do in their open source branch is provide access to all of the major open source what are called foundational models. So big LLMs that are open source, computer vision models, those that can generate audio, you know,

29:15

do transcription, all these sorts of things. And so Hugging Face take the open source models. They run these models against a suite of benchmarks and then they collate them. And they used to have a leaderboard up until June last year. This was the first version, and it included scales like HellaSwag and the MMLU. So it got retired for a couple of reasons. But one of the reasons it got retired is people started

29:46

going through the questions, and MMLU was bad. It had a few questions that literally were like, I think one of them was something like "the continuity of the theory." That's the full question. And then it was a bunch of multiple choice answers that were just lists of numbers. Like, that was the question. And think about it: the gold standard is a human, so a human's meant to be able to answer this, and then you rank how well the LLM goes. And I'm...

Speaker 2

30:15

Like, nobody can answer that.

Speaker 3

30:17

What does this mean?

Speaker 2

30:18

What does it even mean?

Speaker 3

30:20

What does this mean? But my favorite, my favorite, my favorite was HellaSwag. So apparently HellaSwag, I think, was made using Mechanical Turk, so they got people to generate the questions and then validate them. But clearly whoever picked up this task was, like, not particularly invested. Like, you know, they're not getting paid a lot of money, they probably didn't care, right? And I have actually an

30:42

article with some of my favorite absolutely bizarre HellaSwag questions. Okay, now keep in mind I am reading this out as it's written. Okay. So we have a question, and we have a bunch of multiple choice answers, and what the LLM is supposed to do is complete the scenario. So it's meant to pick the option that has the most, you know, fitting scenario end. Okay, so I've got one for you. Man is in roofed gym weightlifting. Woman is

31:15

walking behind the man watching the man. Woman is: A, tightening balls on stand on front of weight bar. B, lifting eights while he man sits to watch her. C, doing mediocrity spinning on the floor. D, lift the weight lift man.

Speaker 2

31:37

That doesn't make any sense.

Speaker 3

31:39

It doesn't, and probably around a third of the questions in HellaSwag were this garbage.

Speaker 1

31:44

I just want to know what mediocrity spins are. I want to do that, and I just don't know because I don't.

Speaker 2

31:51

Know the definition.

Speaker 3

31:52

That's every time I turn around and knock something off a shelf with my clumsy hair. That was twenty twenty. That was mediocol for the child.

Speaker 2

32:04

Yeah, there's been some times. So I mean this just seems lazy then, like the, well, let's back up. Is asking questions of an LLM actually a good way to measure its effectiveness?

Speaker 3

32:16

Now, yes and no. So you can create well defined problem suites if you have a good idea of what you're assessing. So this is basic measurement theory, right? It's like, we learned this in psychology. It's tricky with LLMs because we have a tendency to extrapolate too much. We try to project what their performance would mean if a human did that, and we can't do it because LLMs do not have what's called fluid intelligence or general intelligence, right?

32:49

They have what you could essentially call crystallized intelligence, which is that they have a bunch of little templates of how things work based on scenarios they've seen before. They can pattern match questions they see against this. So you've got to be really careful about how far you extrapolate from them doing pattern matching to them showing intelligence, right? But it is possible. Let's say you want to assess how well they do specific tasks, like they can answer

33:18

questions about history or whatever. That's fine. I think that's fine to assess. It gets tricky because there are two main problems with using questions other than the one I've just said. The first is that the answer type that an LLM is presented with actually impacts their performance. So most of these measurements use multiple choice questions, and the reason that they do that is because it's much easier to score because they're essentially ways of seeing, you know,

33:50

the probabilities of words I was talking about. You can quite accurately tell what's the highest probability sequence that it would have ended up predicting based on, you know, the ones that it's presented with. So you know, it's much, much easier to work this out. But you can also get them to generate answers, and generating free form answers is really hard to assess unless you're getting humans to actually compare it to a gold standard, because the statistical

34:15

ways we have of comparing two sequences are imperfect. So most of the time people will use these multiple choice answer keys. But the problem is that LLMs seem to do a lot better when they're presented with multiple choice answers compared to free form answers. Sure. And the reason seems to be that it's a lot easier to just pattern match to something they've already memorized

34:40

as opposed to having to generalize a bit more. And then the second big problem is that LLMs are ridiculously sensitive to the format of the prompt template you use. We've already talked about this, like did you tell them to use a framework that already exists? But it's so much more subtle than that. So using a different placement of punctuation, using different spacing, this can impact the performance

35:11

of LLMs on tasks by like thirty, fifty, seventy percent. Wow. Yes, yeah. And like, why? It seems to be, again, pattern matching. So if that particular formatting is closer to something that it's seen already in training, it's more likely going to be able to get it right.
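The multiple choice scoring described here can be sketched roughly as follows. This is a minimal illustration, not how any particular leaderboard implements it: the `TOY_LOGPROBS` table, `toy_token_logprob`, and the question and options are all made-up stand-ins for a real model's token log-probabilities, and splitting on whitespace is an assumed simplification of real tokenization.

```python
import math
from typing import Callable, Dict

# Hypothetical per-token log-probabilities standing in for a real model.
TOY_LOGPROBS = {
    "paris": math.log(0.60),
    "london": math.log(0.20),
    "rome": math.log(0.15),
    "berlin": math.log(0.05),
}


def toy_token_logprob(prompt: str, token: str) -> float:
    """Toy scorer: ignores the prompt and looks the token up in a fixed table."""
    return TOY_LOGPROBS.get(token.lower(), math.log(1e-6))


def score_options(
    prompt: str,
    options: Dict[str, str],
    token_logprob: Callable[[str, str], float],
    length_normalize: bool = True,
) -> Dict[str, float]:
    """Score each answer option by the log-probability of its tokens."""
    scores = {}
    for label, text in options.items():
        tokens = text.split()  # assumed whitespace tokenization
        if not tokens:
            raise ValueError(f"option {label!r} is empty")
        total = sum(token_logprob(prompt, tok) for tok in tokens)
        # Length normalization keeps longer options from being penalized
        # just for having more tokens.
        scores[label] = total / len(tokens) if length_normalize else total
    return scores


def pick_option(
    prompt: str,
    options: Dict[str, str],
    token_logprob: Callable[[str, str], float],
) -> str:
    """Return the label of the highest-scoring option."""
    scores = score_options(prompt, options, token_logprob)
    return max(scores, key=scores.get)


QUESTION = "The capital of France is"
OPTIONS = {"A": "London", "B": "Paris", "C": "Rome", "D": "Berlin"}
chosen = pick_option(QUESTION, OPTIONS, toy_token_logprob)
```

Nothing has to be generated and then graded by a person here. The harness only compares numbers the model already produces, which is why this format is so much cheaper to score than free form answers.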

Speaker 2

35:32

So all I got to do is ee cummings my prompt and it just.

Speaker 1

35:40

Exactly. It's just, Richard invents a new verb.

Speaker 2

35:45

Yeah. But you know, an interesting point: anytime you want to remind a person that this software is not intelligent, it's to recognize that this is pattern matching. The fact that, as a human, I can hand you an only lower case, no punctuation statement or a perfectly punctuated statement, and you'll see it as exactly the same, just one lazier than the other. But the software treats it completely differently.
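One way to make this format sensitivity visible is to run the same questions through several prompt templates and compare accuracy. The sketch below assumes a model is just a function from prompt string to answer string. The templates, the questions, and `brittle_toy_model` (a deliberately fragile stand-in that only recognizes one exact layout) are all hypothetical and exist only to show the shape of the measurement.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical templates that differ only in casing and spacing.
TEMPLATES = {
    "plain": "Question: {q}\nAnswer:",
    "lowercase": "question: {q}\nanswer:",
    "no_newline": "Question: {q} Answer:",
}

# Hypothetical question and gold-answer pairs.
ITEMS: List[Tuple[str, str]] = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
]

_TOY_ANSWERS = dict(ITEMS)


def brittle_toy_model(prompt: str) -> str:
    """Stand-in model that only 'recognizes' one exact prompt layout."""
    prefix, suffix = "Question: ", "\nAnswer:"
    if prompt.startswith(prefix) and prompt.endswith(suffix):
        question = prompt[len(prefix):-len(suffix)]
        return _TOY_ANSWERS.get(question, "unknown")
    return "unknown"


def accuracy_by_template(
    model: Callable[[str], str],
    items: List[Tuple[str, str]],
    templates: Dict[str, str],
) -> Dict[str, float]:
    """Accuracy of the same model on the same items, per prompt template."""
    results = {}
    for name, template in templates.items():
        correct = sum(
            model(template.format(q=q)).strip() == gold for q, gold in items
        )
        results[name] = correct / len(items)
    return results


def spread(results: Dict[str, float]) -> float:
    """Gap between the best and worst template."""
    return max(results.values()) - min(results.values())


results = accuracy_by_template(brittle_toy_model, ITEMS, TEMPLATES)
gap = spread(results)
```

A real model will not be this extreme, but reporting the gap across templates alongside a single headline score gives a more honest picture than one number from one template.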

Speaker 1

36:12

Exactly. Do you guys know the story of the moment that Bill Gates went nuts over ChatGPT and began to trust it, and his mind was blown over it? So Richard, I sent you a link in the chat, you can post it there. This is the story, and I heard about this story on the radio. So the story from this CNBC dot com thing: Bill Gates watched ChatGPT ace an AP Bio exam and went into, quote, a state of shock. And this was August eleventh,

36:48

twenty twenty three. So, but what you don't know, and I don't even know if they say it in this article, I don't think they do. But a couple of months before is when Sam Altman actually showed Bill ChatGPT, and he added a couple of things and he said, you know what would be a great test, Sam, is if we could give it the AP Bio test. And it aced it. And then Sam goes home and two months later brings it back and it aces the exam. So what do you think happened in those two months?

Speaker 3

37:27

What a mystery.

Speaker 2

37:29

It's so strange, I can't imagine. I'd also point out that an AP exam is largely multiple choice.

Speaker 1

37:34

There you go, Well, I mean I thought it was I thought they were essay questions. There were five questions, but I don't I didn't read that part, but I heard that they were five five essay questions. Anyway, they did not say that in the article, but that that apparently happened.

Speaker 2

37:53

It all depends on what you.

Speaker 3

37:54

Train it on, right, Yes, and this is actually a bigger issue. So this is an issue that's called data leakage, again, well known problem in machine learning. It's when your model gets access to the test set during training that it can basically learn the answers and well, the implication from Carl is that this may not have been an accident this time. But you know, we don't have a clear idea of what's in the training data for a lot of these models. Even open source models now are being

38:25

super cagey about what's in their training data. So they say it's a competitive advantage. But we know from experiments people have done that even benchmarks have ended up at least partially leaking into the data. So we know that a lot of these companies will optimize for benchmarks. They'll keep training the models until or not keep training them, but they'll keep tuning them until they do well on benchmarks.

38:49

But even accidentally, because they're just scraping the open internet, sure they've accidentally shoved a bunch of these questions.

Speaker 2

38:57

Which is probably where the benchmarks came from in the first place. Exactly. So eventually you're going to meet up with the data. It doesn't seem surprising at all.
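A rough way to look for this kind of leakage, assuming you have at least a sample of the training text (which, as noted above, you often do not), is to check how many word n-grams from a benchmark item also appear verbatim in that text. The function names, the example strings, and the window size of 8 below are all illustrative assumptions, not a standard.

```python
import re
from typing import Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Share of the item's n-grams that also occur verbatim in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical benchmark question and two hypothetical scraped pages.
ITEM = "The quick brown fox jumps over the lazy dog near the river bank"
LEAKED_PAGE = (
    "Scraped page: the quick brown fox jumps over the lazy dog "
    "near the river bank. More text."
)
CLEAN_PAGE = "A completely unrelated page about real estate listings and CRM exports."

leaked = overlap_fraction(ITEM, LEAKED_PAGE)
clean = overlap_fraction(ITEM, CLEAN_PAGE)
```

A high fraction suggests the question was seen during training. A low one proves little, since paraphrased or translated copies would slip past an exact match like this.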

Speaker 3

39:06

So yeah, the modern suite of, like, benchmarks that started being created last year, they started making them private to mitigate this. But it doesn't mean that, you know, you as a consumer, you're a lay user of an LLM. Maybe not a lay user. You might be a bit more technically advanced, but none of us here are AI researchers, right? Right. And so we might be not far from that: an informed consumer, let's put it that way.

Speaker 1

39:40

But you know, I just found it. I'm sorry to interrupt, I just found it. The story is that, you know, Gates issued what he believed to be a rather difficult challenge to Sam Altman: bring ChatGPT back to me once it could exhibit advanced human level competency by achieving the highest possible score on the AP Bio exam. And so two months later, oh, magic, OpenAI's developers came back and Gates watched it get the top score of five on the test. So yeah, so there it is,

40:13

right in black and white that actually happened. And as I was hearing this, I was like, you idiot, why didn't you just say give it to me when it can answer a test question and don't tell them what test? Yeah, you know a test question and then just do it.

Speaker 2

40:32

Try it.

Speaker 3

40:33

Yeah, I don't know.

Speaker 2

40:34

Far be it from me to call Bill Gates an idiot. Did I actually do that? But you know, there's this confirmation bias situation you can put yourself into. Yeah.

Speaker 3

40:43

And this is the thing too. Like, I don't blame people for feeling enchanted by the models, like there is something so human feeling about them because they're echoing back our humanity. Yeah, but you always need to be cautious, and like we were trying to do at the beginning, like you see, we slip into anthropomorphizing the models even though we know better.

Speaker 2

41:06

For sure, because it's easy.

Speaker 3

41:07

It is easy. But really, like even with the latest benchmarks trying to assess AGI, this one called ARC-AGI that OpenAI's o3 actually did very well on, got seventy percent late last year. This is still just pattern matching, but pattern matching in a more organized way. It's basically the model has more of an ability to sort of sort through which patterns might be the best to apply. But again, we're just talking about a more systematic application

41:46

of crystallized intelligence. We're not talking about generalizability yet.

Speaker 2

41:50

Yeah. And I mean, the more I read, the less I'm concerned about the AGI side of the equation. It seems more and more like a marketing term to hire more people to work at OpenAI.

Speaker 1

41:59

Yeah, it's only a fluid term that keeps changing. The definition keeps changing.

Speaker 3

42:05

But how do you assess AGI? Like, I don't know if we talked about this last time, because I had that in my first talk. But you know, how can you even assess the gap between what a model knows and, you know, a task, so like the difficulty that a model would have doing that task based on what it already knows, and then standardize that across a bunch of different models that have potentially been exposed to very different tasks and knowledge? Like, sure, it feels like such a difficult.

Speaker 2

42:35

Challenge, and it's way too broad, and ultimately I feel like it's a distraction from the fact that we're just trying to be engineers making, using a useful tool. And I mean, I led off this conversation talking about the fact that I always ask folks like, what one are

42:52

you using right now? What are you enamored of? And the fact that, you know, I had sort of a universal "everybody likes Claude right now," it's like, why? What is your innate benchmark that made you switch to this, or is it just a social pressure thing? Vibes, man. Because that smart person was using Claude, now I'll use Claude, and then there'll be some nice confirmation bias there. Well, yeah, no, it seems to be doing the thing. Is it actually better than the ChatGPT information?

43:15

I don't know, how would you measure that? So we're in this loop, and I don't feel like I can get a new version of anything from any of these folks, a new OpenAI, a new Claude, or any of these, and say, okay, is it worth switching? Yeah? I mean, I know they want me to. I know it's invariably more expensive, but is it better?

Speaker 3

43:37

Yeah? Look, I have a prediction that here we go. I'm going to do a prediction why not?

Speaker 2

43:43

Why not? Why not? Why not?

Speaker 3

43:46

I'm going to say probably in a year's time, the landscape of providers is going to look quite different. Oh, definitely. And it's because the advantages of using smaller models just drastically outweigh using bigger ones. They're cheaper, they're more environmentally friendly, they can be more specialized, like it's easier to tune them so that you can

44:08

focus them on specific tasks. And yeah, ultimately they're just you know, it's easy to control what happens to your data.

Speaker 2

44:15

So right, that's a big one.

Speaker 3

44:18

That is a big one.

Speaker 1

44:19

It's a big one, especially with something like DeepSeek. You know, the only way I'm going to run that is on my own network not connected to the internet.

Speaker 2

44:27

Well, they do offer a local version. Yeah, they do, right? Which Nvidia has been benchmarking with a fair bit, I noticed, which I thought was cool, like a smart thing to do, because there's lots of folks saying, no, don't use the Chinese LLM. But yeah, the fact that Nvidia just said.

Speaker 1

44:45

That's the same reason they're saying don't use TikTok, right, so they don't trust it more or less?

Speaker 2

44:49

What could happen I don't.

Speaker 3

44:51

Use Yeah, I don't use TikTok just because I'm deeply uncool, Like it makes with you.

Speaker 2

44:58

I'm so with you.

Speaker 3

44:59

I actually had to make a TikTok. I was at a workshop for my job like two months ago, so you know, I'm an advocate. So they're like, hey, let's teach these old people to make TikToks. Nice. And I made this TikTok with Michelle, you know Michelle, Richard, Michelle Frost? So yeah, yeah, yeah, she just started with JetBrains. And so we made this TikTok with another of our colleagues and it involved Wilbur, her dog, and it was

45:23

just like it was so bad. And then they're like it's awesome, like you should go on TikTok right now, and I'm like no, I'm deeply ashamed, like.

Speaker 2

45:34

I should see.

Speaker 3

45:35

This was bad.

Speaker 1

45:39

I do have to confess that I have a TikTok account Carlotphoenix dot com. I have not used it yet for anything more than hey, I'm here, and I certainly don't scroll TikTok. I have so many better things to do than to scroll inane, insane, crazy music videos of people doing stupid things, and dogs and.

Speaker 2

45:58

Cats are also. You know, what's interesting about TikTok is you're not really picking the content, they're picking the content. Yes, yeah, they are watching your loiter time, so it's your behavior that's only selecting the content. But you know, it is a different mechanism there where you can't really curate a list or build a social graph. That's not up to you. And I find that interesting, right? Like, we literally are handing over our attention to something else that's driving it.

Speaker 3

46:28

Yeah, I do have to admit I'm so curious about their recommender, though.

Speaker 2

46:32

Yeah, well, as a technologist, right anything, like well, because that's the thing that they're all upset about. This is our secret sauce, and we'll be keeping it to ourselves, thanks very much. Oh, it's well, here's the thing. Is there a way when you see something that you don't like to say, give her to death. I don't want to see this again? No, because it's too late to scroll past it. You've already scrolled. The problem is that before you found out you didn't like it, you watched it.

46:57

You know, it's the old old I'm trying to improve the quality of my diet by eating everything and deciding what I like. Yeah, there is no nutritional label on any of these things.

Speaker 3

47:13

Sorry, it's called democracy.

Speaker 2

47:15

Okay, so that's what you want to call it. The only person who doesn't get a vote is the viewer. I'm not cynical at all. I don't know what you're talking about.

Speaker 1

47:28

What you remind me of David Mitchell, the British comedian who just every once in a while will just go off on a rant.

Speaker 2

47:36

Just start and he'll keep me going. I'm fine, everything's fine. Okay. Yeah, we're still at this core issue of how do I select an LLM for my app? I mean, part of it is the running context, or I can go down the cost side, and I can go down the, does it, you know, integrate well with my, do I need cloud access? Or can I run local?

Speaker 3

48:02

Like?

Speaker 1

48:02

There's all those decisions. We need an LLM to answer this question.

Speaker 2

48:06

I don't think.

Speaker 3

48:09

I have something even better. I've got a blog post. Hey, all right. So I will share this with you so you can share it with the audience. But when I was writing my talk, I came across this absolutely phenomenal blog post. So this guy's an AI engineer called Hassan Hussein. So this guy works in an AI consultancy. So, exciting job these days: he goes out and he needs to basically build applications for companies that use AI.

48:39

And one of the jobs that he talks about is he and his company were hired to build a chatbot for real estate agents. So basically, they wanted the real estate agents to be able to type in natural language, like, give me the contact details for everyone in this area, whatever, you know, and then the LLM would generate a query to a CRM, something like that, and return the information.

49:03

So when they first started building the app, they, like, they picked a good LLM based on the leaderboard, good one, and then they wrote the initial prompt templates and then, you know, everything looked good, and then things started not working on the edge cases, so they made the prompt a bit more elaborate, and then the prompt started getting really unwieldy, and then they realized the only evaluation metric they had was vibes and they were like, really, this

49:28

is a mess. So he set out in this really interesting way how they actually went back to ground zero and they started again, and he said, like, basically, we realized we needed a tiered assessment. So he said, like, the first tier of assessment is unit tests, Like it seems really obvious, right, But he's like, the thing is is like, because it's nondeterministic, you're not going to have

49:54

one hundred percent pass rate on your unit tests. So you need to determine what error rate you're happy with, and that's going to require a bit of experimentation.
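The idea of a unit test that passes "often enough" rather than always can be sketched like this. It is only an illustration under stated assumptions: `every_nth_fails` is a made-up deterministic stand-in for a check that would really call the model and inspect its output, and the trial count and the 0.95 threshold are arbitrary example values, not figures from the blog post.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], trials: int) -> float:
    """Run the same check repeatedly and return the fraction that passed."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if check())
    return passed / trials


def meets_threshold(
    check: Callable[[], bool], trials: int = 100, required: float = 0.95
) -> bool:
    """True if the observed pass rate reaches the agreed error budget."""
    return pass_rate(check, trials) >= required


def every_nth_fails(n: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a flaky model-backed check.

    A real check would send a prompt to the model and assert something
    about the response; this one simply fails on every n-th call so the
    example is reproducible.
    """
    calls = {"count": 0}

    def check() -> bool:
        calls["count"] += 1
        return calls["count"] % n != 0

    return check


observed = pass_rate(every_nth_fails(10), trials=100)
strict_ok = meets_threshold(every_nth_fails(10), trials=100, required=0.95)
relaxed_ok = meets_threshold(every_nth_fails(10), trials=100, required=0.90)
```

The design choice is that the threshold is an explicit, reviewed number. The test suite fails when the pass rate drops below it, not when a single run goes wrong.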

Speaker 2

50:03

But you also have to accept the level of error rate, like you're not getting all green. Exactly.

Speaker 3

50:07

Exactly, so it might be like you just need ninety five or ninety nine percent or whatever to pass whatever looks realistic. But you know, an example of unit tests is let's say the query from the user was return me the phone number of you know, Jane Smith or you know someone like that, and then basically what you're going to expect from the CRM is a phone number, So you can write a unit test for that. You know,

50:31

it's basic engineering. And then he said, you know, you can create a suite of manual evaluations, so you basically look at the traces of how the LLM is interacting with the users and the rest of the system, and you manually evaluate that. And you don't have to keep doing that forever because then you can use a new method called LLM-as-a-judge, where you get another LLM to also do the same assessments, and you try to

50:59

get them to converge. And once you have a relatively strong sense that the LLM is giving similar assessments to your human, you know, you need to check in on it from time to time to see if it's okay, but that, you know, takes over that part of the assessment. And then, you know, you can go up to your normal kind of higher level assessments like

51:19

A/B testing. You know, really it's just a normal engineering system, and you can create a feedback loop where you can, you know, refine your prompts or fine tune models, or use different models that maybe are smaller or cheaper and see whether you can get the same sort of performance. So you know, obviously you're going to need to just pick a model to start with. You might be able to get a sense of whether it's good for chatbot applications in this language, you know, do

51:46

your research on that. But this really shows me, like it's just it's so obvious, right, like this is how we do monitoring.
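The "get them to converge" step for LLM-as-a-judge amounts to measuring agreement between the human reviewer's labels and the judge model's labels on the same traces. A minimal sketch, assuming both are simple pass/fail labels, is below. The label lists are invented for illustration, and Cohen's kappa is one common choice of agreement statistic rather than something the blog post is known to prescribe.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of traces where the judge gave the same label as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    n = len(human)
    observed = percent_agreement(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own base rates.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four reviewed traces.
HUMAN = ["pass", "pass", "fail", "fail"]
JUDGE = ["pass", "pass", "fail", "pass"]

agreement = percent_agreement(HUMAN, JUDGE)
kappa = cohens_kappa(HUMAN, JUDGE)
```

Raw agreement can look flattering when almost every trace is a pass, which is why a chance-corrected number is worth tracking alongside it before letting the judge take over, and again at each periodic check-in.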

Speaker 2

51:56

Yeah, and it's, I'm sorry, this looks too adult for me.

Speaker 3

52:00

Well, I know, it looks like a lot of hard work.

Speaker 2

52:03

Literally, you actually have to work at building a decent testing framework specific to your case. I know, I wanted a happy button. Jodie can be sad, I want a button. Yeah.

Speaker 1

52:18

We recorded a show with Spencer Schneidenbach, which is actually next week's show.

Speaker 2

52:23

We recorded it a couple.

Speaker 1

52:24

Of days ago, so we have the benefit of future looking here, and we talked about some of these things with him, and, you know, the comment came up, and I think it was even me or Richard, I can't remember who, but you know, we used to be programmers that, you know, we have a bug, we fix it. Now the program is one hundred percent accurate. And now, I mean, it's even more like we're psychologists now instead

52:53

of scientists. You know, we make some suggestions, we examine the output, you know, we think about it a little bit and it doesn't seem quite right. We ask some more questions, examine the behavior.

Speaker 2

53:03

You know.

Speaker 1

53:04

It's like, if these things are going into our software, I have a little trepidation about that just because of the inaccuracies. Even if it's even if something is ninety nine percent accurate. That's that's a bug. That's a one percent bug.

Speaker 2

53:21

Yeah, and one you probably can't pin down.

Speaker 1

53:23

And one you probably can't fix.

Speaker 2

53:25

Use probabilistic tools, get probabilistic results. Yeah, exactly.

Speaker 3

53:32

Look, it's funny because I'm probably so much more comfortable with this than any of you, because I'm like, hey, this is just how stuff works.

Speaker 2

53:39

That was how machine learning always worked. When we talked to you in twenty three, you'd been doing this for years, and it's like, you do the testing and there is no one hundred percent. Exactly. Yeah, you get in the mid nineties, you should feel good.

Speaker 3

53:50

Yeah, yeah, well sometimes suspicious, it depends sometimes too quickly. But yeah, I think it's an uncomfortable new reality. And you know, it's something I've observed for years when you know, you bring engineers into the world of machine learning, and it is a deeply uncomfortable thing not knowing that something is one hundred percent deterministic. I think the main problem is is it's one thing to say have a system

54:21

that otherwise works totally in a deterministic fashion. So let's say you've got some sort of system that, say, takes in queries or takes in numbers from a user, let's say, like nutrition numbers for a piece of food or something, and then you have a machine learning model that generates a prediction that may be within a certain band of correct.

54:44

It's more difficult when you're talking about an LLM being an actor as part of that system and generating pieces of code that will then run that system, and then generating error in that way is actually quite consequential.

Speaker 2

54:57

I just like that phrase certain band of correct.

Speaker 3

55:03

We call it, we call it a confidence interval. Actually, how confident am I that this is correct?
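One place a confidence interval shows up concretely in this discussion is around a measured pass rate or accuracy: 95 passes out of 100 runs is an estimate, not the true rate. A minimal sketch using the Wilson score interval follows. The choice of the Wilson method and of z = 1.96 (roughly a 95% interval) are assumptions made for the example, and the counts are invented.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion such as a test pass rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical results: same observed rate, different sample sizes.
small_low, small_high = wilson_interval(95, 100)
large_low, large_high = wilson_interval(950, 1000)
```

With the same observed 95 percent, the interval from 1000 runs is noticeably tighter than the one from 100 runs, which is a practical argument for running an evaluation suite more than a handful of times before quoting a number.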

Speaker 2

55:13

But you know, I think as a developer, when you're talking to leadership that want you to use these tools, thinking they're going to provide an advantage, one of their educations is that these are nondeterministic models and there will always be a certain level of uncertainty, and if you're not good with that, we don't get to use these tools.

Speaker 1

55:28

Yeah.

Speaker 2

55:29

Yeah, yeah, that's right.

Speaker 1

55:31

So you know, the I guess first analysis you should do in your business is what level of uncertainty are we comfortable with?

Speaker 2

55:39

Can we tolerate? You know, what is the benchmark that we're shooting for? Well, and then the other side of this is the consequences of the uncertainty, of the mistake. Like, what happens? Right, are people going to die? You know, I see this over and over again, where like the first case of an LLM in an organization is with an HR system. So it's totally internal. And part of that is because the consequence of being incorrect

56:01

is minor. You know, yes, you're going to make somebody angry if you tell them they have more vacation days than they do, but you probably haven't cost a company a lot of money.

Speaker 1

56:09

Well, and also there's a human there to make sure that the you know, accurate information gets given to the person.

Speaker 2

56:17

I like your optimism, but yes, I would hope, you would hope. But you know, you get the point, like, there is a bunch of ways to manage this uncertainty. So there's going to be a new corporate title job, and it's going to be a nondeterministic compensator. Oh, I think that's from, I think that's from Back to the Future. You're thinking CUO, Chief Uncertainty Officer.

Speaker 1

56:44

Chief Uncertainty, that's good. I think you're thinking of uncoupling the Heisenberg compensators.

Speaker 2

56:49

There you go, that's Star Trek, Star Trek.

Speaker 3

56:51

Yeah, bouncing all over the place, such geeks.

Speaker 2

56:58

I found the blog post from Hassan and I'll include it in the show, or from Hamel, Hamel Husain: "Your AI Product Needs Evals." And it's exactly the way you describe it. Building unit tests, doing model evaluation, doing A/B testing. This seems like a real concrete approach to just, how do we at least be able to look people in the eye and say we've done our best to test this and have some certainty around it.

Speaker 3

57:25

Yeap, And well, I think what I like about it is it's not unfamiliar territory for engineers. This is exactly what you've all been doing for decades now, Like this is just monitoring well, or at.

Speaker 2

57:40

Least should have been doing. This is looking like the testing we do on software.

Speaker 3

57:44

But this is the thing. It's no one's fault that's an on-the-ground developer, because the way these models are sold is that, no, they're magic, like they are different to everything else. They are certainly not. They are the same as any other machine learning model, except slightly more problematic because you're probably involving them in critical parts of generating code.

Speaker 2

58:05

Just be careful and measure please be careful. Yeah. So so, doctor Burchell, what's next for you? What's in your inbox? I?

Speaker 3

58:14

As I said, I'm heading down to Melbourne in a month for NDC. I'm going to be giving this talk, actually, the one I gave at Porto, and I'm going to be giving one that I gave in Oslo just about the psychology of LLMs. If you will not be with me in Australia, you can also watch that on YouTube on the NDC channel. At the moment I'm laying kind of low. I'm actually going for my German citizenship. So nice. Yeah, I gotta do my citizenship test in June.

58:42

Just did my language test couple of weeks ago, and like everything in Germany, it takes months, so I may be able to apply by the end of the year.

Speaker 2

58:49

So you got to ratchet up your complaining too.

Speaker 3

58:52

I laughed, actually so hard, because a friend of mine did her exam and one of her writing tests was to write a letter of complaint.

Speaker 1

59:05

Well, it started because before you came on, Richard, Jodie says, how you doing? And I said, I can't complain, but I do anyway. She says, very German.

Speaker 2

59:17

Awesome, all right, thanks Jodie, really appreciate it. Yeah, thank you. What a great conversation, all.

Speaker 3

59:22

Right, Always always a pleasure, Okay, and.

Speaker 1

59:24

We'll talk to you next time on dot net rocks. Dot net Rocks is brought to you by Franklin's Net and produced by PWOP Studios, a full service audio, video and post production facility located physically in New London, Connecticut, and of course in the cloud online at PWOP dot com. Visit our website at d O T N E t R O c k S dot com for RSS feeds, downloads, mobile apps, comments, and access to the full archives going back to show number one, recorded in September two thousand

01:00:17

and two. And make sure you check out our sponsors. They keep us in business. Now, go write some code, See you next time.

Speaker 2

01:00:25

You got Jack Middle Vans and

Transcript source: Provided by creator in RSS feed: download file
