Exclusive: How GPT-5 Actually Works

Speaker 1

00:02

Cool Zone Media.

Speaker 2

00:05

Hi, my name's Ed Zitron, and welcome to Better Offline. This is also Jackass. So you've just had a cheery two-part chucklefest about how generative AI may tank our markets and our economy. So I'm going to give you a lighter one, an episode about GPT-5, which is a model from OpenAI, and why just under three years of hype have led to the software equivalent of the launch of St. Anger, except every time Lars hit

00:38

the snare drum, it cost them fifty-five thousand dollars. Now, if we look at the positive reviews, we see takes ranging from Simon Willison's tempered remark that GPT-5 is "just good at stuff" to SemiAnalysis's completely insane statement that GPT-5 is setting the stage for ad monetization and the OpenAI ChatGPT super app.

01:00

In a piece that makes several assertions about how the router that underpins GPT-5 is somehow the secret way that OpenAI will inject ads, which is just distinctly silly. I'll get into this in the episode a little bit, but with everything you're going to hear, you're going to realize that this is just someone saying stuff. Took four bylines to do that shit too. I'm also British. I'm gonna say "rooter." I might say "rowter" as well, because I've been here a while. Make fun of my

01:24

voice if you really must. But with that out of the way, here's a quote from SemiAnalysis's coverage: "Before the router, there was no way for a query to be distinguished, and after the router, the first low-value query could be routed to a GPT-5 mini model that can answer with zero tool calls and no reasoning. This likely means serving this user is approaching the cost of a search query." This does not make any sense. None of this makes sense. It's just a bunch of assumptions.

01:50

Why would this be the case? The article also makes a lot of claims about the value of a question and how ChatGPT could, I am serious, agentically reach out to lawyers. I'm not going to edit that out, because "agentically" is not a fun word to say. It is just complete nonsense, and in fact, I'm not sure this piece reflects how GPT-5 even works at all. Again, quoting it: the router serves multiple purposes on both the

02:14

cost and performance side. On the cost side, routing users to mini versions of each model allows OpenAI to service users at a lower cost. Or with lower costs, even. To be fair to SemiAnalysis, it's not as if OpenAI gave them much help. OpenAI's official writings about the router aren't exactly filled with details, talking in glowing terms about what it does, but not how.

02:34

Here's what they say: ChatGPT's real-time router quickly decides which model to use based on the conversation type, complexity, tool needs, and your explicit intent (for example, if you say "think hard about this" in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each

02:59

model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model. And that last bit really doesn't make sense, but in any case, the launch of GPT-5 has been very, very weird. At first, some people seemed really happy about it. Chief among them, software YouTuber Theo Browne, who has over four hundred and sixty-eight thousand subscribers. He's also known as t3.gg, and he said:

Speaker 1

03:20

I didn't know it could get this good. This was kind of the like oh fuck moment for me in a lot of ways, and I've had to fight like a slow spiral into insanity. It's a really really good model.

Speaker 2

03:39

He finished by saying:

Speaker 1

03:41

Keep an eye on your job because I don't know what this means for us long term.

Speaker 2

03:45

Pretty crazy, right? Comments on the video included people saying things like "If OpenAI is holding you hostage, blink twice," and yes, that is a verbatim quote. Another saying, "This dude is everything wrong in it today." Another saying, "This video was sponsored by OpenAI." Another saying, "GPT-5 failed every test project I gave it today. It's a lie in my experience. Maybe they haven't ramped up the GPUs." Now, from what I can tell, Theo Browne played with GPT-5 in OpenAI's offices and

04:10

did all the benchmarking there. OpenAI, by the way, fucking how? Come on. You can't benchmark in their offices. Anyway, OpenAI's API-based access to GPT-5 models, you know, the thing that you use if you want to integrate GPT into your app, does not route them, by the way, nor does OpenAI offer access to its

04:29

router or any associated models. Important detail. Just want you to know that, because we need to make that very clear. Now, a week later, Theo Browne would put out another video called "I Was Wrong About GPT-5," which he would open by saying:

Speaker 1

04:41

So first and foremost, I want to make sure it is very, very clear that the experience that you probably are having with ChatGPT and GPT-5 right now is not the experience that I had when I was first testing it.

Speaker 2

04:53

Browne goes on to explain that he was not paid by OpenAI at all, that he was sincerely impressed by the company and GPT-5, and that he'd actually spent over twenty-five thousand dollars on inference testing it on his own company's software, and indeed also that he turned down a grand appearance fee. Sorry, I mean, that's a very British thing: a one-thousand-dollar appearance fee, not

05:13

just, like, a really nice one. Browne claims he asked OpenAI to try it out, and after they declined to let him test it early on his own, he was invited to try it on camera with a small group of other people at OpenAI's offices, where they'd film his reactions. He said that the API was incredible, but that it's become apparent that the models he was using in the video were not the same as those released

05:31

to the public, making a post on August thirteenth on X, the everything app, that GPT-5 was nowhere near as good in Cursor as it was when he was using it a few weeks ago, complaining that things that worked while demoing it at OpenAI no longer did, and adding that somebody else on Twitter said they'd had a similarly great

05:49

experience with GPT-5 on launch that has since decayed. It isn't completely clear what happened here, but I'm going to guess that OpenAI showed Theo Browne and others in their offices some sort of heavily modded version of the model that burns significantly more compute to provide its outputs, though I'm also very suspicious of how significant the difference is here. Browne's videos attempt to show the difference between the generations that you received from the model when it

06:12

was good and when it was bad in this video, which I'll include a link to in the episode notes. But if I'm honest, they look pretty similar, in that they're kind of mediocre. I'm not saying that as a hater, by the way. They just kind of look like shit. It's just kind of okay, like shit. They look like regular fucking generated websites. They don't look special. The good one is fine, and the bad one has weird gradients

06:34

on it. This whole thing sucks, though, and was a clear setup by OpenAI to overstate the abilities of GPT-5, one that fell apart with the lightest brush with reality. I imagine their assumption was that Browne would post the glossy video and then walk away, and I give Theo some credit for straight-up stating he was misled. This was a desperate move, and one that blew up in the face of OpenAI, along with

06:56

the rest of the GPT-5 launch. People hate the model, customers are mad at OpenAI for taking away models like 4o and have remained mad even with their return, and the ChatGPT subreddit is almost entirely people complaining about how ineffective the new version is and how even GPT-4o is not the same. They got game of

07:13

brain, baby. As I said in last week's monologue, I believe OpenAI has grown a fandom rather than any kind of sustainable product-market fit, and they're now suffering fandom-like hate with every minor change they make in an attempt to push GPT-5 further, further aggravating people

07:27

that barely understand why they use the product to begin with. Yet at the center of the anger lay the reason for GPT-5's launch: the belief that this was somehow a cost-cutting measure, where OpenAI had added a router to ChatGPT as a means of sending certain requests to cheaper models to save money. But when I hear router, I hear latency, and I never for even a second believed that this would somehow be cheaper to run. It didn't

07:49

make sense. I'm a curious little creature, so I went and found out how ChatGPT-5 actually works, and unlike the following incredible products that you should buy, it's actually kind of a big piece of shit. And we're back, and from here on out, I will define two things: GPT-5, referring to the model and its associated mini and nano models, and ChatGPT-5, referring to the current state of ChatGPT, which features Auto, Fast,

08:23

Thinking, and Thinking Mini model selections. You also can see legacy models, but that's not what we're talking about today, and that's also only for a little bit. It's a distinction I have to make, by the way, and make early, because the two things are different, they work in different ways, and ChatGPT-5's structure induces a bunch of trade-offs and downsides that, as I'll discuss later, make this

08:43

whole thing even more wasteful. In discussions with a source at an infrastructure provider familiar with the architecture, it appears that ChatGPT-5 is in fact potentially more expensive to run than previous models, and due to the complex and chaotic nature of said architecture, can at times burn upwards of double the tokens per query. Tokens, for those who don't know, are basically chunks of text that the AI models do stuff with. I'm simplifying this. Do not

09:08

email me to correct some minor thing. Nobody cares. A sentence like "the quick brown fox jumps over the lazy dog" will be broken into lots of smaller chunks of roughly four characters each. There are different kinds of tokens, and they're all priced differently. An input token refers to the data you send to the model when you ask a question. Output tokens are used to measure the size of its response, with bigger responses requiring more tokens. The more tokens you burn per query,

09:30

the more expensive it is to run that query. The fact that ChatGPT-5 can, in certain circumstances, burn twice the number of tokens per query means that every question costs more. ChatGPT is also significantly more convoluted, plagued by latency issues, and more compute-intensive thanks to OpenAI's new, smarter, more efficient model routing system. In simpler terms, every user prompt on ChatGPT, whether it's in Auto, Fast, Thinking, or Thinking Mini, starts by

09:55

putting the user's prompt before the static prompt. I don't want to lose you here. This is important. A static prompt is the invisible instructions given by OpenAI to ChatGPT, to the models themselves and the tools associated with them, to tell them how to operate. Instructions like "You are ChatGPT, you're a large language model, you're a helpful chatbot, do not threaten them with a knife," and so on and so forth. These static prompts are

10:17

different with each model you use. A reasoning model will have a different instruction set to a more chat-focused one, such as "think harder about a particular problem before giving an answer," "break down problems into component answers," or, when you get a certain thing, like if someone asks you a coding question, "query a coding tool." That kind of thing. A user prompt is exactly what it sounds like: the thing that a user wants the AI model to do.
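
To make the token arithmetic above concrete, here is a minimal sketch in Python. It assumes the episode's simplification of roughly four characters per token (real tokenizers split text differently), and the prices, prompt text, and function names are invented for illustration. They are not OpenAI's actual numbers or API.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb from the episode, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate a token count by assuming ~4 characters per token."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def query_cost(input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
    """Cost of one query in dollars; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical static prompt and user prompt, for illustration only.
static_prompt = "You are ChatGPT, a large language model. Be helpful."
user_prompt = "The quick brown fox jumps over the lazy dog"

# What the model reads is the static prompt plus the user prompt.
input_tokens = estimate_tokens(static_prompt) + estimate_tokens(user_prompt)

# Made-up prices: $1 per million input tokens, $10 per million output tokens.
normal = query_cost(input_tokens, 200, input_price=1.0, output_price=10.0)
doubled = query_cost(2 * input_tokens, 2 * 200, input_price=1.0, output_price=10.0)
```

Under these assumptions, doubling the tokens in a query doubles its cost, which is the whole point of the "double the tokens per query" claim.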

10:38

The new order in ChatGPT-5 becomes an issue when you use multiple different models in the same conversation, because the router, the thing that selects the right model for the request, has to look at the user prompt first. It can't consider static instructions first, because they may be different based on what the user asked. In fact, the order has to be flipped for the whole thing to work.
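
Why the order matters for caching can be sketched in a few lines. Prompt caching generally works by reusing work for the identical leading tokens of a request, so the toy below just measures how many leading tokens two requests share. Whitespace-split words stand in for tokens, and the prompts are invented. This is an illustration of the general prefix idea, not a description of OpenAI's internals.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count how many leading tokens two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical static prompt and two unrelated user turns.
STATIC = "you are a helpful assistant follow the rules".split()
turn_1 = "what is a router".split()
turn_2 = "draw me a chart".split()

# Static prompt first: both requests begin with the same eight tokens,
# so that prefix is a candidate for reuse.
static_first = shared_prefix_len(STATIC + turn_1, STATIC + turn_2)

# User prompt first: the requests diverge at the very first token,
# so there is no shared prefix to reuse.
user_first = shared_prefix_len(turn_1 + STATIC, turn_2 + STATIC)
```

With the static prompt in front, the shared prefix is the entire static prompt. With the user prompt in front, it is nothing.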

10:56

Put simpler, previous versions of ChatGPT would take the static prompt and then invisibly append the user prompt onto it. This static prompt would typically be cached, massively reducing the amount of compute the model needs to perform a task. ChatGPT-5 cannot do this. Every time you use ChatGPT-5, every single thing you say or do can cause it to do something different. Attach a file? Might need a different model. Ask it to look into something

11:20

and be detailed? Might trigger a reasoning model or a different depth of reasoning. Ask a question in a weird way? Sorry, the router is going to need to send you to a different model entirely, each time coming up with new instructions based on the subtle interpretation of what you asked. Every single thing that can happen when you ask ChatGPT to do something may trigger the router to change model

11:39

or request a new tool, and each time it does so requires a completely fresh static prompt, regardless of whether you select Auto, Thinking, Fast, or any other option on ChatGPT. This in turn requires it to expend more compute, with queries consuming more tokens compared to previous versions.
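
Here is a rough way to see how that adds up over a conversation, as a sketch under stated assumptions: the static prompt is a fixed 1,000 tokens, each user message is 50 tokens, conversation history is ignored, and a cached static prompt costs nothing after the first turn. All of those numbers and the function name are hypothetical simplifications, not measurements.

```python
def fresh_input_tokens(user_tokens_per_turn: list[int],
                       static_tokens: int,
                       static_cached: bool) -> int:
    """Input tokens that must be processed from scratch over a conversation.

    static_cached=True models the older behavior described in the episode:
    the static prompt is paid for once, then reused on later turns.
    static_cached=False models a fresh static prompt on every single turn.
    """
    total = 0
    for turn_index, user_tokens in enumerate(user_tokens_per_turn):
        pay_for_static = (turn_index == 0) or not static_cached
        total += user_tokens + (static_tokens if pay_for_static else 0)
    return total


turns = [50, 50, 50, 50]  # four short user messages (made-up sizes)

with_cache = fresh_input_tokens(turns, static_tokens=1000, static_cached=True)
without_cache = fresh_input_tokens(turns, static_tokens=1000, static_cached=False)
```

In this toy, the cached conversation processes 1,200 fresh input tokens and the uncached one processes 4,200, and the gap widens with every additional turn.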

11:54

It's like you started a job, and every time you do a task, write an email, make a cup of coffee, attend a meeting, email someone with a threat, your workplace requires you to complete the entire mandatory onboarding training first. Want to edit a spreadsheet? Not before you brush up on your anti-bribery legislation first, you prick. As a result, ChatGPT may be smart, but it doesn't really seem

12:16

efficient in the GPT-5 version. Now, to play devil's advocate, OpenAI likely added the routing model as a means of creating a more sophisticated output for a user, and, I imagine, with the intention of cost saving. Then again, this might just be the thing it had to ship. After all, GPT-5 was meant to be the next great leap in AI, and the pressure was on to get it out the door. By creating a system that depends on an external routing model, likely another LLM in this case,

12:41

OpenAI has removed the ability to cache the hidden instructions that dictate how the models generate answers in ChatGPT, creating massive infrastructural overhead. Worse still, this happens with every single turn, as in message, on ChatGPT-5, regardless of the model you choose, creating endless infrastructural baggage with no real way out that only compounds based on how complex the user's queries get

13:02

or how much they change. They could be simple, but just going in different directions every time. Could OpenAI make a better router? Sure. Does it have a good one today? No. Every time you message ChatGPT, it has the potential to change model or tooling based on its own whims, each time requiring a fresh static prompt, and short of totally reworking the architecture of ChatGPT-5, there's no

13:22

way to change this. And if it's an LLM choosing which model, I don't know, maybe it hallucinates. Just a guess. It doesn't even need to be the case where a user asks ChatGPT-5 to think, and based on my tests with GPT-5, sometimes you can just ask it a straightforward question and it will think about it for no apparent reason. OpenAI has created a product with latency issues and an overwhelmingly convoluted routing system that's already straining capacity, to the point that this announcement feels

13:48

like OpenAI is walking away from its API entirely. This, as a reminder, is the thing that people use to incorporate OpenAI's models into their apps while also running said models on the infrastructure OpenAI rents from Microsoft and CoreWeave, and at some point Oracle as well. And this API thing is really weird, by the way, because these are new models, but OpenAI's really not

14:08

talking about the models themselves that much. Unlike the GPT-4o announcement, which mentions the API in the first paragraph, the GPT-5 announcement has no reference to it and only has a single reference to developers at all, when

14:19

talking about coding. Sam Altman has already hinted that he intends to deprioritize any new API demand, though I imagine it will let in anyone who will pay for priority processing, which is essentially OpenAI's way to require minimum commitments and extra payments from API customers just so they never feel the bite of any compute shortages and throttling, which

14:37

they absolutely will do to people that don't pay. ChatGPT-5 feels like the ultimate comeuppance for a company that has never been forced to build a product, choosing instead to bolt increasingly complex tools onto the side of models in the hopes that one will magically appear. Now, each and every feature of ChatGPT burns more money than it ever did before. ChatGPT-5 feels like a product that was rushed to market by a desperate company that had to get something out the door. In

15:00

simpler terms, here, it's actually really funny. When I worked this out, I chuckled. I chuckled vigorously. This is just a case where OpenAI has given ChatGPT a middle manager. But now I'm giving you the chance to open up your hearts and do something better. Open up your wallets too, and send money to a company that follows here.

15:19

Behold, my advertisements. And we're back. Like every great middle manager, ChatGPT-5's router creates more work based on its own interpretation of what's going on, and as a separate large language model, I can't imagine it has a ton of training data available. If I had to guess, and this is a guess, by the way, OpenAI has done, and will do, a lot of fine-tuning and reinforcement learning to make it work. Though, to give it a little grace, this is a new thing that it's doing,

15:57

and it's doing so at a huge scale. The problems start, by the way, with the fact that ChatGPT-5 is taking the user's initial prompt and then deciding which model to use, unlike previous models, which sent your prompt directly to the model along with the static prompt, which was cached and came first, an important feature in how these models limit token burn. OpenAI starts with a router model that takes what you ask ChatGPT and tags it based on what kind of

16:22

thing your question might need. The thing might be a tool, such as whether it has to do a web search to spit out the thing at the end, a reasoning model, whether it needs to use a coding language, and so on and so forth. Once ChatGPT has bounced your query across various models, burning compute along the way, it then pushes it towards the chat portion of the generation.
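
As a purely illustrative sketch of what "tagging" a prompt could look like, here is a toy router in Python. The real router is reportedly a separate language model, not keyword rules, so treat every name, tag, and trigger phrase below as a hypothetical stand-in. The "think hard" trigger is borrowed from OpenAI's own description quoted earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    needs_reasoning: bool                           # send to a reasoning model?
    tools: list[str] = field(default_factory=list)  # hypothetical tool tags


def route(prompt: str, has_attachment: bool = False) -> Route:
    """Tag a prompt with what it might need (keyword rules as a stand-in)."""
    text = prompt.lower()
    tools: list[str] = []
    if "search" in text or "look up" in text:
        tools.append("web_search")
    if "code" in text or "python" in text:
        tools.append("code_interpreter")
    if has_attachment:
        tools.append("vision")
    return Route(needs_reasoning="think hard" in text, tools=tools)


simple = route("hi there")
deliberate = route("Think hard about this")
research = route("search the web for routers")
```

Even in this toy, the decision depends entirely on the user's message, which is why nothing downstream can be chosen, or prepared, before the message arrives.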

16:42

And each time you ask ChatGPT a question or to do something, a new specialized static prompt is generated, sometimes several, making it impossible to cache them in advance. In simpler terms, each time you message it, ChatGPT has to dump all cached information and instructions for what it needs to do and reload it with each prompt.

16:59

Now, here are some examples of what ChatGPT-5 has to reload every single time you prompt it: whether or not to use a browser or search the internet, and under what conditions to do so, because they will change

17:09

with each prompt. How to approach a particular problem based on what the user asked, including any specific ways it's meant to answer, tone, brevity, and so on, based on their request. Specifics around how it might use, say, OpenAI's code interpreter, such as the usage rules for running a Python script, or how you want the code's output,

17:25

which again will be different based on each prompt. And you can even say, "do it in exactly the same way," and because it's a large language model, it may hallucinate something different. Every single goddamn time you prompt ChatGPT-5, it has to do this. Worse still, a particular conversation can involve you using multiple different models and tools, requiring it, with each and every prompt, to inject a different static prompt for each component that

17:49

ChatGPT-5 uses. And you can't cache the static prompt before the user's intent, because if you did that, it might send an instruction to a model that doesn't make sense, such as telling a reasoning model to give a quick and simple one-line answer, or a mini or nano model to do some sort of deep reasoning, which would create a crappy answer and burn tokens in the process. And this is all thanks to the complicated way that OpenAI

18:10

insisted on building GPT-5. Every single thing you send to ChatGPT can trigger it to use a different series of models, audio, vision, reasoning, each with their own instructions, static prompts, all while pulling different tools, each requiring their own instructions based on what you asked, and

18:27

reasoning models even have different depths of reasoning. Unlike 4o, which is a multimodal model combining text, vision, and voice, GPT-5 is a rat king of OpenAI's models and tools that gets reborn every single time you ask it to do anything. It can prompt-cache some things, but the core instructions, not so much. But let's get a little more granular, because I know I've been quite repetitive, but this is detailed. So from what I've been told, there are either one or two models

18:53

at work for the routing. I'm going to go with what I think is most likely based on the discussions I've had with people familiar with the architecture. I've heard the term "orchestrator" thrown around, potentially suggesting the router may be more omnipresent throughout the process, but I was unable to confirm its existence. Reach out if you hear differently. I'll explain things as they were explained to me, though.

19:13

When a user sends a prompt, it goes through the splitter leg, which decides to send the query on one of two paths. One is called the fast path, where a query is straightforward, such as a text-only conversation that doesn't require any analysis or extra tools, or the thinking path, where the query may require reasoning or more complex tools like code generation

19:30

or access to a web browser for research. To be clear, there are prompts where it may be split into multiple paths that trigger multiple models or tools, each requiring their own static instructions. From what I understand, the splitter model is a completely separate large language model, though we don't

19:44

have a ton of details about it. I also, based on conversations I've had, think there's a chance there could be a separate model that sits above the splitter that does much lighter classification of how a query might be routed. So you ask it to do something, and it might just go, "Okay, this looks like it needs a tool," and go off that way. Now, in any case, none of this can be cached, because all of this exists before inference, which is where, by the way, it's "inference," I've misstated

20:07

it in the past. It's like inferring, meaning inference is everything that happens to get an output to you. So all of the stuff that's happening, and by the way, this is all a completely new cost that OpenAI has created. No one does this like this. It's so fucking stupid. But now we get to the chat leg. Now that OpenAI has added layers of abstraction, it can begin cooking up the output, by which I mean

20:30

do inference. The chat leg is where the pieces that the splitter model created are pulled together, each loaded with their respective static prompts based on what the user asked ChatGPT-5 to do. Each piece of the model, a tool to generate Python, an image generation tool, a reasoning model, has to process an entirely new static prompt to generate an output, and again, that's

20:49

every interaction. Remember, static prompts are effectively instructions. So the splitter model has told each piece of the pie how to act to create a particular output. As a result, much of this can't be cached, creating more and more repetitious token burn. And I'm going to have to repeat

21:03

this stuff so that you really get it. The upshot of the chat leg's static prompt baggage is that you can do a little more here, at least in theory. Because each component can be instructed separately, they can, again in theory, be made to give more individualized, specialized outputs, like creating an image with tags that is, as I'll give you an example of very shortly, generated using a

21:21

specific reasoning model. I'm clutching at straws here. I don't really know if this is better, but I'm trying to be reasonable. I'm trying to be normal. Every day, I try and be normal. Previously, OpenAI's advantage was that a model like 4o was kind of a jack of all trades. But to get the benefits of ChatGPT-5, and that's in air quotes, it's engaged a conductor model that can just make things more convoluted, even in the case of simple requests. Let me give you

21:49

an example. You upload a chart of NFL players' stats and ask ChatGPT to decide which is the best of the group and create an image to show the results. In GPT-4o, ChatGPT would use one model and thus one static prompt to look at the image, decide which tools to use, and then how to format the response. You only needed one prompt, which was cached, because one model can look at the stats for all the data and make the decisions and then use the

22:11

image generation tool to make the final image. In GPT-5, the ChatGPT conductor model would see the stats, route it to a vision model requiring its own static prompt, then a separate text-only reasoning model, one that has no ability to use tools but might be cheaper to get an answer from, and also requires a static prompt, and that would then decide which players are best and then spit out an output, and then route it to a completely separate model that can generate text to query

22:35

the image tool (again, you need a static prompt for this) to then generate the image. On top of all this onerous baggage lies another problem: GPT-5's various models

22:44

are just more complex. By splitting out the component elements of what a model can do and allowing each model to have different levels of reasoning, even the cheaper ones like mini and nano, OpenAI has created an endless combination of different reasons to have to make a brand new static prompt instruction, automated by a router, a large language model that chooses what large language model to choose for a query. It is, if I'm honest, kind of funny.
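
The NFL chart example above can be put into rough numbers. This is a sketch under loud assumptions: the stage names are invented labels for the steps as described, every static prompt is assumed to be 1,000 tokens, and a cached static prompt is treated as free, whereas in practice cached tokens are typically discounted rather than free.

```python
# Hypothetical stage names for the NFL chart example; not OpenAI's real components.
SINGLE_MODEL = ["multimodal_model"]
ROUTED = ["vision_model", "text_reasoning_model", "image_prompt_model"]


def static_prompt_overhead(stages: list[str],
                           tokens_per_static_prompt: int,
                           cached_stages: frozenset = frozenset()) -> int:
    """Static-prompt tokens processed fresh for one request.

    Every stage needs its own static prompt. Stages listed in
    cached_stages are treated as costing nothing (a simplification).
    """
    return sum(tokens_per_static_prompt
               for stage in stages
               if stage not in cached_stages)


# One model, one static prompt, already cached.
single = static_prompt_overhead(SINGLE_MODEL, 1000,
                                cached_stages=frozenset({"multimodal_model"}))

# Three stages, three static prompts, none cached.
routed = static_prompt_overhead(ROUTED, 1000)
```

Under these made-up numbers, the single-model path pays for no fresh static-prompt tokens, while the routed path pays for three static prompts on the same request.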

23:08

Reasoning models work, when simply described, by breaking up a prompt into component pieces, looking over them, and deciding what the best course of action might be. ChatGPT's router is effectively an abstraction higher, breaking up the prompt into component pieces, then choosing different models for each of those pieces, which may in turn be broken up by a reasoning model.

23:26

While I wouldn't say this is a hat-on-a-hat situation, it is at this point unclear what exactly the benefits of ChatGPT-5's new architecture are. Fewer hallucinations? Better answers? Based on what I've been told, this was a decision made to increase the model's performance. What I can say is that this very likely increased OpenAI's overhead at a time when it needs to do the

23:45

exact opposite. Even if ChatGPT-5 pushes people towards cheaper models, it does so while guaranteeing extra costs and latency, and whatever signals it may learn as people use this will have to create significant benefits, massive one-hundred-percent-plus gains, for it to be anything close to worthwhile.

24:02

While OpenAI's router may be smart in terms of the nuance of how it might answer a query, and even that I question, it most decidedly is not more efficient, and may have actually increased the burn rate for a company that will lose as much as eight billion dollars this year, and I think that number might be low too. Yet what I'm left with in writing

24:19

this script is how wasteful all of this is. OpenAI, a company that has already incinerated upwards of fifteen billion dollars in the last two years, has chosen to create a less efficient way of doing business as a means of eking out modest-at-best performance improvements. It

24:35

just sucks. In our own lives, we're continually pushed and pressured and punished if we get into debt, judged by our peers and our parents if we spend our money recklessly, and if we're too reckless, we find ourselves less likely to receive anything from credit to housing. Companies like

24:49

OpenAI live by a different set of standards. Sam Altman intends to lose more than forty-four billion dollars by the end of twenty twenty-eight on OpenAI, and graciously told CNBC, like Lord Farquaad, he was willing to run at a loss for a long time, where he was treated like he was this smart, reasonable decision maker rather than someone that needs to rein in their horrendous

25:09

spending habits and be more mindful. The ultra-rich are rewarded far more for their errant spending habits than we ever are for any thriftiness or austerity measures we make, and none of us are afforded the level of grace that Clammy Sam Altman has been. And "has been" feels appropriate. ChatGPT-5 is an engineering nightmare, a phenomenally silly and desperate attempt to juice what remains of the dying innovation and excitement within the walls of OpenAI. It's

25:34

not November twenty twenty-two anymore. And let's be honest, there really hasn't been anything exciting or interesting out of this company since GPT-4. There's nothing exciting happening at this company. As many as seven hundred million people a week allegedly use ChatGPT, but nobody can really say why. And OpenAI, despite its massive popularity, cannot seem to stop losing billions of dollars, and it can't seem to explain

25:56

why that's necessary other than "this shit's really expensive, dude." Can anyone actually articulate a reason why we need to burn billions of dollars to do this? What are we doing? Why are we doing it? Has everybody just agreed to do this until it becomes completely untenable? Do we all yearn for the abyss so much that we can't find camaraderie in admitting we were wrong? Look at GPT-5.

26:17

This is, if you believe the hype, the best-funded, best-resourced company in the world, with the greatest mind at its helm and the greatest minds within its walls. And this is the best they've got: a large language model that chooses which large language model will answer your question. Gee fucking whiz, Sam Altman, sounds dandy. And how much better is this, you say? Oh, you can't really say? Fucking brilliant. Hey, does it do anything new?

Speaker 3

26:40

No?

Speaker 2

26:41

Oh, what's that? It's actually our job to work that out for ourselves? Thanks, man. I love it. I love this shit. And if you're someone that is a hype merchant listening to this, and you've done really well getting to the end of the third part, by the way, I respect you, I want you to email me and explain why they should be justified in burning billions of dollars. If you tell me AWS, I will eat you alive. I mean that. I mean that

27:06

completely literally. I will unhinge my jaw. I'll eat you like Kirby and shit out of dance. I've said that one before, but I'm going with it. In any case, this three-parter has also really reminded me how ridiculous this is, how nonsensical things have become, and how much waste has been kind of justified on this idea that this will become something, by people that don't really know what it does today or might do in the future. None of this is going to end well, and not

27:34

even the boosters seem to be having fun anymore. Everybody's just flailing around waiting for it to end. Even Sam Altman seems tired of it all. I know I bloody well am. Thank you for listening to Better Offline.

Speaker 3

27:54

The editor and composer of the Better Offline theme song is Matt Osowski. You can check out more of his music and audio projects at mattosowski.com, M-A-T-T-O-S-O-W-S-K-I dot com. You can email me at ez@betteroffline.com or visit betteroffline.com to find more podcast links and, of course, my newsletter. I also really recommend you go to chat.wheresyoured.at to visit the Discord, and go to

Speaker 2

28:20

r/BetterOffline to check out our Reddit. Thank you so much for listening. Better Offline is a production of Cool Zone Media. For more from Cool Zone Media, visit our website, coolzonemedia.com, or check us out on the iHeartRadio app, Apple Podcasts, or wherever you get your podcasts.

Transcript source: Provided by creator in RSS feed: download file
