#200 - ChatGPT Roadmap, Musk OpenAI Bid, Model Tampering | Last Week in AI podcast

⁠¶ Intro / Banter

00:11

Hello and welcome to the last week in AI podcast, where you can hear us chat about what's going on with AI. as usual in this episode, we will summarize and discuss some of last week's most interesting AI news. And you can also go to lastweekin. ai for all of the links for this episode and also for our podcast. Text newsletter. I am one of your hosts, Andrej Korankov. My background is of having studied AI in grad school and now working at an AI startup. And hey everybody, what's up?

00:40

My name is Jeremy Harris. I'm your other co host co founder of Gladstone AI, an AI national security company. You know that if you're listening to the podcast every week. So it must get tiring to get the bios every time, but at least now, you know, we're, I guess, we're getting there. Yeah, I wonder, we'll maybe consider returning your bios at every time. I don't know how many new listeners we get that justify that.

01:01

And I guess we can go in and jump to, addressing the fact that this will be the 200th episode of the AI. There's a few more on the podcast feed due to some interviews and stuff, but this will be number 200. Damn, I feel old. So let's quickly celebrate the bet for a bit.

⁠¶ Response to listener comments

01:25

Amusingly, there was also an Apple podcast review that just said, don't ever stop. So I guess so far pretty good. 200. We've had a good record of doing this. Jeremy, you joined. At episode 110, for reference. Okay. So I'm still, yeah, I'm, I'm on episode 90. Yeah. You're on episode 90 for anyone who hasn't been a longtime listener, we were doing this back starting 2020. We had a different co host Karen. podcast almost did stop actually in, 2022.

01:59

We had a bit of a break and then thanks to Jeremy, we were able to come back and keep going. So we'll see, hopefully we won't stop at least until AGI gets here. I think that's pretty cool. That gives us at least two weeks or so. I know. given that maybe a good time to make some changes maybe we'll retire the bios starting next episode.

02:19

And we'll also start discussing listener requested topics Towards the end of a show, since we have gotten more of those kinds of things, that'll be kind of the last section. After all the news, the only thing we'll be talking about as far as listeners is any corrections that we need to address. So maybe that will be. Better for a lot of listeners.

⁠¶ News Preview

02:41

Quick preview of what we'll be talking about in this episode. We have some cool new models in tools and apps. Adobe is coming out with a new store arrival, and we have some updates to kind of the timeline of LLMs. Applications and business, of course. More OpenAI drama with Elon Musk. Gotta talk about that. And some kind of previews of what's coming with other LLMs like Anthropic.

03:09

Got a few neat projects and open source stories with one text to speech model and some data sets, research and advancements, similarly talking about some distillation scaling laws and some tampering attacks for LLMs, which would be cool. Policy as safety, a bit of a mix of discussions between the U. S. and UK, some geopolitical stuff and paper also on utility engineering. And we'll cap it off with a discussion of some updates to AI copyright law, which has actually happened at long last.

03:47

Yeah. And then we will cap it off with a bit of discussion of listener requested topics. So hopefully you won't go forever and we'll actually get to that before we are completely out of energy.

⁠¶ Tools & Apps

03:58

And with that being said, let us get going in tools and apps. Our first story is Adobe's Sora Rivaling AI video generator is now available for everyone. So Adobe has had a whole suite of generative models of their own, starting with Firefly. Now their video generator that they call generate video is out for public beta access and it is available through the Firefly web app which also has image generation and some translation stuff.

04:34

There are two ways you can use it text to video and image to video so pretty similar to other things. Videos are output in HD 1080p. 24 frames per second, and you can generate them in 90 seconds or more. So it takes, you know, a minute and a half, two minutes to generate something. the length of those is up to five seconds. So yeah, this is a preview, of course. And I guess we'll see how long it takes to actually go out into their mainline tools.

05:09

They have been integrating various generative models into Photoshop, for instance, with generative fill. And Adobe is marketing this generate video tool as a production ready kind of tool for use in films about any copyright issues, which has also been their kind of line with all of our generative models that this is something you can use. When making a film and not worry about people suing you. So interesting to see if this actually does have a lag up against Sora and some other competitors.

05:40

Yeah, it really feels like we're reliving right. That, kind of on ramp the generative image, like text to image on ramp that we felt sort of two years ago, two, two and a half years ago, you know, over the course of about 18 months or so, you went from pretty slow, frustrating, unremarkable capabilities to like, you know, photorealistic stuff. And, and now. From one release to the next, you can't really tell the difference. I will keep beating this drum.

06:03

I think this is kind of, a race to the bottom on, pricing. Ultimately, it's going to happen here as well, where everybody's just going to be charging essentially for the cost of inference plus some small margin. This is really looking like it's going to be commoditized as well. I, I don't really see an intrinsic reason why video would be any, any different from images, but just to give you a sense still of the roadmap. So the video output that they can provide here is in 10 ADP.

06:25

24 frames per second. So we had decent ish maybe a little on the low end. but it isn't an upgrade from the original seven 20 P quality. So we're getting, you know, more and more into deeper HD here. Interestingly, still choosing to stick with that five seconds, right? By contrast Sora can do 20 seconds already. So, you know, you sort of see this trade off, like where do you put the compute?

06:45

Do you put the compute in generating longer videos that are logically coherent across longer periods of time, or do you invest your compute in higher resolution? You know, your four K, your 10 80 P that sort of thing. By the way, Adobe does say they're also working on a faster, lower resolution, what they call an ideation model and a four K model, which are both coming soon, supposedly.

07:05

But this is I guess an interesting use case and a bit of a bifurcation that At least and I'm sure there are things like this in text to image. You're gonna have to excuse me because I don't know that space quite as well other than what we we talk about on the podcast. But there's always in every domain we get deep enough. They're interesting tradeoffs. But with video, it's extra obvious, right? This this idea of like, do I go higher resolution? Do I put my computer into higher resolution frames?

07:29

Yeah. Or into coherence across a larger number of frames. You can see why open AI might've gone for the coherent side more than Adobe with these 22nd videos, just because coherence is at least intuitively a bit closer to the sort of reasoning that you need to make AGI, right? You have to be able to track objects and their interactions across longer periods of time, that kind of starts to look more world model like than You know, higher resolution image.

07:57

So again, you know, both, there's some truth to both of them. I just intuitively think maybe the, the longer time resolution of opening eyes Sora is a little bit more in tune with that. So, you maybe an interesting niche for them to carve out for themselves here at Adobe focusing on, you know, higher resolution video quality, 4k and all that, Right. And I think we did discuss this Preview, they posted some videos of it prior to which will be available.

08:21

So they are trying to differentiate in a couple other ways. They do say that, for instance, you can prompt a model with camera controls as one of the kind of common things things like zoom out and give me points.

08:35

This particular shot and my end, I will keep beating the drum, but I think image to video is a much bigger deal than text to video in general, as a kind of an application of special effects of you know, removing weird artifacts in your video, all sorts of practical applications I can see being used. So wouldn't be surprised if they're also emphasizing that. And also worth covering, in addition to this coming out, they have introduced new for Firefly with credits for these models.

09:05

So there's a Firefly standard subscription, which costs 10 per month, gives you 200 video or audio credits. Not too sure how much that converts to seems like maybe 25 second video generations. So not a ton and there's the pro subscription, which is 30 a month, which is 7, 000 credits. So about 70. Five second videos. So starting to commercialize it.

09:37

wonder how this will connect to their existing subscription tiers for their tool use for things like the editing suite and moving on next, we have open AI and them saying what we'll get to GP five as opposed to anything of that they are releasing. So Sam Altman posted on Twitter slash X, where he had. Not a super detailed, but a fairly specific idea of a roadmap they're looking at.

10:10

And the highlight is that they are trying to move away from their current paradigm, where you have like six models to choose from. They, you know, have GP40, You have all one preview, I think, still, or free mini now. And if you use chat GPT, you have this model drop down where you need to select which model you want to use for any given task. And you have, you know, various amounts of token usage per model. So it all gets a little bit easier. cumbersome as a specifically new user as well.

10:44

So what they are saying is that the new series of models will be sort of unifying this O series of reasoning models and the GPT series of models, where the model will just sort of know what to do based on your prompt. And so we won't be seeing, presumably O3, we won't be seeing anything until we get this kind of unified. tool. And apparently when it does come out, free chat GPT users will have unlimited access at what you're calling the standard intelligence level.

11:21

And plus subscribers will have a higher intelligence level. So, yeah, it seems like they're going to change up for user experience, which is interesting given that that hasn't really happened since kind of the early days of chatbots. Yeah, it's been good inspiration for me to anytime I offer any consulting work, I now have the two tiers. I offer them the standard intelligence level and then the high intelligence level. That's yeah, that's a good business model. It really is. It really is.

11:51

This podcast, if you're wondering is, is high, this is as high as I can go. So don't bother. But anyway so one, one interesting update too, is We didn't know that this was going to happen, right? The big update here is the bundling. Essentially, the as Sam says somewhere in his tweet, right? He's like, we hate the model picker just as much as you do and want to return to magic unified intelligence, right?

12:12

They want this feeling to be that you show up at the interface, the console or whatever you pump in your query, you just kind of get the answer from the model that's most appropriate for your query. And that does make sense. We did get I think a couple of weeks ago, Kevin while opening eyes, a CPO. Did say that they were on track to release 03 in the kind of February, March timeframe.

12:33

So this is still consistent if you kind of include it in that, that rollup though, Sam maybe gave himself a bit of wiggle room. He gave this ambiguous timeline of when these things would roll out saying it'd be weeks slash months for GPT five at least. So there is a little bit of ambiguity, but as you said, I like what you said, not high detail, but high specificity. That does seem like we're getting here. We're finally getting named. Releases. One thing to keep in mind, right?

12:58

So GPD 4. 5, Sam Altman is now saying that is going to be the model internally. That was code named Orion. We saw a big stink about that for a long time, right? The strawberry project, which turned into 01 and then Orion, which is turning into GPT 4. 5. And we've had this conversation on a podcast before a lot of people Talk about, you know, what will GPT 5 be able to do? What will GPT 4. 5 be able to do? And it's really kind of, in a sense, pointless to talk about things in quite that way.

13:27

Some people use that colloquially to mean like really the next order of magnitude or two rooms of scale, which makes total sense. But for others, there's this kind of fixation on just like the incrementing of the name. and that's just something to keep in mind. Open AI internally is building a crap ton of different models. they don't themselves know which one they're going to call what. And so for a long time, this thing was just called Orion and that was just the model under development.

13:51

And if it seemed like the vibes were right, they would release it as GPT 4. 5, which happened here, but that hasn't. You know, that wasn't locked in initially.

13:59

So just to kind of clarify there, you know, if you, if you're looking at speculation on GPT 4. 5, whatever the ooms are the thing that matter a lot more, the actual orders of magnitude of compute that's thrown at these models, the names, not so much, but in any case at least now we have a name to hang onto what was previously the sort of secretive Orion base model project. And by the way, that is a base model.

14:20

So GPT 4. 5. Is going to be the last non chain of thought model that opening I ships from now on, apparently, they're all going to be focused on well, chain of thought is a loosely defined thing, but they're going to be focused on models that use the kind of Oh three Oh one reasoning paradigm. That's at least going to be the default. And, and they will still have these models like GPD 5 that can one shot, but the goal is is to kind of orient more and more towards reason. Right.

14:49

And to preview a bit, we will be talking about some similar sounding plans that are entropic down in the business section because it was some business use, but the idea of unifying reasoning models of non reasoning models seems like it's kind of up in the air where there's no. Clear needs to separate, right?

15:11

You can just enable a model to be able to reason if it wants or thinks it's warranted or not to reason and just to output without this long trace, which actually does seem like a bit of a no brainer when you think about it. Like, why do you need. to separate these out. And it seems like a future maybe that you don't, that you have a single model capable of both paradigms. And as to the naming, I do think it's a bit interesting to just take a quick look back.

15:39

ChatGPT when it first came out late 2022 was built on top of GPT 3. 5 at the time. GPT 4 came out in March of 2023.

15:51

And then since then, we've gotten into the O model family, which was Omni, because they added things like audio to it, GP40, and now they have O1 as their reasoning model, so they did basically Stop incrementing in favor of having new types of model names to reflect new capabilities and new types of model from GPT 4, where it used to be GPT 2, 3, 3. 5, 4, as you said, that was always just going up in scale and improving on your intelligence levels. And since then, since.

16:29

You know, mid 2023, we've seen the kind of more focus on broadening what the models are capable of in terms of inputs, outputs, and I guess the way they do reasoning. Next up we have another bit of news from OpenAI. Apparently they're rethinking how AI models will handle controversial topics. So, this is part of a released 60 page model spec. Which is their guidelines for AI model behavior. And this updated spec does include details as to how to handle these controversial topics.

17:09

And seemingly they will make it so the AI is able to do some slightly more. I don't know, spicy things. So for instance, there would be availability to allow certain adult content in appropriate context. And there may also be variability for users to kind of enable or disable safety guardrails to customize that a bit. So, Yeah, again, not perhaps too surprising. I think many people using LMS do complain or criticize when models are overly restricted and overly careful.

17:48

So them moving away a bit could be a reasonable move. Yeah, it's, it's the model spec also is something we've covered before, but just, if you're, if you're new to the concept, In some ways, it sounds like a document that might be, you know, fine tuned on, you know, maybe just thrown at an LLM and, you know, fine tune it on this and that, you know, that tweaks its behavior. Some truth to that, but mostly it's a meta instruction document.

18:13

It's not just something that you load into a models context or just uses fine tuning, but generally describes the approach opening I takes. And that it recommends developers take at every level of the stack from, you know, data collection to training to serving the model to using it in applications. And so it's sort of like overall meta instructions to themselves and to others. It is a lot more kind of detailed than it was previously.

18:38

And as you said, there's a lot of emphasis on what they call intellectual freedom here. Sort of interesting. Yeah. You know, one might be tempted to suspect that this is sort of a response to the change in administration that, you know, previously they were perfectly happy Sam Altman having been like kind of a longtime Democrat. They were perfectly happy to kind of align with with the sort of you know, more, I guess.

19:02

The, the more kind of woke orientation of, of the chatbot as it had been previously. Now kind of switching over, which is a very Sam Altman thing to do. Kind of go with the winds of change and and that's sort of what you're, what you're getting here. It is, use, it is useful. And I, I think, a lot of people, as you said, sort of getting frustrated with, dealing with the model that was kind of overly constrained previously.

19:24

So yeah, I mean, there are a whole bunch of ways in which this thing gets used. It's almost like pointless to go through them all. but one sort of sub aspect here is, you know, there's like this chain of command that's outlined in the model spec that describes how the models are, you know, Are meant to prioritize different levels of instruction, right? You can think of like they're like platform level rules that open AI uses to override everything else.

19:49

Like these are like safeguard, safety safeguards or legal compliance things. But then you've got like a system prompt. So how do you rank the system prompt and importance to the prompts that the developer puts in? And they've got all kinds of measures. And we covered a paper that deals with this as well. It also contains instructions that they give to the developer. to human labelers during reinforcement learning for human feedback data collection.

20:10

And so the model spec really is a very all encompassing document in that sense. And, and yeah, there's a big blog post that comes along with it where they're sort of celebrating the higher adherence of the O1 series relative to the GPT 40 series to their specs. So, so they see, You know, across all these things that they care about, they have these funny sounding, sort of category names for the, for the, the things are trying to quantify and care about.

20:35

So adherence to the chain of command that we just talked about staying in bounds. So essentially giving responses that that are kind of that are consistent with the, the prompts they've been given seeking the truth together is another one sort of interesting, apparently a big improvement there for a one relative to four. Oh do the best work. So some notion of optimization for the actual correctness of the answer, use appropriate style. So on all those, you see improvements, you actually see.

21:04

A worsening of performance, though, on be approachable. So, oh, one apparently less approachable than four. Oh, that at least tracks my experience. But anyway, kind of, kind of interesting. And you sort of see it all laid out there. So they're letting us know as well. They won't be publishing more blog posts with every update to the model spec. But they do have this log that you can track, which is great for transparency, right?

21:23

You go check it, see, you know, see how they've updated their their meta instructions. And next we have Perplexity AI announcing a new ultra fast model. So they have this new version of their own in house model, Sonar, which is built on top of LLAMA 3. 37b. And they're basically saying that this is much better than other off the shelf models like Cloud 3. 5, Haiku, or GP40 Mini for the particular thing that Perplexity does, which is AI enabled web search.

22:02

It's one of the ways, similar to gpd with a web search to Gemini with a web search, you know, you enter a query, it looks up some websites and the AI then reads those websites and either answers your question or summarizes or whatever. So this one, Sonar is fine tuned for improved factual accuracy and readability. And they are claiming that it can run at 1, 200 tokens per second by using cerebrus hardware their wafer scale engines.

22:39

So, yeah, it's, it's similar to ChatGPT, I guess, in Perplexity, you do have a dropdown of what model you're using. So if you are pro subscriber, you are paying user, you can enable this model. well, it's interesting to see Cerebris getting picked up in this way for this sort of application. It's sort of like two challengers, right? In the space, the search challenger and perplexity, and then the hardware challenger and Cerebris, the design challenger. Kind of cool.

23:08

Yeah, the advantage of going with the kind of cerebrus for the search focused functionality, you can imagine like lower latency. You know, one of the things that wafer scale engine does is it gets rid of the need to for data to jump between different chips, basically, which which reduces latency. And you have advantages in, in memory access and some parallelism things too, potentially. So, I mean, it'd be interesting to kind of unpack more of the, this choice.

23:35

But it, there are reasons that it does make quite a bit of sense that maybe we can go into in a hardware episode 2. 0 whenever that happens. And one last story for this section, we have some updates on features coming to YouTube. So there was a yearly letter from the CEO of YouTube and there were some small announcements that, for instance, there's an audio auto dubbing feature.

24:03

Translating your video with audio in other languages automatically with AI as a YouTube creator, someone who posts videos, you will now be able to use it seemingly the, within the end of this month. And they are saying they'll also invest in tools for detecting and managing AI generated content, like being able to detect, aI likenesses that are being generated, which as we'll get to later, perhaps some actors and other people will appreciate

⁠¶ Applications & Business

24:37

and moving on to applications and business, as promised, we have some more drama between Elon Musk This seems to be like, and every week.

24:46

Kind of situation, the latest development here is that a consortium led by Elon Musk has made an offer of 97. 4 billion to acquire the nonprofit entity controlling open AI, so this is a bit technical, you could say, as open AI is trying to go For profit, they need to tackle their kind of complicated structure where the for profit arm of OpenAI, at least I think this is how it works, is controlled by this nonprofit entity.

25:27

And so the for profit arm needs to, in some sense, buy the nonprofit entity to then be able to transition to for profit.

25:37

So in this case, there's a competing bid essentially for that nonprofit that is in charge of the for profit Altman, as you might expect quickly stated that this is not going to happen, made a funny little snarky tweet to actually offering to buy Twitter for less than 10 billion, but it does seem to be like the latest tactic to at least delay and make it tougher to be able to do this because this is a pretty high bid. It's, it's more than the for profit part of OpenAI is likely to pay themselves.

26:16

Seemingly they're offering like 40 billion. So could be another headache. Yeah. And this is one where, you know, the, the I'm not a lawyer caveat is so important, but from the analysis I've seen. It does seem it doesn't quite interesting. So yeah, you alluded to it. You know, there's this sort of for profit entity that's trying to break itself free of the shackles of the nonprofit, right?

26:37

But the problem is that the nonprofit was essentially it's essentially Like it does have complete control over the for profit, which means that whatever value is contained in the for profit, the nonprofit has that. And then some, right? So how does a for profit it's almost like, you know, a slave buying its own master, buying their own master. Like that, that's sort of roughly what's going on here.

26:59

And it's not clear that that even makes sense because theoretically the slave is the master's property. I'm thinking ancient Rome here. Like, So, so, you know, like how do you even, how do you even work that out? This is a very mangled analogy, but you know, that's roughly the picture here. So for Elon to come in, like one of the key questions is Sam wants to sell the for profit to the nonprofit. He wants to sell it at apparently roughly 40 billion.

27:22

But for profit has to pay fair market value for the nonprofit. And. If there's a competing bid, a concrete competing bid, that's, you know, more than double, then that suggests that, hey, the fair market value is actually quite a bit higher. And it's not obvious that Sam Altman can even put together the amount of money that he'd need to compete with that bid from Elon. So, you know, like you don't just scrounge together a hundred billion dollars in liquid capital to make an investment like this.

27:48

In fact, he struggled to raise, you know, the 100 billion that they have raised so far for Stargate. and that was with a clear kind of build investment that they were orienting towards. So the other thing that I've seen argued is if opening I succeeds in their mission of building transformative, like general artificial, artificial general intelligence, then surely the value of the company.

28:10

Even the expectation value of, the company, the, the, nonprofit is way, way, way, way higher than 40 billion. And that seems super, super reasonable as an assessment. And so there's, I mean, again, not a lawyer, but, but this seems to argue pretty strongly in favor of some kind of consideration for, for the, the fair market value being a lot higher than the 40. Billion dollars that essentially Sam Altman is trying to sell to himself because he was on both boards. That's the other thing.

28:35

There's like this kind of conflict of interest thing going on where he wants to sell to himself. I know what a surprise the sale price is like a lot cheaper than than what I think a reasonable observer would would assess. But you know, all kinds of legal nuances and intricacies that I'm certainly not following. There are all kinds of all kinds of extra layers to this. So Larry Summers he's a, one of the directors on the opening.

28:57

I board says he has not received any formal outreach from Elon Musk and the consortium of investors here related to this offer. But this is weird because. Elon's attorney also says that he did submit the bid to OpenAI's board on Monday, so I don't know how to square the circle, I mean, not receive formal outreach from the consortium, is that maybe slightly different from, I mean, surely it's not, but slightly different from getting letter from his lawyer, I mean, surely he would Qualify.

29:30

I don't know. This is all a big thorny hairball, but the consortium is also quite interesting, right? The consortium that is looking to invest along with Elon in this acquisition. It does include XAI itself, which the idea is it could maybe merge with open AI following a deal, which is interesting. Thanks.

29:48

Kind of hilarious because opening I theoretically has a merge and assist clause in their, in their kind of mission charter thing, whatever, where they say, if, if another competing project that's value aligned, it seems like it's within about two years or so of making AGI, then we will join with them so that we don't have a late stage kind of competitive race. To make this potentially very dangerous technology.

30:10

It would be really funny and I don't think somehow I don't think this is how it's going to shake out, but it'd be really funny if they were sort of forced to merge and assist with XAI except because an offer came in from XAI and not, not out of their own volition. I don't expect that to happen by the way. my guess is that, you know, this somehow just. Doesn't materialize, but Hey, it's 2025. Weird shit happens.

30:34

So one last thing I'll mention there are a bunch of other interesting investors joining XAI in the bid, joining you on in the bid. So you've got Baron capital, Valor equity partners, you've got. Eight V. C. Which is a venture firm that's led by one of the co founders of Palantir Joe Lonsdale. So it's kind of interesting. A lot of a lot of movement here and we'll see.

30:56

Like again, this is where I really wish I was a lawyer so I could understand what the actual implications of this are how much it ties up open AI at the stage. But it's definitely a distraction for all concerned. That's that's for damn sure. Right.

31:10

And for context, right, this is coming pretty soon after a previous, I guess, attack a similar nature where our lawsuits also going on saying that OpenAI shouldn't be able to go for profit in the first place because we're set up as a nonprofit that was again, a lawsuit by Elon Musk here.

31:31

I'm also not a lawyer, but I'll give a little detail that I can, which is typically in corporations, the, you know, they are shareholders and they are shareholders in the nonprofit the companies that have invested like Microsoft and the board of directors of a given company. Has a fiduciary duty to shareholders where they are meant to be making decisions in the best interest of them.

32:01

So as far as I understand it, again, don't trust me on this, but if you are selling your company, you do have a somewhat legal obligation to try and get the best price possible. So that is one reason where this could be a pickle aside from the The board of directors of nonprofit could just want to get a bigger price. And as you said, lots of details to go through, like the nonprofit board was overhauled since Altman was, pushed out a CEO So now it's kind of Altman friendly board all sorts of.

32:45

Drama, I guess, to get into, yeah, I, I do think so again, not a lawyer, but I do think that the the wiggle room that exists for Sam right now is in part from the fact that the nonprofit does not have a fiduciary obligation to its shareholders. So as I understand it, it may be a different legal responsibility.

33:07

They have as a nonprofit to Like sort of, basically you can't fraud, you can't just raise a bunch of money as a non profit and then sell out to a for profit and effectively become a de facto for profit. the assets that the non profit has accumulated, it must keep or it must keep the equivalent value. And and that's where, I mean, such a good point. I forgot that a non profit is actually a non profit. No, but, but, well, but it's like, it's legally a nonprofit, right.

33:35

Which is worth considering because that does have implications. Right. Yeah. Yeah. I mean, like, honestly, my head hurts right now, but, like you're right. And anyway, so at the end of the day, My understanding is part of the argument hinges on no, like 40 billion is fair compensation for the nonprofit. Like there is some argument like that. They have to make that is not even close.

33:55

I mean, look, you've got opening eyes about to raise Whatever it is at like 300 billion valuation from SoftBank, right? That's the latest rumor. So if that's anywhere close to the case, 40 billion is a laughable kind of valuation here for the acquisition. So I don't know how that gets factored in either. That seems to undermine the Sam's claim too. But again, there, I mean, if he's going ahead with this, surely he wouldn't make such a move without.

34:22

Like, and leave himself this open to this kind of, lawfare. I, so I really don't know. But we'll see, we need a lawyer on this podcast. And the next story is about one of the other competitors of open AI on Froplic. And this is some sort of insider information. So not official communications from on Froplic, but there was a report from the information, which. Posts a lot of kind of internal stories that aren't officially disclosed.

34:56

So this report disclosed that Onfropic is working on an upcoming model that is a hybrid model. It can switch between two modes of deep reasoning and fast responses. This does track to, in an interview on Monday, the CEO Onfropic did say that They are generally focused on trying to make their reasoning models differentiated.

35:23

So, and they, yeah, he did say literally that Anthropic is puzzled by the idea of they are normal models and they are reasoning models and they're sort of different from each other. So it seems that they will make this unified model and there will reportedly be a sliding scale alongside the model. To allow developers to control costs, because of course, part of the idea of a reasoning models is they do more reasoning, they do more thinking, which leads to more tokens, which leads to higher cost.

35:55

So that is one of the implications. Of using such a model. So as I kind of previewed tracks very much with the stated roadmap for open AI by Altman. Yeah. It's still no word on pricing. of the things that is missing or that seems to be missing right now is models that have an intuition for when they should apply more compute to a problem kind of within themselves. So rather than having an externally tunable.

36:25

You know, dial that allows you to say, okay, you know, put, put in this, this many flops or this many tokens on average in your response, like having the model actually figured that out for itself. I don't know that that's fully separable from the intent of the prompter. Sometimes you, you know, that like, you want to ask a question at different levels of detail, but there is a dimension of this where some of that could be offloaded the system. That doesn't seem like it's this.

36:47

I'm just kind of noting that that's something that we still haven't seen get dug up yet, but in any case, kind of interesting, this would make anthropic first out the gate with this, this capability. Apparently one of the things this information piece says is that when the anthropic model is allowed to think for the maximum amount of time, when you tune that dial, in other words, all the way to the right it does apparently outperform the Oh three mini model set on high.

37:13

So that would make it nominally if nothing changes in the next, you know, You know, a couple hours the most performant model that is released. They are focused more on the enterprise market. This is the speculation in the article is like, okay, maybe that's part of the reason why they're focusing more on this dial.

37:28

It's putting more effort into features that give developers more control over the cost, the speed, you know, the pricing through the sliding scale approach rather than, you know, opening, I has these kind of three settings, you can go low, medium, high. And. Some people feel that it's hard to predict how many tokens the model is actually going to process with each of those, you know, at each of those levels.

37:48

And so, it's difficult to predict how expensive your query is going to be, you know, here you have Anthropic kind of leaning into giving a bit more control over that. Last little bit of detail that was in this in this report. Apparently the differentiator now for anthropic remains the understanding of complex code bases.

38:09

So one of the things that we've consistently seen and that I've experienced at least is like you know, cloud 3. 5 Sonnet new, just like really, really good at coding better in some ways in some contexts, then you know, Like than any of the kind of opening eye products that you can use any way at a comparable price point. So it's you know, this is apparently going to be going to persist, especially they're orienting towards large complex code basis.

38:35

So really moving towards automation of full on like software engineering, where you're looking at thousands of files making complete like lines of code that work just the first time.

38:46

whereas opening eye kind of is better at, as they put it, more academic problems like competitive programming, which can reflect a little bit of metric hacking too, because there's so much more benchmarks and metrics that focus on like competitive programming than necessarily just the kind of like, well, suite bench, actual programming, right? Yeah. Exactly. Yeah. And Next we have AI chip startup Grok has secured 1. 5 billion in a commitment to invest from Saudi Arabia.

39:19

So Grok, that's Grok with a Q, not Grok with a K. Is a leading competitor to NVIDIA as a hardware provider. They have their own kind of customized hardware solution, also similar to Cerebrus. And they are, have been at it for a while. We covered them getting some funding like October ish of last year. They had a series D round with 640 million while.

39:47

Yeah, now they have this 1. 5 billion investment from Saudi Arabia, which is tracking, tracking with a bunch of patterns of them being able to raise quite a lot of funds they are currently, or they have been valued at 2. 8 billion since August. And also Saudi Arabia being a major investor, as I think you've said last episode, like if you're talking about raising billions of dollars, Saudi Arabia is one of the. Organizations you might go to. Yeah. Sovereign wealth funds are basically it. Right.

40:21

And, and the oil producing nations, the UAE, Saudi Arabia, you know, they just have that money in droves there. They're flush with cash and looking for ways to spend it too on, on technologies that future proof them. Cause they're so dependent on oil. there are reasons to think actually that, that demand Might go up in the future and not down, but still overall not a bad call for them to diversify more right now. They're so leveraged on the oil side. This is also really interesting.

40:45

So first of all, we, we don't actually know what the specifics of this deal are. Right. So it's, it's phrased, at least in the article is has security 1. 5 billion commitment from Saudi Arabia to expand the delivery. Of their chips to the country. They don't say, is this an investment? it reads more like an actual deal than, than a, like a sales deal, some kind of partnership. So unclear if it is an investment, by the way, I mean, knowing what the valuation is would be quite interesting.

41:14

Cause if they just recently raised a 2. 2 0. 8 billion, and their valuation hasn't increased by a ton more than presumably they're giving it away an awful lot of equity in the process, but actually in August. So yeah, anyway, plausibly they could be way up there. Big, big deal though. Also interesting when you're looking at, so this is all about domestic partnership with domestic companies. There is already an arrangement between Grok and Aramco digital. So Aramco is a big Saudi oil company.

41:45

Aramco digital is kind of their, their tech subsidiary. So. This is especially interesting because now we've talked about the two big things that are blocking a lot of countries from building AI fast. One is energy. Saudi Arabia's got that. And then the other is chips. And this grok partnership would be really interesting from that standpoint. So yeah, grok specializes in inference, right? So this is not, these are not training chips. But inference, right?

42:07

So Has become a lot more important recently. And so yeah I guess one to watch. This is a big move into capital and no question there. And speaking of customized AI hardware, the last bit we'll cover is sort of update on a story we've been tracking, which is opening, I planning to build their first AI chip. Custom kind of in house chip design for AI inference.

42:34

So what we are hearing is that OpenAI is collaborating with both TSMC and Broadcom to develop the chip and they are expecting, at least planning for it to be ready by 2026. And we also got the news that they are looking to using AMD chips to train alongside. NVIDIA GPUs, which is primarily what we have been using. So not too much of a detailed update, I guess, but worth noting that they are continuing to push in this direction. Yep. We don't know a ton about the chips, as you said.

43:11

So the, just a couple things we know that the designs are going to be finalized in the next few months. Then there'll be sent to TSMC for fabrication and the mass production is targeted for 2026. We don't know when, but sometime in 2026, these are presumably going to be the chips that hydrate the Stargate data centers, right? So they're moving very much into their own custom, custom designs, and they will apparently use TSMC is three nanometer process.

43:35

So this is going to be like the equivalent of, of say, the Rubin generation with the next generation after the black wells, which are going to use three nanometers as well. So that's, that's where we're headed with this. one thing to note, right. This is an industry trend we're seeing with Amazon, the tranium chips we're seeing obviously Google's had the TPUs forever.

43:54

Microsoft has been working on their like Athena chip line and all that stuff for a long time, I guess started fairly recently, but you know, a long time in AI timelines and now open AI. You know, one of the things that this does do is it can make it harder for these companies to use like third party data center builders. That certainly like co location providers because the, the infrastructure becomes nonstandard, right? Like everybody has different shit going on.

44:19

So not everybody's just using the plain old, like, you know, B two hundreds or the H one hundreds. Now people are using their own custom chip lines, which can come with very distinct, like power density and, and thermal challenges and characteristics kind of cooling infrastructure needed there. And all that.

44:37

And so your, your data center footprint can be quite different, which, which then makes it harder to kind of build general purpose data centers to accommodate all these diverse diverse chip lines. So that's, you know, something that might affect quite a large part of the supply chain actually going forward into, into 2026. And, and those data centers are being built now.

44:54

So. it's definitely already being felt, kind of up and down the stack and we'll keep tracking this really closely because it is a really important part of the scaling story

⁠¶ Projects & Open Source

45:03

and moving into projects and open source. First, we have the company Ziphra announcing and releasing the beta version of a model called Zonos, which is a text to speech model with high fidelity and the ability to do voice cloning. So we're calling this Zonos V 0. 1. So I guess they're planning to build a lot more on this. Trained on approximately 200, 000 hours of speech data with multiple languages, including English, Chinese, Japanese Spanish, and German.

45:42

It can do voice coding with a short speech sample, 5 to 30 seconds. And just. Having been tracking open source models for a while, I think it is quite notable because this is a type of model that is a harder to develop because text to speech in general, you know, there aren't as many data sets for it that are publicly accessible, much harder to train a model like this than a large language models, least if you want to train a model. Foundation model that is quite big. So they are releasing this.

46:20

You can try it on their platform. They have kind of a Ziphra playground where you can also use their on custom chat bot as well called Zamba, which looking back, I think we covered Zamba whenever that happened. I guess we'll see. also seems like this is meant to be for Used for, or possible to be used for real time applications running on a GPU.

46:46

So feels like we're developing this in an effort to kind of catch up and, and introduce the ability to do audio chat with a chat bot, similar to what you have in chat GPT and Gemini now. Yeah, it is quite interesting. Apparently they're trained. So the differentiator here relative to other, other products and strategies models in this area is the simplicity of the approach. So it's a simple autoregression task. All they're doing is predicting a sequence of audio tokens, given text.

47:22

And audio tokens as the input. That's it. So basically, like, roughly like text autocomplete, but for audio rather than using a more sort of structured approach, which is what other text to speech systems have historically involved. So you're like a common thing you might do is first convert text into So I think that's it. Some like kind of hard coded, predefined features.

47:45

So thinking here about like spectrograms that kind of tell you how energy is distributed among different frequencies in, in whatever is going to come out you know, the duration and so on, like these kinds of characteristics of the, the output. And then at a second stage, converting those features into the actual waveforms. And so we've, we've seen that with architectures like fast speech and, and tacotron, they're kind of like famous Approaches that use the strategy. This is not doing that.

48:11

It's just a very simple kind of like go chug along and do your, do your auto complete on the audio and, and see what you get. And it just does perform really well. This would be seemingly another example of the bitter lesson, right? This is just a simpler strategy with fewer inductive priors, and it works better once you reach a certain level of scale two phases to their training. First phase is, you know, They're using their pre training with just just text prefix and speaker embedding.

48:37

So, you know, who's actually speaking. And the second phase is they add some, some additional kind of conditioning inputs some, some constraints and and upweighted higher quality data. So. Two stage process. Not not actually that different from what you see with standard pre training for language models. Start with your general purpose data and then your higher quality data. Introduce it later on in the process. Right?

49:00

Don't don't kind of like waste your really pristine high quality data on training the model when all it's really looking to learn is the basic rules of grammar and syntax. Wait for it to master those basic rules and then later on the kind of more refined, high quality texts so that it learns the facts contained in that text disproportionately. Yeah. So interesting model, you know, we'll see it come out and enter the open source world.

49:23

And that's its own interesting thing from a malicious use standpoint, right? You have these really good text to speech models that are open source that can be used out of the box really easily, but also also modified. So that's, that's pretty interesting. Last thing, apparently, so 200 to 300 milliseconds of latency with 4090. So, so basically like, you know, pretty, pretty cheap in relative terms, non data center GPU. So that's quite something 300 milliseconds.

49:51

That's, that's what you need to have a pretty fluid conversation. So this is Pretty solid model, right? And speaking of a model, a couple more details, they have two versions of it, both at 1. 6 billion parameter models. They have a transformer variant and also an SSM hybrid model that has both recurrence and attention. And I do believe Zamba also had that. They are releasing this under the Apache two license, meaning that it's not very commercial companies can use it.

50:24

And of course, researchers and so on can as well. You can kind of do whatever you want. So, yeah, I haven't tracked this super closely, but it does feel like a fairly significant entry. Like we have a ton of chatbots that are open source and permissive, but not too many text to speech models. And. In addition to having the model out there, they are also hosting it and you can pay for the API. They are charging 0. 02 cents per minute, and also have a monthly subscription option.

50:59

So we're going to be trying to compete with 11 labs. They're interesting from a business side, but potentially, because 11 labs is very much a leading player. And next up, we have a release actually from some universities and not from a company called gemstones, a model suite for multifaceted scaling laws. So it's a paper. That comes with this suite of models that is meant to enable the study of model design and selection on scaling laws.

51:31

So they're open sourcing over 4, 000 model checkpoints trained on more than 10 trillion tokens. And then they make that possible to see depending on things like model width, model depth. Depth, what happens to scaling laws. And they are, you know, a bunch of findings here that, are pretty significant, I suppose, from a model design perspective, where depending on what the shape of your model, so to speak, is the scaling law you should expect will not be always the same.

52:13

So there's kind of a specific amount of width and depth and other parameters to go with to be optimal. Yeah, it's actually, it's quite interesting. It builds on, we've had hints that this stuff is, is true, but this is the first time that we're getting kind of really nice, clear hints. Kind of quantifying at least in the, in the public domain quantifying these trends.

52:38

So you know, for example, what they find is that the optimal width to depth ratios of a model actually increase with model size relatively slowly, but they do increase with model size. So in other words, you want, as you scale up, instead of having More depth, like, like more stacked layers, you're actually gonna want to take that same number of parameters and instead make the model a bit wider. So make the individual layers wider and have fewer of them.

53:06

That's something that, you know, back in the day, like way back in the day, thinking like 2015 or whatever, whenever, you know, Google net or whatever was coming out. the, the sort of intuition people were developing, and you saw this play out with the ResNet models as well, is like, you just want to have as many layers as you can. Go deeper, not wider.

53:23

This is now saying, well, hold on a minute if you want to do this in a compute optimal way, if you want to get the most bang for your flop then what you need to do is have a, a kind of wider model than you might otherwise think, especially as you scale up. It is a relatively modest effect, so about 2. 5 fold increase in the optimal. Width to depth ratio, despite about a million fold increase in computer budget. So increase your, your compute budget by a million by a factor of a million.

53:50

And you're going to see about a two and a half X increase to the optimal width to depth ratio. But the impact of, of not following that, that optimal ratio is non trivial. So if you have sort of skinny models and you're, you're With depth ratio is too low, you waste around half your flops compared to a more optimal architecture to hit the same loss value. So you know that that is something right?

54:15

50 percent of compute, especially when you're talking about multi billion dollar training runs as we're about to get into in the next beat. That's that's quite significant. and actually the impact is even more dramatic for the actual cool. Time the wall clock time that it takes to train your models. So there's the cost and flops, it'll cost you about half your flops. But in terms of wall clock time, it could be even more significant.

54:36

They're citing anywhere from roughly 200 to 300 percent more GPU hours compared to more optimal architectures. They do caveat, by the way, all these findings by saying they only used one form of parallelism tensor parallelism in their training setup.

54:51

So parallelism is when you, you slice up your model literally like the layers themselves put a little chunk of a layer in one GPU, a little chunk of that layer, another GPU and another rather than having say like you're, you know, full layers in each GPU or or data parallelism. That's kind of compounding this. Usually the pipeline parallelism, the data parallelism, tensor parallelism, they all work together. All they're using here is tensor parallelism in their setup.

55:17

And so they're, they're caveating that, you know, it may not generalize when you take other forms of parallelism into account. one result they also find is overtraining. So this is a theme we've seen more and more, right? Is if you actually want to get the best performance out of your model, serve, like on a per flop basis on a per compute basis, and you don't care how big your model gets, then you want to grow the size of the model with the compute budget, right?

55:41

There's kind of a scaling law that applies there. And that's the Kaplan scaling law. Sorry, not the Kaplan. That was the original open AI one. There's a Hoffman scaling law, the so called chinchilla scaling law tells us. Right. How do you, how to do that optimally? But it turns out that you actually don't want your model to grow too much for a whole host of reasons. Quite often, right? If your model gets too big, then actually running inference on it gets really, really expensive.

56:05

And so, so in practice, what people do is they don't grow the model. They just pump in more. More compute and that results in what's known as an overtrained model. We've talked about that quite a bit on the podcast makes inference a lot cheaper and you get to amortize that inference cost across a whole bunch of users. Now what they find is that overtraining, so training longer than theoretically optimal for your model size is actually pretty good. Pretty efficient.

56:29

It results in only small kind of performance drops relative to the compute optimal case, which is encouraging for the current paradigm. Basically, the larger the compute budget, the more robust you are from slight deviations from the compute optimal model size. So quite an interesting paper.

56:45

Yeah, there's, you know, they do look at a whole range of different models from from 50 million to 2 billion parameters with a whole bunch of, you know, depths, widths, and they look at different training schedules and stuff. So the first, I guess, model zoo catalog of models that really lets people get their hands dirty and just study, you know, how different architecture choices affect scaling. This is, by the way, the kind of paper that you should expect Yeah.

57:08

Has been a general knowledge in the private labs for a long time. But this is again, the first time we're seeing it in the, in the public domain. And just one more story we'll quickly cover and move on to research. There are, A group of researchers have introduced Hephaestus, which is a large language model designed to improve agent capabilities for continual pre training.

57:35

So this is trained specifically on a large data set of tool documentation, which is made for LLM pre training with API function calls. So they have this, Battleset, Hephaestus, Forge, which adds in all the stuff, tool documentation, function, calling code function calling data, code, text data, to the training recipe so that you can pre train an LLM with agent. Type capabilities, which often involves calling functions out of a gate.

58:12

And so they train a model variant to have faced this eight B, and then they show that at kind of that range, it is able to work well

⁠¶ Research & Advancements

58:24

and onto research and advancements, we begin with a paper titled model tampering attacks, enable more rigorous evaluations of LLM capability. So typically when we have things research on model alignment, for instance, what happens is there are various prompts that you give a model and see if it does what it's supposed to. All right, so you try to elicit some behavior of it, which is what they call an input space attack in this paper.

58:59

What we are focusing here is another type of approach under the family of model tempering, which is when you can actually mess with a model internally. So there are some examples you can do a latent space attack, which is perturbing the, hidden neurons at kind of inference time. There's also what they're calling a weight space attack, which is when you can fine tune, you can train the model to, for instance, forget some rules That it's supposed to follow.

59:32

So they are building off a bunch of research. They're looking at some defense mechanisms for unlearning methods jailbreak refusal, tuning. They also have attack models and they're citing a whole bunch of papers from 2023, 2024, where people have introduced different ways to do this. And so they kind of try out all these variations and have some interesting conclusions as to the sorts of insights you can get about a given model by studying the success of model tampering attacks.

01:00:14

Yeah, we've seen a lot of results like this. I mean, this isn't a shocking or surprising paper argument. You know, very consistent with other things that we've covered before. The one thing I thought was quite interesting to, to highlight and that I haven't seen done elsewhere is they, they do this interesting kind of PCA.

01:00:33

Analysis. And so I just want to, maybe if you're not familiar with like dimensionality reduction stuff, so roughly speaking PCA is, you know, if you have a huge spreadsheet of different very high dimensional data, let's say you want to reduce that data down to just a two dimensional form. So you can actually look at it on a plot and ideally you want to do that in a way that preserves, let's say the. the closeness of points.

01:00:59

So if points are close together in that high dimensional space, you want them to be close together visually on that low dimensional space. So your, your visualization is somehow meaningful, preserve some of the meaning in there. In fairness, what I've just described is more like TSNE. but PCA kind of from an intuition standpoint, that's roughly what's going on here.

01:01:17

So what they do here is they look at spreadsheet of a whole bunch of attacks and a whole bunch of defenses, and they look at the attack success rates on some bio attack benchmark. And so, so for every, you know, attack and, and defense strategy, you have a score telling you, okay, well, if you use this, this defense strategy and you use this attack, here's how often the attack will succeed, right? So you have that matrix and they're actually going to do.

01:01:45

Principle components analysis PCA to see like how much of the variance in that data set can we compress down to just three dimensions rather than the 11 that they have here. And what they find is three principal components. Three dimensions explain 89 percent of the variance in the data set.

01:02:01

In other words, they're able to, roughly speaking, capture 89 percent of information contained in that spreadsheet with all those attacks and all those defenses by retaining just three, Dimensions worth of data.

01:02:12

So this suggests that even though a lot of these attacks seem to affect different mechanisms or seem to kind of apply to different work in different ways, let's say in reality, their function, their success is owed to It's about three dimensions of sort of like features or, or almost the physics of the model. There's really just three things, if you will, that you need to track in order to explain this behavior across all these 11 different attacks, which is sort of encouraging, right?

01:02:43

Knowing how well a model resisted just a small number of attacks could give you strong predictive power about how it would resist other attacks. That's sort of the, the opportunity that's implied in that, that result. So kind of interesting and, maybe cause for some optimism because it means that latently, like fundamentally the problem of defending against these, these attacks might actually be a bit simpler than, protecting against 11 different attacks.

01:03:08

Maybe it's more like, no, you just need to find three more fundamental. Like principles, let's say that are being leveraged by these attacks. and that'll give you, you know, good coverage, but anyway, I hadn't, hadn't seen that argued before. Right. And I guess to highlight the. Ultimate focus of a study is not just on the attacks themselves. It's about evaluations of LM capabilities. So if you're developing a model, can you then, get a sense of how safe it is, can you evaluate it?

01:03:41

And so to that point of the PCA, one of the finding of errors is. fine tuning attack success is empirically able to predict the upper bound of input space attack success. So if you're doing Laura fine tuning methods based on how well that goes, you can predict how well just bad prompts will work as well. So generally introducing some insights as to being able to predict Things like what prompts are likely to work how vulnerable is your model to various types of attacks. On to the next story.

01:04:26

And we again, talk about distillation law insights. This one a bit less focused on research artifacts, which is why it's in the research section. The title of the paper is distillation scaling laws. So this is looking at In the setting where you want to do distillation, meaning you have a teacher model, that's a very presumably big model that is very capable.

01:04:53

And you have a student model, which is a smaller model that you are training from the teacher model to get as much of its capability as possible while costing less to do inference with. So what they are providing here is a compute optimal distillation strategy where you already have a teacher model. And when you want to train a teacher model and then distill it, so that gets a bit tricky, right?

01:05:20

Because when you're doing both the teacher model and the student model, then your scaling law gets a bit weird because you can allocate your compute budget for training the teacher model more and the student model less, or you can You know, really trained the student model more and the teacher model less. So it gets kind of tricky. And their ultimate conclusion is that there is a bit of difference.

01:05:46

So if your student size is small and the amount of compute you can use a small, then you mostly want to go with teacher pre training. If you have more compute but still a small student size, you want to kind of evenly divide between the student training and the allocation of teacher inference for creating the data set, less teacher pre training, and then it gets even different for a large student size, small compute budget, large student size, large compute budget.

01:06:21

So as you might expect, also in the paper, you get a bunch of plots. They are Showing you can get a pretty fit to the data across different allocations of budget and model size. Yeah, it is, it is pretty interesting, not just empirically, but also sort of from the theoretical standpoint. They come up with these fairly, I don't know, fairly elegant.

01:06:44

Maybe fairly, they are a bit messy, but expressions anyway for the calculation of the student cross entropy, basically the, the loss that the student model can achieve, and they managed to separate it into a term that. is just like the teacher cross entropy plus a term that specifies the student's ability to mimic the teacher.

01:07:05

And so if you want to minimize the student cross entropy, in other words, if you want to get a student that works really well, that performs well, you have to both minimize the teacher cross entropy. Make sense. Make the teacher kind of smarter. But then also you want to improve the student's ability to mimic the teacher.

01:07:21

And they're codifying that explicitly in the math in a way that makes it very easy to tell what it would take to kind of improve, for example, the student's ability to mimic the teacher and how all those things get traded off. The, you know, the variables that you have control over here are the size of the student, the number of tokens, the that the student is trained on that are derived from the teacher model.

01:07:44

Then there's the size of the teacher and then the number of tokens that the teacher model is, is originally trained on. So those four variables go into, to this equation in ways that are really easy to follow. But kind of quite detailed. The other thing that they find is, so The teacher models influence on the student is entirely based on the teacher's cross entropy.

01:08:07

So in other words it doesn't matter like if the teacher is really like large if it has a lot of parameters or if it has, or if it's trained on, on a lot of data or little data. All that matters, as far as the student is concerned, is how well will the teacher perform on an entropy basis? And once that's kind of settled that determines the performance of the student model, at least component that is dependent on the teacher. Kind of makes sense, right?

01:08:33

Teacher performance is the only important aspect of the teacher that determines student performance, but they do kind of demonstrate that. So yeah, bunch of interesting scaling plots across the board. It's again, one of these problems that has certainly been solved in the Frontier Labs, or at least you would expect given the, the amount of money that they're investing in these training runs, you know, multi billion dollars certainly hundreds of millions of dollars already today.

01:08:56

You know, there's a team of full time people working on these scaling laws internally, but again, interesting to see this play out in the public. And worth noting, this is a paper from Apple. Which we've seen some research come out of them, but not too often. So interesting. And also tracks with Apple's seeming strategy of focusing on smaller models and not large models and trying to crack sort of the training recipe.

01:09:23

Some of their other research has also been on kind of a training recipe for different types of models, including I think vision language models is another one they published. So. As you said, various insights here.

01:09:35

One other one that I think is worth noting is when teacher training is included in compute, the best student cross entropy is always higher than in a supervised setting, meaning that, If you only care about how well your smaller student model performs, it's generally better to just go ahead and train it from scratch and not train a larger model and distill it, which I suppose is maybe intuitive, but still one of these insights that's good to know.

01:10:06

And moving on to the lightning round, we'll try to keep it a bit shorter with this next couple of papers. First, we have Matryoshka Quantization, a paper coming out of I'm so glad you had to say that, not me. Oh, man, that's great. Because now I can just say like this approach, blah, blah, blah, every time I say it. They do say, they call it math quant. Which is I guess it's ModQuant, technically, but anyway, it's, you can go ahead and not say much for us.

01:10:37

So the idea here is that, you know, typically you have different quantization amounts. You can do, you can do int eight, you know, int four, int six, this is the resolution at which You store your weights. So the lower resolution, the fewer number of values they can take on, which makes it much less expensive to do operations of them, to do multiplications, et cetera.

01:11:05

You know, they're kind of smaller and simpler, but it also lowers performant because each weight Can take on fewer numbers of values. And the idea of this paper is that you can train, a model at multiple levels of precision at once. So the Weights can be, combined for, you know, let's say int2. There's a nested structure of integer data types, which allows a single model to operate at multiple precision. Levels by sharing the most significant bits.

01:11:48

So you're sharing the last couple of bits of the weight, meaning that you can basically scale down the amount of compute without training multiple different models and multiple quantization methods. I think that's right. But Jeremy, I'll also let you jump in and explain. At this point, does it matter if it's right? Cause you nailed the pronunciation of the quantization method, which I think is, is the most important thing. No, I mean, this is, so this belongs to category.

01:12:20

So I was, lucky enough to have a chat with Nathan LeBenz from the cognitive revolution podcast, which by the way worth checking out. He does some great deep dives on, on and interviews and stuff. But one of the things he. Said we were talking about streaming de loco and that was a model that we covered last week.

01:12:37

And he was saying like, I'm so surprised that Google published that like that they, cause like it is, this is one of those like really kind of important from a policy standpoint, you know, the decentralized training and all that, but it's also the kind of paper that is secret saucy in some sense. Like that's a big differentiator for Google. Now, in fairness, Google is known to kind of ship.

01:12:58

Publications well after they've actually been incorporated and they, when they've probably moved on to the next paradigm. But still, this paper falls in that category. And it is a paper from Google DeepMind. allows for a bit of a phase transition in the, the way training is set up. So normally when you train a a distilled model, you start off by training it. In the kind of full resolution. I don't know, like FP 32 or FP 16. And then you quantize it.

01:13:24

You basically take that full resolution model and then you, you coarse grain, essentially all the weights, you lower the representation accuracy. and that doesn't work great because the model was never trained to perform At int eight, or, you know, like with eight bits of integer representation or four or two bits. And so you're kind of just taking, I mean, it's sort of like, you can imagine like taking a Picasso painting and then just pixelating it.

01:13:47

Like if Picasso had been painting with pixels, he probably would have chose made slightly different decisions as he painted that, that pixelated painting. so just taking his original full resolution picture and then pixelating it kind of like corrupts it a bit, you get worse performance. That's also why often you might distill into a student model that is quantized. That's one of the applications of distillation is for lower quantization levels, I think. Absolutely. Absolutely.

01:14:15

And yeah, exactly. In practice, that's how it works. but you're still kind of, yeah, you, so you're, you're, you're left with this challenge either way you do it. Your alternative. Yeah, is to kind of do that teacher student thing. If you do that, though, then you've got to, like, retrain separate 842 bit models and whatever else. Right? So you got to do that distillation process independently many times. So the question here is going to be, can we do one?

01:14:39

back prop and improve the models performance across all bit representations at once. That's what's going to happen here. And so they're essentially going to have to very kind of roughly sketch this out. You do Ford prop and they're going to, Log like representations at different bit resolutions, 84 and two, for example, they're going to calculate the loss from each of those and average it together.

01:15:03

So essentially, you're calculating the average performance of the model across all those representations. And when you do back prop, when you adjust the model weights, the parameter values, you're doing it to optimize for that averaged value. Which is kind of interesting. Like it sort of makes it so that you're forcing the model to be good at all those things at the same time. Weirdly, this leads to improve performance across the board.

01:15:23

And in particular, the, for example, into the two bit integer representation, versions of the model, after you do that, you can just do what you said. You can just pick out most significant digits and, Toss out the rest that'll give you your, lower bit with representation of the model. So basically cast aside everything, but the first two bits, and now you've got the two bit version of the model instead Cast aside everything but the first four bits of each parameter value.

01:15:48

Now you've got the four bit version of the model and so on. So it's, it's the same model, but you just like, you do some shit to it. That's very computationally cheap. Like you're just throwing out data and now you've got a more quantized version of it, but the model was trained at least to some degree to perform well at that quantization.

01:16:04

And the weird thing is that the, for example, the int two version of the model that you get out of this is better than a purpose trained int model and their hypothesis here is that there's something kind of regularization wise going on here that you're sort of forcing the model by forcing it to perform well across all those bit representations at once. You're forcing the model to have really robust and good representations.

01:16:29

of the input data internally that that are, yeah, that are anyway, more robust than otherwise, right? The model, if you train it at 16 bit, like can can sort of overfit in some sense to the 16 bit representation, get really, really, really good at that. But kind of leans on that represents your over optimizes that representation. Whereas there's some fundamental sense in which concepts that a model captures is.

01:16:53

Should kind of be independent to or pretty robust to the nature of the representation that the number of integers you use to represent them. So anyway really interesting interesting paper. They do get significantly better performance for into. It's like 10 percent more accurate than standard methods. I thought that was really cool and sort of counterintuitive.

01:17:12

worth acknowledging also, there's a bunch of research in the area, some of which we've covered so lots of previous findings on things like training your model to be able to distill it well, or sorry, quantize it well quantization aware training. Is a big topic, this is building on various ideas from there. But I do think the idea of like training your model at multiple resolutions at once, such as weights actually work well across resolution, this Matryoshka approach is, it's pretty cool.

01:17:47

Last story here is from Epoch. AI, which we often have covered in the last couple of months, we have a new bit of research on how much AI compute exists globally and how rapidly is it growing? So this is based on an estimation of. The number of shipped NVIDIA GPUs based on NVIDIA's reported revenue. So they, you know, of course, NVIDIA says how many chips they sell roughly. And there's some assumptions here based on the types of chips being sold, et cetera.

01:18:30

But, the final conclusion is that the stock of computing power from many NVIDIA chips is doubling every 10 months, meaning that the total installed NVIDIA computing power is doubling, you know, every, less than a year. And the total amount of compute globally available to do inference with has therefore been growing exponentially for the last few years.

01:18:58

We'll see how that, if that keeps going this one isn't including TPUs or other specialized AI accelerators by the way, because there's less data for it, we don't have a pattern of how much TPUs are deployed, but either way. Showing that as I guess we've known on a viable level, the amount of investment happening in acquiring compute has been growing like crazy over the last few years.

01:19:25

Yeah, one of the take homes here following the space is just like how incredibly consistent the exponentials are. Right. Like if you look at the curves here, they are smooth. They are robust. And we're looking here at the curves production of flops from one AI hardware design company. that's quite interesting, right? I mean, everything is exponential is all the way down. All, all the world's most important processes really fundamentally are.

01:19:50

And, and you know, and maybe that's an overstatement, but not by that much. And so one of the interesting things with this figure too, is they do show the relative. GPU computing power by GPU generation. So you're able to see what fraction of the total flops on the market right now come from, for example, the, the hopper series. So like the H 200, H 100 DGX, H 100 type systems, the Ampere series. So the a 100, for example, and then the Volta, the V 100, the Pascal and all that stuff.

01:20:20

So you're able to see how, like, as one generation of chip comes out, it gradually. Takes over the lion's share very quickly of flops on the market and becomes the main driver for that exponential growth. And so as of right now, Hopper GPUs already account for 75 percent of the flops on the market and the rest pretty much are just the the a one hundreds. So the a one hundreds, which right, that's what GPT four was trained on.

01:20:46

That's the kind of for, for a long time was really the GPU to care about now are basically an afterthought. And of course the H 100s and H 200s, that series is going to be phased out with the the black walls coming online as well, but all really kind of interesting interesting stuff. One number that they share, it's also sort of interesting. They estimate only about 7 percent of all installed computing power has depreciated due to hardware failures.

01:21:13

they say it could be as high as 27%, but their estimate is 7%. which suggests that, yeah, once on the market, that compute is good. It, you know, it tends not to, to degrade. And so you're more or less seeing just the, the raw number of GPUs impacting the results here.

⁠¶ Policy & Safety

01:21:31

Onto policy and safety. We begin on a summit that happened in Paris where a lot of, I guess, big AI figures happened this was meant to be Summit on AI safety, the news story we are covering is that both the U. S. and U. K. refused to sign the summit declaration on AI safety, and that declaration was about having inclusive and sustainable AI, meant that For everyone to come together, the declaration

01:22:05

was supported by 60 other countries, including France, China, India, Japan, and Canada, but not by the U S and the UK, there are a couple of other stories coming out of this summit as well. The U S was there, gave a speech saying that the U S is going to be a leader and criticizing Europe's excessive regulation and cautioning against China, so stuff like that. I guess a broad story is here is that, that the summit happened. There were a lot of representatives from various organizations.

01:22:41

We don't want to go too deep into it, but kind of follows on to previous summits that have been going on for the last couple of years. Yeah. And this is, one thing where like, I don't know why people choose to put language in this, like, like inclusive and sustainable. Right? So if you want to be divisive, you use language like that in, in this world, because obviously, as everyone knows, like inclusive is a politically loaded term. Now, it does not mean what is on the box, right?

01:23:10

Like inclusive is a very specific, you know, you're talking diversity, equity, and inclusion in that sense. And that obviously is not aligned with. The the sort of preferences and position of this administration. what I mean is the kind of politicized version of that. and there's a lot of controversy over whether that's actually a good thing as implemented at, you know, in practice.

01:23:29

And so when you include that in that kind of language in the declaration, like, You're just going to make it harder for some of the most important players, the US and UK to get on board with something like this. So it just seems like a pretty predictable failure of language drafting for something like this. And I missed opportunity to get a little bit of better alignment. on some actual concrete problems. Interesting speech, by the way, by J. D. Vance at the podium at the summit.

01:23:55

He kind of laid out to the administration's thinking on AI, at least their first crack at it right now, which is very much more focused on the opportunity than the kind of the risk set. He did end up, you know, closed by saying, look, the other, they're legitimate safety risks. This is not to say that. All safety risks are to be discarded, but focus matters. And this is such a good point.

01:24:15

And one of the challenges that some people have, called out with, for example, the Biden administration's like all inclusive executive order, famous executive order on AI, 110 pages, Where, you know, there's sort of something for everyone in there. There, there's stuff for, you know, kind of like labor law and rights, bias and ethics and, and all this stuff. And then there's stuff about WMD and, and compute threshold reporting for large scale training runs.

01:24:38

And, and that was one of the executive orders that was repealed. Sort of a, a sort of similar vibe here where they're saying, look, like We have to be able to focus. We're going to focus on the opportunities. There's obviously going to be some risks that come along, but that's not going to be the thing that we choose to emphasize, on where you, you fall on these issues, the, the important thing is you have enough focus, obviously for the WMD risk set, which obviously is, is very real.

01:25:01

And I don't think anybody's saying otherwise, but that was an interesting change in tone. and part of what, you know, makes you look at this language, like inclusive and sustainable and like. are you actually trying to do? You know, do you want to make a political statement or do you want to actually get countries to align on a policy? However you fall on that issue, it just seems like a bit of a missed opportunity there.

01:25:21

But certainly, by the way, France leaning into the kind of acceleration is camp with that announcement that came along with the summit of about 100 billion euros worth of infrastructure investment in AI, which It's a bigger move than certainly we've seen other countries make. And with France being the nuclear powerhouse that it is kind of makes them an interesting player in the, in the whole space. Next up, we have a piece of research deeply related to safety.

01:25:49

The title of the paper is utility engineering, analyzing and controlling emergent value system AI's. So there's the idea proposed in this paper that as you train a model, there are values that kind of arise within the model that happens perhaps without you even realizing it. And what they suggest is you can monitor and adjust the utility functions of the AI system to prevent the emergence of undesirable value systems.

01:26:27

And Jeremy, I'm sure you have done a deep dive on this one, so I'll let you get into the details. Yeah, I have like maybe more notes than I should. I thought it was a really interesting paper. It is another one by Dan Hendricks, who's put out a lot of interesting safety stuff. Circuit breakers, representation engineering, we've talked about on the podcast before. this is really interesting.

01:26:46

So take a language model and ask yourself, like, in a sense, Without anthropomorphizing, but what does this model care about? What does it value, right? Are there latently in this model consistent values? Will I find, for example, that it tends to value human life over artificial life? Will I tend, will I find that it tends to value a life of one nationality over, over another? Things like that. How would we dig that up? How would we demonstrate that that is a consistent pattern, right?

01:27:16

Especially given all the, the variation that comes with prompting, you know, subtle prompt changes can affect outputs a lot. So they set up what they call preference cycles. And I just want to introduce this idea really quickly. It's quite simple. So if you prefer option A over option B and you prefer option B over option C, then you should prefer option A over option C. Pretty, pretty straight forward. There are some cases when language models will break that cycle.

01:27:44

Basically they will express a, preference that violates transitivity, essentially this, the circular nature of that, that cycle loop. So a should be more, more valuable than C if. It's more valuable than B and B is more valuable than C. The models will sometimes tell you, no, no, no, actually I prefer C to A, for example, in that setting. They find those so called preference cycles get less common as models scale. So they drop below 1 percent for the largest LLMs.

01:28:10

In other words, the models get more and more consistent in terms of their stated preferences. Interesting. Interesting. Right. So you starting to get with scale, the emergence of maybe more kind of calcified or, or at least more well structured and defined preferences that are less are less incoherent or more in other words, coherent. Okay. So that's one little piece of data that they service.

01:28:33

The other is they're going to try to see how well a particular model of utility, Applies to language models. And so there's this notion of Thurstonian utility, and this is essentially a model of preferences where you assume that your utility that in other words, the value that you assign to a thing or an option or an object is going to be normally distributed. In other words, there's some noise. It's not just a fixed number. You don't just like you like lollipops 10 out of 10.

01:29:06

Like, that's not a thing. You generally like lollipops between like a nine and an 11, something like that. There's some spread, right? So, so at any given time, if I ask you how much do you like lollipops? you may tell me, you know, at 9. 8, you may say 0. 1, something like that.

01:29:22

But you know, there's some spread though, though it's clustered around some core values, some mean, and when you're trying to compare two options, like option X and option Y, basically you're going to look at like, the overlap between those spreads. And you know, sometimes just by chance your preference for lollipops will be higher than your preference for for sushi and vice versa.

01:29:46

If, those utilities overlap, if the distributions overlap, and if they don't, then quite consistently, you'll prefer one over the other, right? This is the model that they're going to use to assess. whether language models have consistent preferences, they allow for some uncertainty, some some noise in the system.

01:30:03

but fundamentally they're interested in seeing, you know, do coherent preferences emerge with scale in other words do these models tend to behave in ways that are Thurstonian in the sense They test this, they find in fact that they do, the larger the scale of the training run, the more Thurstonian they seem. So you're able to quite clearly resolve resolve preferences. They also use linear probes to predict the mean and spread and standard deviation of preferences for different things.

01:30:31

So they'll feed the model and input like you receive a kayak and they would see if they can predict based on the activations of, of the models neurons, like. What the mean and spread would be associated with with that. So essentially trying to probe at the underlying utility, not just the behavior, but the underlying utility that is implicitly being assigned by the model in the Thurstonian sense all these possibilities.

01:30:56

And so it suggests actually that yes, there are these like These utility like representations that are encoded in model activations, that that seems very clear. And they, anyway, they, they come up with ways to, like, try to steer the behavior of the models to affect those utilities rather than just the, the kind of stated outputs of the model. And anyway, the details get, Pretty detailed. But the one thing I'll highlight is some of these utilities are, are actually quite consistently not good.

01:31:27

they're quite consistently weird. So for example, GPT 4. 0 consistently would trade 10 us lives for one Japanese life. At least that's what its utility preference or utility values seem to indicate. you know, GPD 4. 0 also valued its own well being above that of a middle class American. It valued AI agents well being above some humans. And they also found that most language models clustered together in a political space.

01:31:52

So basically what they did was they, instantiated a simulation of Republicans and Democrats. So they had GPD 4. 0 do an impression of, you know, what would Elizabeth Warren do? What utility would she assign to a kayak to a you know, to an apple and so on and so forth. And then they would compare that to the base model without that prompting. And they found what they described as consistent left leaning biases and policy preferences. And they kind of map it out in a lower dimensional space.

01:32:21

You can actually visualize it on the chart. So really interesting. You know, kind of consistent with some of the complaints that we've, we've heard from various people on this stuff. Unclear where that comes from, obviously, because data is data training data is training data. So it doesn't have to be intentionally collected in any given way to lead to this sort of result. It's also really interesting.

01:32:42

You know, this methodology, it's unclear how, how closely this tracks reality, but it is an interesting indication nonetheless, and a great set of visualizations too. Right, exactly. And it is worth noting that this is sort of under the presumption that, if you're Prompting to see preference from a given model, the system prompt, you know, the internal details of how you're serving model can all affect it.

01:33:07

So these sort of aren't necessarily persistent, I guess, like, yes, they're built into the weights and with something like Llama you can be very clear with your system prompt, but when you're saying, you know, the values of GP4, Oh, whatever, I could easily tweak that by just changing a prompt and suddenly things are different. Nevertheless, it is interesting.

01:33:29

As you said, they show this I think on Twitter, this kind of got a lot of play figure 16, where they have a plot of which lives are valued more across countries. They also have. There is a right answer, by the way. Yeah. Yeah. The United States, apparently. And similarly, they also have a plot for specific individuals, including Joe Biden as kind of a neutral. Apparently Joe Biden matters much more than Vladimir Putin and Donald Trump. So, as you said.

01:33:59

Seems like a bit of a left leaning situation there. And, last bit I'll note is they also show a convergence of value system as you scale up across different models. Seemingly or their kind of guess here is that that's just based on training on the same data. And because you're training on the entire internet, right? It doesn't seem unlikely that you sort of converge to similar things.

01:34:26

And we've seen that also with other things like Representations of different models also converging as you scale up. So interesting to note sort of that pattern there. Next story is also related to Dan Hendricks, actually in a weird way. Dan Hendricks is an advisor of XAI. And so he was the one I think to post on X noting the release of the draft of XAI's management framework.

01:34:59

So Anthropic has their own very, very detailed I forget what they call it, but they have policies with regards to safety, RSPs and they talk about how they detect, you know, unsafe things, what they check for, well, now XAI has released publicly, or at least it's on their website. I don't know what they publicized it.

01:35:22

This draft document, eight pages going into what they'll be testing for benchmarking things like cyber warfare biological, ability to make chemical and biological weapons, WMDs, that kind of thing. They go into which benchmarks will go for what thresholds they will check for in terms of, it being dangerous and then various other things in this document. So. you know, in line with their commitment previously to releasing something like this.

01:35:54

And they do say that the actual document will be released in the coming months since this is just a draft. Yeah. It's an interesting combination of, of concreteness and transparency, even though it is a short document, it is kind of straight to the point, which. It's a bit refreshing if you're used to like reading these like very long policy documents. yeah, they do have a section on loss of control. Quite interesting. You know, all the usual stuff you'd expect in weaponization, right?

01:36:20

Looking at a very bio heavy and also some cyber. And the WMDP. Like weapon of mass destruction proxy benchmark, which kind of is a bit of a catch all they talk about the, some of the mitigations they'll use for weaponization. So refusal training, which everybody uses circuit breakers, right? That's Dan Hendricks is thing or at least he's put out the kind of big paper on that.

01:36:41

And then input and output filters again, something that everybody uses, but interesting to see circuit breakers explicitly listed there. The loss of control section. it's pretty short. They do say they'll be using benchmarks as well there. Obviously, as paper itself says and others have argued, you know, benchmarks for loss of control, not necessarily reliable because of, alignment faking, right?

01:37:02

Like, you expect beyond a certain level of capability, models to recognize when they are being tested and their early indications that that is actually a thing and to adjust their behavior accordingly, such that you actually might expect your benchmarks to look. Extra good. It'll really look like your models well aligned when the risk is highest.

01:37:20

And so a bit of a, bit of a challenge with using, using benchmarks here, but they do say our evaluation and mitigation plans for loss of control are not yet fully developed. And we intend to improve them in the future in this way. This is very similar to Anthropix, ASL four stuff and ASL five where they're like, look, we don't really have an answer yet. And frankly, I find that to be a lot more honest and transparent than opening eyes position, which seems to be yeah, like.

01:37:44

we're barreling towards this level of capability. We think we'll hit it soon. We're kind of maybe worried about loss of control as a thing. But I'm sure we'll figure it out when we get there has been more, at least the vibe since the high profile departures of basically all of their former super alignment talent. So, you know, actually seeing people come out and say, Hey, yeah, we, we don't really know what to do here.

01:38:07

it is obviously the only honest answer based on all the data that but it's also nice to see it explicitly laid out here. It makes it easier for, for people to kind of reason policy wise, they do list a bunch of other operational societal risks as they call them. Things they're going to do information security wise.

01:38:22

One thing that's really interesting is this focus on security, information security and implementing appropriate info sec security standards to prevent grok from being stolen by motivated non state actor, right? This is late stage A. S. I. Stuff, right? Like you start to think about what if China steals our A. G. I. What if whatever? That's a really, really important. They do this cool thing that I think all the other labs should do is to foster accountability.

01:38:49

We intend to designate risk owners to be assigned responsibility for proactively mitigating grox risks. For instance, our risk owner would be assigned for each of the following areas, WMD cyber, and loss of control. So an explicit point person on whose head, like the responsibility for that risk set falls explicitly. That's really important because otherwise you have diffusion of responsibility in the organization.

01:39:13

They also list like what, what we'll do if we learn of an imminent threat to the organization. And they cite X. A. I. Employees have whistleblower protections, enabling them to raise concerns to relevant government agencies. That's really good. Frankly, something that opening I does not have in practice. You know, they came out shortly after.

01:39:30

Actually, our first report was launched last year or sorry, just before with the thing saying, Oh, hey, we have this internal whistleblower hotline and then all these whistleblowers came forward without using the hotline tells you a lot about the level of confidence people over there have in their sincerity on that side. stuff. So anyway just really, really good to see the emphasis there. So lots of stuff going on here.

01:39:50

We'll see the longer version of the document, but as an initial draft, I mean, if you're gonna put something together like this, I think this is a pretty good start actually. And one more story. And as always, we've got to have a story about chip bans and export restrictions this time. It's about TSMC restricting sales to China as a result of us export sanctions. So the rules are that.

01:40:17

TSMC is not allowing design firms to order chips made with 16 nanometer and below processes unless they use government approved third party packaging houses. So below 69 we is basically like Top of the line chips, even, you know, 69 years, I presume is like super old and nothing you want to use will be using that. So this is starting end of January. So already in action and this will apply to Nvidia MD. Everyone works with TSMC. Yeah. The, and the idea here is, like.

01:40:58

If you can imagine some Chinese company ships to T-T-S-M-C. TSMC makes the dies. So you, we talked about this in our hardware episode. Basically the, like the GPU die, like the, the logic die, but then the logic dye needs to be packaged with. Memory usually from SK hynix or Samsung or something. So those things can sometimes get packaged, not by TSMC, but by other companies.

01:41:22

And so TSMC has to go, okay, I'm going to ship this logic die, which is kind of the crown jewel of the GPU in a sense to this other country, maybe in a different jurisdiction that I can't control where I just have to trust that the packaging. Plant that's going to combine that chip now with, the Samsung or the SK hynix memory stack or whatever and package it together.

01:41:42

I'm just going to trust that that packaging plant is going to follow whatever export control compliance stuff know your customer or whatever the U. S. government has asked me to follow. And in practice, that doesn't happen. It's, it's a big vector for getting chips into places where they shouldn't be. And so what the U. S. government is saying is look, we need essentially government approved packaging facilities to be the only ones who receive dyes from TSMC that are 16 nanometers or below.

01:42:10

It's consistent with some of the export control stuff that came out in the late Biden administration as well, which by the way, the Trump administration has kept in place kind of an interesting case of executive orders that have not yet at least been thrown out. So anyway. It's sort of interesting. It is the case that around as of 2023, China only contributed about 8 percent of TSMC's revenue.

01:42:31

and so, you know, this is not maybe the big hit that that it might sound like to TSMC having this extra constraint.

⁠¶ Listener requested topic

01:42:38

And actually I'll throw in the bit of listener questions here, since it's quite relevant, we had someone on Discord asking for your take on the export controls on H20 chips. So I, I don't know much about chips, but apparently H20s are good for reasoning the models due to having more memory than H100s. And yeah, he just wanders on your take as to that general idea. is changing. So, you know, new export controls come in that, that do cover the memory side as well.

01:43:12

And that was again, part of the last hurrah, the Biden administration there. So the H20s are no longer going to be shippable to China. And that's, you know, Really important. Because yeah, they actually are, especially as we go into the inference paradigm, they are more important. It's also like you just want to be robust to whatever the new paradigm ends up being.

01:43:29

So as long as you're shipping stuff where NVIDIA has room to optimize where they're allowed to use their five nanometer, their three nanometer, like the TSMC five and three and two nanometer processes. Yeah. You just don't want that in the PLA's hands. Like why, why would you, you know, why would you ship that stuff?

⁠¶ Synthetic Media & Art

01:43:48

And we almost done. We're going to do a quick lighting round for the last couple of stories. I'm just going to run through them in synthetic media and art. First up, we have Thompson Reuters has won the first major AI copyright case in the U S. This was a lawsuit from back in 2020, there was a firm named Ross Intelligence that was reproducing materials from a legal research firm related to them.

01:44:15

They lost the company Ross Intelligence, that company is now out of business anyway for a couple of years. But the important bit here is that the fair use clause, the kind of argument that there's fair use. For being able to replicate a material of Thomson Reuters was rejected by the judge, which could be have implications for the ongoing lawsuits against open AI and actual generative AI companies, which was, was not.

01:44:47

And the last story is that Scarlett Johansson has called for a deep fake ban after an AI video has gone viral. I don't know why, but Scarlett Johansson seems to be very involved in stories regarding synthetic media celebrities. In this case. There was a video posted on Instagram of various celebrities wearing a t shirt with a logo of like the middle finger raised with the Jewish star of David and the word Kanye as a response to, I guess, Kanye West doing some anti Semitic stuff.

01:45:24

Well that included scholar Johansson and other celebrities and Johansson made a statement You know, as well, she probably agrees with the idea is not great that she has been depicted in a post that has gone viral on Instagram, slightly viral and potentially could be misconstrued as a real thing that she was involved in as opposed to. Something just made up by some random guy who wanted to make us post.

⁠¶ Outro

01:45:55

And we got to finish with that. Thank you for listening and thank you for all the comments on discord. We haven't been able to get to too many on this episode. We got to finish up, but feel free to. Ask for a couple more topics or stories to discuss. We'll try to get to more of them next week. Thank you for listening. Thank you for subscribing.

01:46:14

Thank you for chatting on discord and more than anything, thank you for continuing to tune in as we apparently will never stop and keep going until AGI gets here.

Transcript source: Provided by creator in RSS feed: download file

#200 - ChatGPT Roadmap, Musk OpenAI Bid, Model Tampering

Episode description

Transcript