Here we go, talking about Gemini, don't you know? Gen-3 Alpha too, so fun. Get ready for a wild ride, we'll make it fun, you see, from AI agents to Chevron deference with glee. Hello and welcome to the latest episode of Last Week in AI, our podcast where you can hear us chat about what's going on with AI. And as usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. And as always, you can also check out our Last Week in AI newsletter at lastweekin.ai
for articles we did not cover in this episode. I am one of your hosts, Andrey Kurenkov. For some context, I did a PhD in AI at Stanford that I finished last year, and I now work at a generative AI startup. And I'm your other host, Jeremy Harris. I'm the co-founder and CEO of a company called Gladstone AI. We do AI and national security work. And, you know, we've had a bunch of really useful feedback lately on the podcast on a number of different things. One of them was that apparently the sound coming from my end was shit.
And so we're going to try to fix that today, um, in a couple of different ways. I'm recording locally, but I also moved my microphone closer to my face. And so, uh, for those of you who are watching on YouTube, uh, you probably like that, because it takes up a lot of the space that otherwise my face would in the shot, which is probably a good thing. Um, anyway, so, uh, gonna try to get you guys some better sound quality this time around, and hopefully, uh, a bit of a smoother ride.
Um, so yeah, I actually like it in frame because it makes it feel more like a professional podcast, as opposed to what it is, which is, like, two guys with full time jobs just doing this on the side, with one of them, a fully non audio engineer, taking care of editing. But you know, we try our best. Actually, uh, shout out to the reviewer, point to yourself, Jenga rocks on Apple Podcasts, who gave us a three star review, which is not great, but very useful feedback.
Uh, and we will try our best to take it into account and, uh, do better. We'll see if, uh, that actually works. Plus we look more like Larry King this way. Um, which, you know, plays into our ongoing strategy to gradually turn into Larry King, because that is a thing that people would want to do for some reason. Okay. News to me that that's the goal, but, uh, I guess we'll see if that works out. I thought we were completely aligned on that, but okay.
And real quick, just another shout out to another review. Uh, Green Poo VFO on Apple Podcasts, uh, has a review titled "keep the doom, keep the doom." Okay. Yes. Uh, and, uh, yeah, it says keep up the doomsday stuff, please. I'm not sure how this became a whole saga. I think there's a war playing out right now in our review section. Yeah. Yeah. So we'll try to appease everyone. Maybe a little bit of doom every now and then. Just a little bit of doom.
Yeah. Yep. And that's the news, starting with tools and apps. The first story is Google opens up Gemini 1.5 Flash and Pro, with Pro having 2 million tokens, to the public. So this is in Google Cloud, and these two models are now publicly accessible, with Gemini 1.5 Flash being a small multimodal model that, uh, is more for high frequency tasks, and Gemini 1.5 Pro is the most powerful version of their Gemini LLM, and notably has a 2 million context window, 2 million token context window.
That's the amount you can have in the input. And for some context, that is a lot. That's like 60,000 lines of code, or over 1.5 million words. So this is one of the big surprises to me over the past year. You know, if you remember, in the early days of GPT-3 the inputs we could get were like 2,000 tokens, relatively small. And I felt that this would be one of the fundamental limitations of a transformer model. And it turned out to be not the case at all. So there you go.
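For a rough sense of where those numbers come from, here's a back-of-envelope conversion; the ~0.75 words per token and ~30 tokens per line of code are rules of thumb assumed for illustration, not Google's exact tokenizer math.

```python
# Back-of-envelope conversion for a 2 million token context window.
# Assumptions (rules of thumb, not exact tokenizer figures):
#   ~0.75 English words per token, ~30 tokens per line of code.
CONTEXT_TOKENS = 2_000_000

words = CONTEXT_TOKENS * 0.75           # roughly 1.5 million words
lines_of_code = CONTEXT_TOKENS / 30     # roughly 66,000 lines of code

print(f"~{words:,.0f} words, ~{lines_of_code:,.0f} lines of code")
```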
It's certainly an exciting day for developers. And, uh, these are some pretty capable models, so pretty nice to have. And they're really starting to compete on those sort of secondary metrics, right? We often talk about, like you said, context window length, you talk about, you know, sometimes perplexity or whatever, but we're starting to see stuff at the engineering level really become the main point of comparison as inference gets more and more important.
So here in particular, you know, they'll tell you, yeah, Gemini 1.5 Flash is, they say, 40 percent faster than GPT-3.5 Turbo. That's at an input of about 10,000 characters, which seems pretty fair. Um, so, you know, that's basically saying, hey, we have 40 percent lower latency. They also apparently have an input price per token that's four times lower than GPT-3.5, with, um, with this thing called context caching enabled for these larger inputs, which is its own kind of interesting sub-announcement in this whole thing. They're launching this, um, what they call this context caching feature, uh, as a public preview right now. This is both for Flash and Pro at Gemini 1.5. This is basically a thing that allows models to store and reuse information they've already
kind of processed, without having to recompute everything from scratch every single time they get a request, right? So think about, like, your sort of short term memory. Um, I assume that they're using just kind of embeddings to achieve this, so it's like kind of a middle ground with RAG. Um, anyway, so, really interesting. Another cost saving strategy: keep the things in context so you don't have to reprocess them.
That all seems to be part of the, sort of the, um, turning this into more of an engineering, not more of an engineering discipline, I guess, but more of a user facing productized thing rather than a research artifact, that we're seeing across the industry right now. Yeah. That's a very good callout. I think this caching feature is more of a big deal than it might seem. To me, it's been a bit surprising that it hasn't already been the case.
Uh, the idea here is, if you're kind of fine tuning these LLMs via just a system prompt, you condition them to say, this is your task, blah, blah, blah. If you just reuse that without any modification, it makes a lot of sense that you're able to cache it. And, uh, I actually looked around; there aren't, uh, except for this, existing cloud models that allow you to do this. So probably pretty useful.
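To make the idea concrete, here is a minimal sketch of prefix caching as a concept, not Google's actual implementation or API: hash the shared prefix, pay the encoding cost once, and reuse it for every request that starts with the same prefix. `encode_prefix` is a hypothetical stand-in for the expensive step.

```python
# Conceptual sketch of context caching (hypothetical, not Google's API):
# pay the cost of processing a long, fixed prefix once, then reuse it.
import hashlib

_prefix_cache: dict[str, str] = {}

def encode_prefix(prefix: str) -> str:
    """Hypothetical stand-in for the expensive step (e.g. building a KV cache)."""
    return f"<encoded {len(prefix)} chars>"

def answer(prefix: str, user_query: str) -> str:
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _prefix_cache:          # only encode the prefix the first time
        _prefix_cache[key] = encode_prefix(prefix)
    cached = _prefix_cache[key]
    # A real system would generate conditioned on the cached state; this just
    # marks where the reuse happens.
    return f"answer to {user_query!r} using {cached}"

system_prompt = "You are a support bot. Here are 500 pages of product docs..."
print(answer(system_prompt, "How do I reset my password?"))
print(answer(system_prompt, "Which plans support SSO?"))  # prefix not re-encoded
```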
One other note that also is in the story: apparently there was a press briefing in which the Google Cloud chief executive said that a lot of organizations, big ones like Accenture, Airbus, Box, Broadcom, et cetera, are building on the platform. And this is one thing also not to discount, that, you know,
Google may have seemed to be behind OpenAI in developing Gemini, but they are a big company that is certainly ahead in terms of being able to provide enterprise offerings on its cloud platform. So it makes a lot of sense that these kinds of big companies would work with Google, potentially Microsoft and Azure as well, but certainly Google does have an edge here that may be, uh, not apparent. Yeah, you know, they often talk about, uh, the importance of distribution, right?
As, as the central thing. The, the example I always remember from back when we were doing the YC stuff, I think Microsoft Teams, uh, versus Slack, is just such a great example, right? Slack launches, they get this like slow, steady increase in user traffic. I mean, it's very fast, right? This is a successful Silicon Valley company.
But, um, you know, Microsoft Teams launches and their graph looks like a hockey stick in comparison, because they start with all this insane distribution. And, you know, you see OpenAI trying to kind of make up for that deficit by partnering with Apple, by partnering with Microsoft. Um, but ultimately Google has that baked in, right? So this is a structural advantage that they enjoy over OpenAI, much more so. So kind of interesting.
Yeah. That, that this is something they don't have to worry quite so much about. It is an axis they can use to, to differentiate on and just kind of, like, force users to start using their products, essentially. And next, from another big company: Meta is about to launch its biggest Llama model yet, which is a Llama 3 model with 400 billion parameters.
So when Llama 3 came out with the 8 billion and 70 billion parameter sizes, we also got a bit of a preview, uh, with the awareness that this 400 billion parameter model exists, that it is a GPT-4 kind of level model that they were still training, which I thought was kind of a funny note. And they showed that on the major benchmarks, uh, it was matching GPT-4. So apparently it's nearing release now, a couple months later.
We don't know if it will actually be released and open sourced the same way as the smaller, uh, sizes have been, but it does seem like now it's being rolled out in some of the Meta products, such as on WhatsApp for beta users. There's now an option to use the Llama 3 405 billion parameter model. Yeah. And these use cases, even these really beta ones, do seem to have significant limits on usage volume.
Um, so, you know, just to kind of get people stoked about the upcoming release, this article is speculating July or August of this year; we'll have to wait and see. One of the, the, um, things I suspect they're going to be struggling with here is making sure that the model can't be jailbroken as trivially as the other Llama 3 models can be. Uh, you know, this has been a persistent issue for Meta, as they've in some cases been
embarrassed by, uh, like, congressional testimony. You know, Zuck got dragged in front of some committee or other, and I'm trying to remember if it was Tristan Harris or who exactly it was, who sort of held up this jailbroken version of Llama 2 at the time. And, you know, Zuck was saying, ah, well, you know, you can't do these bad things. And they said, well, look, we took your, your model and basically, like, trained it just a little bit and it forgot all of its safeguards.
And now we can, you know, get it to try, at least, to do some bad things, even if it couldn't successfully execute them due to a lack of capabilities. So I think this is going to be a big issue, because when you start to think about models like Llama 3 400 billion, right, this is a model, as you rightly said, that's been compared internally by Meta to GPT-4 level performance. Well, you know, what are the things that we know GPT-4 can do? There's things like
autonomously discovering and exploiting, you know, one day vulnerabilities, zero day vulnerabilities, um, providing significant measurable lifts in cyber offense capabilities, even to some degree on bioweapon design capabilities. And so we're already there. You start to open source those systems, and that can create some problems. So internally, I do wonder what that discussion sounds like.
I wonder what kinds of safeguards they're planning to implement, because it's not even just like, hey, let's implement safeguards that are robust to the jailbreaking techniques that we know of today. They have to be robust to jailbreaking techniques that have not yet been invented. And it's really not clear, you know, how you square that circle.
So I'd love to be a fly on the wall right now in the kind of, in the red teaming rooms at Meta, and the kind of decision rooms where they're thinking about, you know, the pros and cons of releasing that 400 billion parameter model. But if it comes out, it'll be a very, very big deal. Yeah, for sure. And I think there's really no way, given what we know about jailbreaks: if you're able to train the model, you can't safeguard against it. It's just impossible.
So this is one of those questions that has been discussed a lot in open source: what's the benefit versus the risk of releasing models and enabling, uh, actors to misuse them to nefarious ends. And so we'll have to see what consideration Meta makes on this one. Uh, and you know, if it's released, maybe that starts a doomsday-scenario kind of conversation from there. But you know what? Maybe, maybe the 4 trillion, uh, parameter one. Yeah. The next one. Yeah, it's almost a 10x away from here, right?
That's what they, uh, they always say. On to the lightning round. First story is Runway's Gen-3 Alpha AI video model is now available. So this was announced and previewed very recently; I think we covered the announcement and some of the preview videos last week. They said that it would be available to users of their products very soon, and that has turned out to be the case. It is now available to paying users. So you have to pay $12 a month to be able to access the model.
You will now be able to generate videos up to 10 seconds, and the generation speed will vary. Uh, and it's not quite Sora level yet, but it is a big upgrade over Gen-2. And there aren't many players that have launched these sorts of tools. We've seen Luma do that, but that was sort of a public preview; they don't have many paying customers yet, to my awareness. So, uh, Runway has been in the game for a long time. This is a pretty impressive model; they've caught up in some sense.
So I wouldn't discount them even as Sora and things like that are coming out. And like you said, I mean, there are other competitors entering the space. You mentioned Luma, Stability obviously, Pika, OpenAI itself with Sora, though that hasn't been released. So, you know, the space is heating up a little bit. Runway had been working pretty quickly back in the Gen-1, Gen-2 days, that is, um, you know, pumping out those models pretty quickly. Absolutely.
A few months between them. Um, the big kind of gap that we've seen is really when all these extra players have come in. So I think that's, you know, now people are starting to ask those fundamental business questions, especially looking at the situation with Stability, where this, you know, open-source-everything story just did not seem to pan out. You know, Andrey, I remember we were talking about that back in the day, being like, what is the business model here?
Like, at some point you've got to make some dollars, and I get that you can do it in various interesting ways by hosting and all that stuff. But, but you know, at a certain point you need a competitive moat, and, um, it looks like, you know, again, offering some users access to paid versions of these products does seem like it's an obvious default. Um, so we'll see if that becomes a norm, more so on the kind of generative video and image stuff. But, uh, one thing that they do say is
that the model currently runs with text to video, obviously, but in the coming days they'll be, uh, releasing other modes. So that'll include, apparently, image to video and video to video. That's going to be really interesting, you know, like, what new editing possibilities, uh, are accessible as well. Right now they are capped at 10 seconds of video generation, right? So, you know, Sora famously is, is a minute long, so this is 10 seconds.
Um, but then again, you know, it's so hard to compare it to Sora, because, like, we don't have access to it. It hasn't been released for general use. So we'll just have to see. But, um, definitely, you know, a big step on the path to productization of text to video. This is a big day. One thing to note about Runway, as someone who has played around with it and has kept track of the company: uh, it's worth knowing that they offer a lot of stuff
aside from the video generation. They offer background removal, uh, kind of optimization of video, up-rezzing, all these things that, as a professional in the industry, most likely you would get a lot of benefit from, beyond video generation, which is pretty unproven and is like a cool thing to fundraise and get money from investors for, but not necessarily something that people need at the moment. So, uh, I don't know why I keep saying this, but I guess I'm
kind of a fan of Runway, and I, uh, like to see them having these tools that seem like they would have a pretty significant impact. And then, uh, yeah, as you'll cover, they are hoping to fundraise quite a bit soon. And you, uh, use it at your current job, like for the video game? I don't use it; I've played around with it. I'm always curious, with the work that you're doing there, you know, to know, well, like, what tools are actually useful when you're doing generative AI for video games.
So yeah. That's it. That's cool. And next we are getting to Google as yet another big company, and the news there is that there is some AI that's coming to their Pixel 9 smartphone. So this is basically a set of various features under the branding of Google AI:
things like Add Me, which ensures, uh, everyone is in a group photo; Studio, which is, uh, AI image generation; Pixel Screenshot, which, you know, evidently will be similar to Recall, where you can take a screenshot on the phone and talk to the AI about it. So these are various built-in things coming to Google phones, in a way I think reminiscent of some of the announcements of Apple building stuff into iOS. And we'll see how much people like it, or if they just ignore it.
Yeah, and you know, this is Google continuing to try to figure out, how do we integrate everything under one hood here, right? So, uh, yeah, we'll see. Next, AI firm ElevenLabs debuts a reader app packed with Judy Garland, James Dean, Burt Reynolds, and Laurence Olivier. ElevenLabs is the audio generation firm that has allowed people to turn text to voice, and recently they announced this Reader app, which essentially provides narration for any text.
Now it seems that they have had these agreements to have the voices of these famous actresses and actors do that, uh, reading for you. They say this would be part of the iconic voices feature of the app. So, I don't know, interesting. You know, uh, apparently this hasn't been met with quite as much controversy; it was a pretty legit deal, a licensing deal. The daughter of Judy Garland has endorsed the deal. Uh, so, I think, you as a user, right?
You want sort of a professional narrator rather than a robotic voice. So certainly I could see why people might want to use this. Yeah, no, for sure. It's interesting.
This is almost like the soft underbelly of how to get, you know, automated, um, text to audio done without pissing people off, uh, while still really having the same effect that a lot of people are complaining about. You know, I think we talked about this in the context of, um, a couple, I guess, like almost two years ago now. So I, I wrote this book and I did the, uh, the audiobook for it. And it, like, it was time consuming. You're sitting in the studio.
It's like 16 hours of painful recording. And if you don't do it as the author, someone else does. There's a whole industry of people who will read the text of a book and, you know, do the audio. And people were saying, well, there was a scandal, I think, at the time we were talking about, where, um, you know, companies were selling access to those audiobooks, the sound that was being read by, uh, those on-contract readers, and saying, hey, you can't use their
own, you know, their own work, uh, to automate away their, their whole jobs. Well, this is a way around that, right? You go to people who, uh, aren't going to be producing any more work like that, because they're already dead. You go to their estates. That's one good way to do it. Um, but it achieves roughly the same end.
So, you know, uh, maybe not as, as controversial, but again, achieving the same ends. And, um, we'll see if this becomes a, a norm. You know, I, I think just given the cost of doing a lot of these recordings, the whole studio time, multiple employees, sound engineers, like, this, this could really be a thing. Next, Perplexity's Pro Search AI upgrade makes it better at math and research. Perplexity is, of course, the AI powered search engine that, uh, you can basically use to input a query.
It does search online and looks up various websites, and it provides an answer informed by that. So they have now upgraded the Pro Search feature, which would be the thing that you have to pay for, and there's a couple things in there. They, uh, now have an updated Pages feature, which we covered a little while ago. And here they say that, uh, it's going to be better able to understand complex queries, plan responses, and synthesize in-depth answers, outside of that Pages feature.
Uh, and, uh, interesting. Yeah. Perplexity is one of the, you know, pretty buzzed about, uh, companies. I think Jensen Huang said he uses it daily. I've played around with it, and I would say it seems quite useful. So it makes sense for them to keep kind of pressing the advantage they have in this space.
It's also the kind of thing that is so much easier for Perplexity to pull off, because they don't have all the baggage of the sort of legacy, uh, use cases that a company like Google, which nominally they frame themselves as competing directly with, uh, has. And so they're able to just sort of be more, um, open-horizon about what their product possibilities are. Um, just looking at the demo that they've got, you know, yeah, it looks pretty compelling.
They type in a search query, like, I want to see the Northern Lights, when is the best time to go, and what are the top viewing locations in Iceland or Finland? And yeah, click on the Pro Search thing, and it automatically goes through, like any, you know, good web agent would, to, uh, dig up the answer to your question in a way that's a little bit richer and more detailed, uh, than you would necessarily get through a standard search process.
So, yeah, we'll, we'll see if this is actually the use case people want. One of the things that does, you know, just jump to mind when you start to introduce this kind of use case is, again, inference becomes so, or sorry, uh, inference latency becomes so much more important, right? You're just watching as this thing kind of thinks about your question. And as I imagine myself as a user, right, my user experience is that of, like,
just hanging on while this thing chugs away. Uh, that may or may not be what I want, but, uh, as you start to see the, you know, the efficiency of inference, the cost of inference, and the time of it go down, uh, you know, these things might become more and more attractive. So Perplexity is definitely positioning itself through this to, uh, do some, some really interesting things.
And just FYI, if you recall, if you remember the silly things that Google's AI responses did, recommending people eat rocks, well, people did also check out what Perplexity does there. And surprise, surprise, Perplexity did not recommend eating rocks. It actually recognized that it was silly and pointed out that there are some sarcastic responses. So, yeah, it, it is a pretty polished product in this category.
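For a sense of what this kind of agentic search does under the hood, here is a minimal plan-search-synthesize sketch; `llm` and `web_search` are hypothetical stand-ins, not Perplexity's actual API, and the real product presumably adds ranking, citation tracking, and streaming.

```python
# Sketch of a plan -> search -> synthesize loop; `llm` and `web_search` are
# hypothetical stand-ins, not Perplexity's actual API.

def llm(prompt: str) -> str:
    """Stand-in for a chat-model call; swap in any provider's client."""
    return "northern lights season iceland\nbest aurora viewing spots finland"

def web_search(query: str) -> list[str]:
    """Stand-in for a search API that returns page snippets."""
    return [f"[snippet for: {query}]"]

def pro_search(question: str) -> str:
    # 1. Plan: break the question into concrete sub-queries.
    plan = llm(f"List the web searches needed to answer: {question}")
    sub_queries = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Search: gather evidence for each step.
    evidence = [s for q in sub_queries for s in web_search(q)]

    # 3. Synthesize: answer grounded in the collected snippets.
    sources = "\n".join(evidence)
    return llm(f"Using only these sources, answer '{question}':\n{sources}")

print(pro_search("When and where are the Northern Lights best in Iceland or Finland?"))
```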
And last story for the section: Gemini's data analyzing abilities aren't as good as Google claims. These are some studies that showed that for large data sets, the correctness rate is fairly low; apparently they give the right answer only 40 to 50 percent of the time. The tests here are basically evaluating true/false statements about fiction books. And for these large portions of a book, which have implicit information that is not stated in the text, apparently these models aren't that great.
And this is one of the kind of questions that are still out there for these huge context windows: how good are they in practice with giant inputs? I don't think we know the answer yet, really, and the studies do shed some light on that. Yeah, there's some important framing here too, which is that it's not that, you know, Google is behind the curve, or that they're, you know, doing things, um, badly on a technical level.
The complaint here really is about what Google's marketing department has been doing, um, basically over promising, under delivering. I mean, Google's obviously delivering excellent products. Gemini 1.5 Pro, 1.5 Flash, these are really impressive models. Uh, the challenge is that, you know, when you go out there and say, hey, we can actually solve these very complex data analytics problems, you know, maybe that's not quite the case, or things are cherry picked.
It's something we've seen with Google before, right? We've covered stories where Google has come out with an impressive demo that doesn't quite seem to live up to the hype. Um, that's maybe a bit of an internal issue for them, the right hand not knowing what the left hand is doing perhaps, or the marketing department just kind of getting ahead of itself.
Um, one of the things that does come across in these, in these studies, it's kind of interesting, you know, the, uh, the old needle-in-a-haystack eval, which is the way very often these models' ability to, like, um, recall information out of a really, really large body of text is tested. Um, it's to insert a very specific piece of information, like one sentence that is out of place somewhere in a really long piece of text, and then test to see if the model can actually recover that sentence correctly. And what they're pointing out here is
that they notice, and this is a quote from the article: "we've noticed that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be solved by retrieving sentence-level evidence." So when the information you're trying to recover is more diffuse, when you have to put a lot of things together, all of a sudden your accuracy goes down a lot. It's not a pure recall test.
It's a recall plus synthesis test. And I think that really highlights, you know, a kind of eval that we haven't seen done, certainly not nearly enough of, when we look at these big models in long context. Yeah. And I think it makes sense.
We've seen, uh, kind of in the very early days of large context windows, Anthropic show that you could do really well on these sorts of tests by just appending, you know, "here is the most relevant sentence" in the prompt for these needle-in-the-haystack benchmarks. And it has been sort of criticized that, yes, it's a needle in the haystack, but it's a very obvious needle that really stands out.
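For reference, a minimal needle-in-a-haystack harness looks roughly like the sketch below, with `ask_model` as a hypothetical stand-in for any long-context model call; note it only measures retrieval of one planted sentence, which is exactly the limitation being discussed.

```python
# Minimal needle-in-a-haystack harness; `ask_model` is a hypothetical stand-in
# for a long-context model call.
import random

NEEDLE = "The magic number is 417."

def ask_model(prompt: str) -> str:
    """Stand-in: a real test would send the prompt to the model under test."""
    return "The magic number is 417."   # pretend the model answered correctly

def trial(filler_paragraphs: list[str], depth: float) -> bool:
    docs = list(filler_paragraphs)
    docs.insert(int(depth * len(docs)), NEEDLE)   # bury the needle at `depth`
    prompt = "\n\n".join(docs) + "\n\nWhat is the magic number?"
    return "417" in ask_model(prompt)

filler = [f"Filler paragraph {i} about unrelated topics." for i in range(5_000)]
hits = sum(trial(filler, random.random()) for _ in range(20))
print(f"needle recall: {hits}/20")
# This only tests retrieval of a single planted sentence, not the diffuse,
# whole-book synthesis the studies above say models struggle with.
```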
So, uh, yeah, these benchmarks, I think, haven't really shown how useful these context windows will be in practice, and this sort of research is very informative. On to applications and business. The first story is that Quora's chatbot platform Poe allows users to download paywalled articles on demand.
So you can download HTML files for articles published by paywalled journalistic outlets, and that includes things like The New York Times, Bloomberg Businessweek, The Atlantic, and all of these other big services. We also saw the claim being made that Perplexity does this for Forbes, where they look at the article and can produce kind of verbatim responses without really citing the source, or not that much.
And in this case, it also appears to be a case where the bot does not adhere to the Robots Exclusion Protocol, which is a web standard for preventing bots from accessing parts of websites.
So, really interesting developments on this front, where the internet and any publisher of information, such as, uh, journalists and things like Reddit and Twitter, are now really having to consider, uh, blocking downloads of data. We've had paywalling for users, but it has been fairly easy to get around those paywalls if you tried at all, and I think that's going to change very much. Yeah, absolutely.
And it's really interesting to see the legal arguments that are being made to say, like, hey, this isn't really plagiarism. You know, there's this one quote that they pull out, um, from the article, and they say, like, because they made a copy on their own server, that's prima facie copyright infringement. That is going to be the claim that, uh, the sort of, uh, the newspaper organizations put out. Quora disputes this, comparing Poe to a cloud storage service.
So which one is it, right? In, in practice, it seems like you really can just use these systems to, like, yeah, get a downloadable PDF. To me, naively, you know, from a user experience standpoint, yeah, it feels, it feels an awful lot like, uh, it feels an awful lot like plagiarism.
But, um, you know, they say also, uh, to your point, right, that according to the web server's logs, for the websites that, um, essentially are hosting these news stories, immediately after the assistant bots were prompted to summarize, uh, text on the site, they found a bot identifying itself as Quorabot that visited the site. Apparently, as you say, it did not attempt to visit the site's robots.txt page.
So that robots standard, Andrey, that you just mentioned, um, it's kind of worth flagging. I think we might have talked about it before, but on a lot of websites, if you go to, like, you know, whatever.com/robots.txt, you'll see there a list of bots that are actually excluded by name. So you'll see, like, you know, Quorabot explicitly listed, or whatever.
And that tells you, okay, this page cannot be, um, scraped by the Quora bot, or by whatever the bots are that are listed there. Uh, the suggestion here is that essentially that protocol is being completely ignored. And, you know, especially at a time when we're starting to figure out, like, hey, do people by default have a right to not have their websites scraped? Um, or do they have to put it in the robots.txt? Or will companies ignore that?
And does the law have to intercede? Do we need new laws? Uh, that's a kind of interesting exacerbating point, if you notice that in practice people are ignoring these sort of, like, goodwill requests made implicitly through the robots.txt file. So, uh, yeah, really interesting.
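For anyone curious, the Robots Exclusion Protocol can be checked from Python's standard library as sketched below; "Quorabot" and the example URL are used purely as illustrative strings here.

```python
# Checking the Robots Exclusion Protocol with Python's standard library.
# "Quorabot" is just an illustrative user-agent string; GPTBot is OpenAI's
# crawler name, commonly listed in robots.txt files.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

for agent in ("Quorabot", "GPTBot", "*"):
    ok = rp.can_fetch(agent, "https://example.com/articles/paywalled-story")
    print(f"{agent}: {'allowed' if ok else 'disallowed'}")

# Nothing technically forces a crawler to run this check, which is the whole
# complaint: compliance with robots.txt is voluntary.
```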
Quora's argument here is that "file attachments on Poe are created at the direction of users and operate similarly to cloud storage services, read-it-later services, and web clipper products, which we believe are all consistent with copyright law." So that's, uh, anyway, a whole bunch of, of interesting stuff there. Just to say, if you're the one who is, on a one-off basis, you know, clicking something, and that's what generates the summary, they're claiming,
oh, well, then that's not, um, that's not, uh, a sort of copyright infringement. Uh, so anyway, kind of interesting. I don't know that I buy it personally, but hey, again, we've got lawyers who listen to the podcast, so let us know if you think this is a good argument. Yeah, all these things are so much up in the air, but it does make me think, with these stories about Quora, about Perplexity, it really starts to make sense that OpenAI has been doing all these media partnerships.
Uh, over the last couple of years, really aggressively pursuing, paying different outlets millions of dollars to train on their data, but also to include their data in the outputs, to refer to them, cite them, right. And if it does come out that for paywalled, uh, websites, you cannot read the website to, uh, generate a response, then this could be one of the big competitive advantages of OpenAI. Yeah. Right.
As of now, it's a little bit, uh, interchangeable using GPT-4 versus Gemini versus whatever else; they're a little bit different, but not that much. But once these tools can be differentiated by partnerships, they do become quite a bit different, and depending on your needs, it may make a lot more sense to use OpenAI rather than anything else. So yeah, it's developing into a sort of paradigm that I don't think we really expected or thought about last year.
Yep, and now OpenAI is signing with Time magazine, so there you go. Yeah, and on to the lightning round, where we'll try to be a bit quicker. First, a hardware story: Huawei and Wuhan Xinxin to develop high bandwidth memory chips amid US restrictions.
They are reportedly planning to develop these high bandwidth memory chips, and as we cover all the time, uh, China does need some means of producing different types of chips to actually compete on the AI front, uh, GPUs, but also these high bandwidth memory chips. And so it makes a lot of sense. Huawei has been one of the leaders in providing AI hardware in the space where NVIDIA is still the overall leader. Now, these companies have denied this report. They said it's a rumor, but, uh, who knows?
Well, yeah, I suspect the rumor is true on this one. But, uh, yeah, they've got a really clear reason to deny it. Um, they're basically fearing more US, uh, sanctions, right? So Huawei, uh, had the hammer come down on them when they announced to the world that, you know, hey, we partnered basically with SMIC to create a set of sanctions-busting 7 nanometer technology, um, a couple of months ago. And then, you know, the US came down hard.
They're expecting that to happen again if they're not careful about this sort of thing. Um, one thing worth flagging, the one thing I'll say about this is, uh, from a technical standpoint, this is about high bandwidth memory, right? HBM, a really critical technology, distinct from, uh, what you might think of as logic, the sort of, like, logic chips that, sorry, the logic on the chip that, uh, TSMC, for example, really specializes in.
So the logic piece is the piece that runs the calculations. Um, it's usually really, really close to, anyway, this thing called SRAM on the chip, a really, really small amount of memory that's just for, like, working memory. The high bandwidth memory is what you need in order to move the massive amounts of data, and all the model weights and the gradients and the activations that you need to run through your system while you're training or doing inference at scale.
So it's more of a kind of pipes thing than a number crunching thing. Uh, but both are really crucial. And we talk a lot about TSMC on the podcast; uh, they certainly are the leader on the logic side. When you look at high bandwidth memory, the kind of big players there are more like SK Hynix, um, and Samsung. Between them, those two control most of the market share in this space.
So this is really China saying, look, we have Huawei and SMIC working together to figure out the logic piece, and now essentially we've got other players as well on the high bandwidth memory piece. And that's really what's going on here. And, you know, that's going to piss off the US government if, if, uh, they make any big breakthroughs here, but it's early days for this category of tech. They, they very much are behind here.
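A rough back-of-envelope calculation, using assumed round numbers rather than vendor specs, shows why the "pipes" matter so much for inference:

```python
# Rough arithmetic (assumed round numbers, not vendor specs) for why memory
# bandwidth bounds LLM inference: every generated token streams the weights.
params = 70e9               # a 70B-parameter model
bytes_per_param = 2         # fp16 / bf16 weights
weight_bytes = params * bytes_per_param          # ~140 GB of weights

hbm_bandwidth = 3.35e12     # ~3.35 TB/s, roughly one H100's HBM3 bandwidth
tokens_per_sec = hbm_bandwidth / weight_bytes    # bandwidth-bound ceiling

print(f"weights: {weight_bytes / 1e9:.0f} GB")
print(f"single-stream ceiling: ~{tokens_per_sec:.0f} tokens/s")
# ~24 tokens/s: even with unlimited FLOPs, decoding one stream is capped by how
# fast memory can feed the weights (in practice weights are sharded across
# several GPUs, but the same bandwidth-bound logic applies).
```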
And the next story is also on a Chinese company, this time Alibaba, and the story is that its large language model has topped a global ranking on the AI developer platform Hugging Face. So Hugging Face is one of the sources for ranking AI models, and now three of the top 10 ranked models on Hugging Face are Chinese LLMs. These include the Qwen series models, Qwen 72B and Qwen 1.5 110B, in third and tenth places respectively.
The other one on there is Yi 1.5 34B Chat from the startup, uh, 01.ai, founded by the notable figure Kai-Fu Lee; that one was, uh, ranked seventh. So we see now competition coming around in the space of models in China, similar to Anthropic and OpenAI and Google out in the US, and it's pretty exciting. It's pretty clear that, as with a lot of technology, there are going to be essentially two independent ecosystems that rise up.
And, uh, it's interesting: there aren't many secrets in the tech involved in making large language models, so it's basically a race of how much compute you can throw at it. And Alibaba, uh, Tencent, various companies have a lot of compute, similar to Meta and Google. Yeah. Yeah. And you know, the secrets, to the extent they exist, you know, there are some important ones, but they tend to be at the very, very frontier. And so they'll take some time to diffuse out.
Um, certainly Alibaba is not at the true frontier, and like a lot of Chinese companies, they are incredibly constrained by the scarcity of advanced chips, right? They don't have access to the best GPUs. They've got these kind of crappy, uh, sort of one-and-a-half-generations-behind type chips that we're starting to see, uh, float around in increasing supply there. So that's part of the issue here. It is also part of the reason why so many Chinese companies are focusing
on the open source angle, right? There's a lot of reasons why you might want to do that. You know, one of which is you can imagine wanting to train in certain biases in your open source models as a way of exerting a kind of geopolitical leverage over your adversaries.
Or if, you know, American companies, for example, are going to use whatever the best open source model is, hey, you know, maybe make it talk a little less about the Tiananmen Square massacre, or how badly the CCP might be treating the Uyghur Muslims, or, you know, whatever else. Um, so this is all kind of part of the stuff that's baked in.
The other thing is a kind of legal front that's been opened here, where we've, you know, talked about this before, especially with, uh, 01.ai, where, you know, their terms of use say, hey, if you're going to use our language model and there's some legal issue that arises from it, that legal issue is going to be adjudicated in the People's Republic of China. So that's another way that they can get some leverage over what's going on over here.
But certainly Alibaba has made a big push in the direction of open sourcing. Again, like so many other Chinese companies, they have a disproportionate interest in this, because, you know, if you want to be number one at something and you can't be number one at the true frontier AI game, um, then, you know, the main thing that's left to you is, yeah, play the open source game. At least win that for now.
Try to use that to get some headlines, you know, maybe, maybe pull some researchers over to the extent that you can, and, uh, go from there. So I think that's kind of where this is all coming from. That leaderboard, by the way, on Hugging Face, I just went over yesterday to have a look, and, uh, yeah, I mean, it's, you know, it's Qwen number one, Meta, so, like, Llama 3 70B, number two, and then
in the number three spot there's Qwen again, the 72 billion Qwen2, the, uh, well, the non-instruct version. Um, anyway, so you see it is quite well populated by those Chinese models, which is not super surprising, but definitely noteworthy. Oh, and yeah, worth noting that these are the openly released models on the Hugging Face platform, so you're not going to have things like GPT-4 on there. Next up, here comes, uh, a Meta Ray-Bans challenger with ChatGPT-4o and a camera.
So Meta Ray-Bans are glasses that come equipped with speakers and a camera, and, as of this year, Meta AI. So you're able to say, "Hey Meta," and do all the usual things. With the AI, you can say, you know, what is in front of me; you can ask it questions and it will answer. And it's sort of been a little underrated, I would say. You know, you've seen things like the Rabbit R1 and the Humane Pin that have been huge flops in the space of wearable AI.
These glasses came out, this is the second generation of these kinds of glasses from Meta, and the reviews are actually pretty positive, partially because there's a lot of use for them outside of AI; you can just record videos from the glasses. And so now there is this competitor, Solos, a smart glasses manufacturer, which is aiming to release, uh, similar glasses that will have GPT-4o capabilities with a camera. You can ask it questions.
And evidently you will also be able to integrate with Google Gemini and Anthropic's AI models. So there you go. It'll be interesting if this turns out to be the wearable paradigm that people actually want, as opposed to these other, let's say, pretty large failures. Ouch. Yeah. Actually, it's funny, not to derail or anything, but I was watching a YouTube video from this channel Coffeezilla. And I'm not sure if you follow them, but you've seen it?
Yeah. Yeah. So they did a thing on the, um, the Rabbit R1 and the Rabbit team basically, and how they were, I had no idea, they were like a crypto scam basically, as far as you could tell, before they went into this space. So maybe, um, yeah, maybe just like a hype-chasing situation. But yeah, this, uh, this product looks pretty interesting. It's, um, also, it seems to be competitive with the Ray-Bans in terms of pricing.
Maybe, you know, we know that the current version, which does not actually have, um, uh, the video functionality in it, it's just audio only, uh, is, uh, retailing for basically 250 bucks. So it's presumably going to be more than that, presumably not that much more than that. So hovering around maybe that 300 price point that the Ray-Bans currently sit at. And I will say, just so it's, uh, out in the open, I am an owner of the Meta Ray-Ban glasses. I'm a bit of an early adopter.
I'm a fan of Meta VR. I'm a fan of... You don't need to flex. No need to flex, Andrey. I'm just saying, if I'm being very positive about these kinds of glasses, you know why. I do have a personal opinion that this is a sort of wearable people may actually want, unlike all these other things. What do you, what's the most common use that you put them to? What's the killer feature for you?
The big one is, uh, the video recording and taking photos, just because it means that you don't need to take out your phone. And if you're being a tourist, or are at an event, that's a pretty nice feature. You don't need to be distracted by looking at your phone; you can just keep looking at whatever you're looking at.
The AI integration isn't, I think, yet kind of a main benefit, but being able to take photos via voice commands, and being able to ask questions, sort of instead of going to ChatGPT or a website for really quick things, you can just use your voice, I do think, arguably, that could be very useful. Wait, where is the audio? You get, you get a response... no, you get a visual, a video response, right? And the LED... not an audio response to your... You get an audio response.
There are built-in speakers in the, like, frames that use bone conduction technology. And so you can play music on them, listen to books, and so on. Does the audio go into your ear, or do other people hear it as well? Or how does that work? It's, uh, yeah, using this fancy technology where you can sort of hear it, uh, but for the most part others don't, yeah. Oh, interesting. Okay. All right. Well, you heard it here first, 300 bucks, folks.
Uh, I'm sure, Andrey, do you have, like, a promo link that you can... Well, you know, unfortunately we're not sponsored by them, so, uh, too bad. But, uh, anyways, this is just a bit of a personal take; we're not advertising here. That's great. And next up, a bit of a spicy story that came out: uh, Apple's Phil Schiller is reportedly joining OpenAI's board. This will be, uh, an appointment with an observation component, so not kind of directing the OpenAI board's
actions. And this person is the App Store chief and the former marketing head. So there you go, you know, the partnership between Apple and OpenAI continues to get deeper. It does. The other thing, too, that's sort of interesting about this, and that jumped out at me when I saw this, was, uh, boy, this is going to be awkward. You've got the Microsoft board observer on the board, you've now got the Apple board observer; like, the Microsoft-Apple rivalry, two people on the same board.
By the way, uh, Apple is not known for taking board seats on prominent companies very often. They have done it a couple times. Um, I think there's a Chinese company, I think it might've been Didi, where they, uh, did that briefly, but it's not a standard practice for Apple at all. This will give them some visibility into the machinations of what's going on at the board level and how company business is being handled, basically.
So, you know, that's pretty strategic information. That's just about all they get from it, because it is, you know, just like the Microsoft observer role. Um, but, you know, some intel is going to come out of that. You could imagine situations where the Microsoft observer might request that, uh, the Apple observer leave, for example, if they're going to talk about the sort of Microsoft-OpenAI partnership in a way that's strategic.
Those sorts of requests, as I understand it, are often granted when they are made, but that is a courtesy. Um, so, anyway, kind of interesting, uh, kind of interesting times over at OpenAI. It just reminds you of how insane it is that Sam Altman's been able to make these deals between Microsoft and Apple. I mean, this is like the original OG tech rivalry. So, uh, so there you go.
Yeah, it really makes me wonder if, in some sense, Microsoft isn't entirely opposed to this, in the sense that we have talked about how there were antitrust investigations into Microsoft and OpenAI in Europe and in the US. So similar to how in the nineties Microsoft kind of propped up Apple, in a kind of funny way, amid antitrust considerations against them, well, maybe they do want OpenAI to partner with other companies.
On the other hand, it does make me wonder: uh, Microsoft apparently paid and invested in OpenAI for this exclusive license to GPT technologies, which was always pretty vague and unclear as to what that means. Well, the exclusivity there apparently is not, uh, holding up. So, very curious situation. So that, okay, that's a great point, number one.
Number two, something it also brings to mind, right, is OpenAI's board, so, so the voting part of the board, not the observers, but the voting part of the board, is charged with determining when officially OpenAI can be said to have achieved AGI. That matters because when they do, that requirement on the part of OpenAI to share their technology with Microsoft no longer applies: OpenAI is not required to share access to technology that's AGI and above.
And so, you know, you can imagine, if you have Apple now, even as an observer, maybe influencing conversations on the board, trying to steer things towards, say, recognizing earlier on that they have hit AGI or some similar threshold. That could be an interesting way to throw a wrench in the gears, anyway. So we'll have to see. No, you're totally right. This is such a weird situation. We're just completely off the edge of the map here.
So, uh, we'll see what Phil Schiller does. Yeah. You know, if there are any antitrust experts listening, uh, feel free to chime in. I think it's very interesting to think on whether that might be a possibility. And the last story: AI video startup Runway is looking to raise $450 million, and that would put the valuation of the company at four billion dollars. This is following up on previous fundraising in 2023.
They raised $141 million from investors such as Google and NVIDIA, and were valued at $1.5 billion. So yeah, you could say that things are heating up quite a bit for Runway, and, uh, we'll be interested to see if they actually do raise that amount and increase their valuation by that much. On to projects and open source. First up, Kyutai open sources Moshi, a real time and native multimodal foundation AI model that can listen and speak.
So this was a pretty big thing on Twitter, I think, where this model, Moshi, is pretty much what GPT-4o demonstrated: a real time multimodal model that you can talk to in near real time. The model was fine tuned with 100,000 oral-style synthetic conversations and has an end-to-end latency of 200 milliseconds. And so yeah, people were pretty excited to have a GPT-4o type model.
I will say that, looking outside the demo at some videos posted on Twitter of people trying it out, it's not very impressive, but still, people are definitely working on these kinds of models. Yeah, I feel like that's happening an awful lot these days.
We're definitely, we've crossed the uncanny valley in demo land, but beyond that, it sort of, like, reminds me of, um, I'm trying to remember what the name of that company was, but it starts with a V. Anyway, um, you know, back in the day there was, uh, ah, this is going to crush me. Anyway, there was some company that was really good at doing demos of AI stuff and it just never, never came together. And now it's really bothering me. It doesn't matter. Doesn't matter.
I'm not bothered by this at all. Um, so this is a, yeah, 7 billion parameter model, uh, which, you know, is surprisingly small, especially for a multimodal model. You know, that's, that's probably where they're getting the low latency from too; it's not that much to kind of propagate through at inference time. Um, but if it's also, you know, not delivering in quality, then that's, you know, maybe a more fundamental issue. But, uh, yeah, the diffusion of this stuff proceeds apace.
Look, if this model actually worked, um, you'd have some really interesting questions about, uh, responsible use, right? You've got a system here that can do the text to speech stuff, can do all the GPT-4o stuff, again, if it actually could, which it kind of can, but really kind of can't. Um, and, uh, you know, we ought to be asking ourselves, like, what does this imply about, you know, scaled, uh, not just disinformation, but identity theft and all the usual things.
If, you know, phone calls, AI generated phone calls, are just an API, uh, call away, then that's, uh, you know, probably a bit of a shift in terms of that risk landscape too. But, uh, but anyway, I'm not at all bothered by the fact that I can't remember the name of that company that starts with a V now. And, uh, and that's all I got. That's a good thing. I guess, uh, sometimes you've got to just let things go.
Yeah. Yeah. Like that company whose name starts with a V. Anyway, next we have a paper, MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation. This is a partnership of a few organizations, and they show that on multimodal benchmarks with multiple choice questions, often the numbers are a little bit wrong. So you can use biases in the questions to guess the answer pretty well without even using the image.
These are the sorts of things that happen often with benchmarks: you know, you realize after release that the way the models are getting good performance is not really the way you want them to. So in this case, in the updated benchmark, there are 2,138 question triplets, uh, two thirds of which were manually labeled by human experts, and some of the other ones are from existing benchmarks.
And as a result, this benchmark is more challenging, with the best LLMs lagging behind human performance by 31 percent, compared to 8 percent on previous benchmarks. And that's, I think, what more and more of these new benchmarks are seeking: you know, actual challenge, where these models aren't at human level. Yeah, it's, it's so interesting, you know, not at all, frankly, how I would have expected the space to play out, like, you know, 10 years ago, five years ago.
Um, you know, you're looking at it now, and it's like we seem to be smashing, um, you know, human level performance across such a wide range of, of tasks.
And then for the ones where we're not, um, so often it kind of feels like this, uh, you know, this famous, like, Goodhart's law problem, where the minute that you identify a metric and make it a target that you're going to try to optimize for, all of a sudden, yeah, you'll find ways to move that metric that don't actually touch the kind of original underlying cause of that metric being so low.
So you know, you can kind of overfit, let's say, to a particular performance metric and, uh, not actually address the substance of what it used to be getting at. You can hack that particular metric. So you know, this is kind of highlighting that that is absolutely an issue, especially in the multimodal domain. Maybe a bit less surprising, because that domain is a bit newer.
Um, but yeah, their big focus here is on the sort of, like, type one error, uh, to kind of abuse my statistics language, a bit of a, like, kind of false positive type error, where you incorrectly conclude that, yeah, your model's really good, um, when in fact what's going on is it's, you know, using cues in the question that kind of don't require you to look at the actual multimodal part of the problem, at the image, for example. Um, you know, it's the kind of thing where,
if, just from the way the question is asked, there's a little bit of information that makes you a little bit more likely to answer the question without even needing to look at the, um, the image that it's supposed to be coupled to, then that can be a thing. And I think that's going to be an ongoing challenge here. But, uh, yeah, another new benchmark that shows a surprising gap between humans and AI systems is always good. It's an area for, for improvement, for sure.
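A minimal sketch of the kind of check the paper is getting at, image-blind accuracy on "multimodal" multiple-choice questions, with `ask_text_only_model` as a hypothetical stand-in for an LLM call:

```python
# Image-blind accuracy check: score a "multimodal" multiple-choice benchmark
# without ever passing the image. `ask_text_only_model` is a hypothetical
# stand-in for an LLM call.
import random

def ask_text_only_model(question: str, options: list[str]) -> str:
    """Stand-in: a real test would send only the question text to an LLM."""
    return random.choice(options)

def blind_accuracy(benchmark: list[dict]) -> float:
    correct = sum(
        ask_text_only_model(item["question"], item["options"]) == item["answer"]
        for item in benchmark
    )
    return correct / len(benchmark)

demo = [
    {"question": "What color is the traffic light in the image?",
     "options": ["red", "green", "blue"], "answer": "green"},
] * 300
print(f"image-blind accuracy: {blind_accuracy(demo):.2f} (chance = 0.33)")
# If a real model scores well above chance here, the questions leak answers
# through text alone, which is the miscalibration the paper's question
# triplets are meant to catch.
```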
Next story, also on model evaluation, and this one is about how Anthropic is pushing for third party AI model evaluations. So they had a blog post, titled "A new initiative for developing third-party model evaluations," where they laid out three key areas of focus for evaluation development: AI Safety Level assessments, advanced capability and safety metrics, and infrastructure, tools, and methods for developing evaluations.
And the third party aspect of this is that they are seeking proposals; they have an application form for, I suppose, anyone to, uh, submit a proposal for an evaluation. They also have a new email address if you have questions, an evals initiative address at anthropic.com. So, uh, yeah, it makes a lot of sense. I think this is part of the conversation about safety needing, uh, independent, uh, oversight, and this seems to be in the spirit of that.
Yeah. I was, um, at some conference a couple of weeks ago where, uh, I think, you know, Jack Clark gave some, some, um, address about this. And this is basically core to Anthropic's big governance play now. So one of the things that they're imagining is this future where you have these things called regulatory markets, where basically you have the big companies building out their AI systems that could potentially be dangerous.
And then you have this ecosystem of smaller companies that are tasked with performing audits and other things like that for those bigger companies. And those smaller companies, Anthropic one day hopes, or at least Jack Clark one day hopes, will have some sort of government mandate or support: getting certification from those private companies will be required in order for the big companies to release models, and maybe to do some development work as well.
Um, you know, I, I think there are some, some significant issues with this, uh, in terms of how far you could go with things like misalignment risk, with things like, um, CBRN, like chem, bio, radiological, and nuclear risks, just because, you know, you need so many companies to be performing these audits, uh, with clearances and, and, you know, accessing very, very sensitive data. But, you know, that doesn't mean it can't work.
Um, and this is certainly them firing the initial shot here, really trying to drum up this ecosystem. They've done a great job of that. They've, you know, brought on a whole bunch of, uh, great AI evals companies to audit their models in the past. So they're really trying to create, stimulate that demand at the, at the ground floor, um, which, you know, is just a good thing, right? You want more companies that can do more of these audits. Um, I will say one thing.
Looking at their AI Safety Level assessments section, right, this is one of the three key areas that they want to source, uh, contractors for, it was kind of interesting to see them break down the risks, essentially, that they want to look at their models for. You know, they call out your standard, you know, cyber attack, cyber offense risk; chem, bio, radiological, and nuclear risk, this is sometimes known as CBRN risk,
We've seen them call that out before. There's autonomy as well: can the model self-replicate or gather resources on its own, the so-called survival and flourishing evals that we've talked about before. And then there's social manipulation, and then there's misalignment risk. They have misalignment risk as its own category, independent of autonomy, which I think is both interesting and important.
That's, I think, the right call, but it's interesting that they're separating the ability to self-replicate, to escape if it wanted to, from the actual impetus, the desire or incentive to do that. So yeah, I thought it was a really interesting breakdown. Anthropic is good at this; they do a good job of laying out, in a very structured way, their thinking on a lot of these risk sets.
So yeah, a great read, and we'll see what comes out of this. Yeah, hopefully what comes out is no AI doomsday. But then what would we talk about, Andrey? And one last story for this section is about Mozilla Llamafile. This is coming out of the AI Engineer World's Fair, which just happened in the Bay Area.
So the announcement here is about Llamafile, an open source project: llamafiles are files that package together the weights of an LLM with the software needed to run it, the idea being that with these files, it'll be much easier to deploy models on device and on various platforms. Mozilla, of course, is the developer of Firefox; they are pretty big in the open source space. So this is an open source initiative
to power local AI, and Mozilla also unveiled this Mozilla Builders project, an accelerator that offers $100,000 in funding for open source projects that advance the promise and potential of local AI. So many companies are trying to become a player in the open source space, and this is an interesting contribution from Mozilla. It's also, you know, Mozilla, obviously a big contributor to the open source game; that's really become their big differentiator, especially in the last couple of years.
But they've always been into it, you know, hey, the Firefox browser, there you go, a big open source contribution. So now more and more so on the AI side. One of the interesting things about this, too, is that llamafiles are going to "intelligently utilize," as they put it, they could have said "use," just saying, available hardware.
So for example, they can use a GPU for faster performance if one is available, or fall back to your CPU if not. So that's kind of interesting; they've done a lot of optimization, and there's a bit of meta-awareness of the hardware ecosystem around this thing. It's a single executable file, so basically, you know, double click, if you will, and start it running. So yeah, really cool and consistent with their wider approach.
On to research and advancements. First up, we have "Researchers upend AI status quo by eliminating matrix multiplication in LLMs." So matrix multiplication is basically most of what neural nets and large models do: they take some weights and some inputs, multiply them a bunch of times, and that produces the outputs. This paper, coming from a few universities, offers a way to get around that by using ternary values, a fancy term for just using negative one, zero, and one
instead of floating point numbers, things like 1.1, 1.2, and so on. And they have implemented these matmul-free linear gated recurrent units built from basic arithmetic operations. The big benefit of the approach is, of course, reducing memory usage and potentially also improving speed. And they do have some experiments scaling models up to 2.7 billion parameters, showing that they are comparable in performance while using much less memory and having better latency.
This is yet another example of what we've seen with the one-bit models published recently; it appears to be quite similar. And so we have to wonder: we've seen a couple of these papers generate some discussion, but it still has yet to be proven at larger scales. I think that's exactly right, especially on the sort of speculative nature of this. It is exciting.
So just for context, essentially what's happening here is that instead of doing the matrix multiplication algebra that usually has to go into making a transformer do what it does, they're doing something that is more like addition; they're kind of swapping out multiplication for addition. Intuitively, you can see how that would work, right? When you're just dealing with three possible values, negative one, zero, and one,
the operation space is a lot more constrained, and so multiplication and addition start to allow you to reach just about as far in that value space. Anyway, the bottom line is that addition is also a lot easier, a lot faster to do. And so by stripping out the multiplication side, you're able to run these things much more efficiently.
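To make that intuition concrete, here's a minimal sketch, not the paper's actual MatMul-free architecture, of why ternary weights in {-1, 0, 1} let you replace a matrix multiply with pure additions and subtractions; the `ternary_quantize` helper is a crude stand-in for a real quantization scheme.

```python
import numpy as np

def ternary_quantize(w: np.ndarray) -> np.ndarray:
    """Crude ternarization: scale by the mean magnitude, then round to {-1, 0, +1}."""
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1)

def matmul_free_linear(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute w_ternary @ x using only additions and subtractions: each weight
    either adds its input, subtracts it, or skips it entirely."""
    out = np.zeros(w_ternary.shape[0])
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

w = np.random.randn(4, 8)
x = np.random.randn(8)
wt = ternary_quantize(w)
print(matmul_free_linear(wt, x))
print(np.allclose(matmul_free_linear(wt, x), wt @ x))  # matches the dense product exactly
```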
And they actually do this on their own FPGA, basically custom hardware, because ultimately all the hardware that exists is so optimized for transformer architectures that if you come out with a crazy good idea for a new kind of architecture, you're probably going to find it underperforms a lot, just because it's handicapped by the fact that hardware is so optimized for the current paradigm.
One little note here: there was a blog post I was looking at, I think it was on LessWrong, that went into this a little bit; I just looked up the title and it popped up. They highlight a couple of important caveats. One of them is about the curves that they use.
So just for a little bit of context, the key thing they're doing here is demonstrating that the scaling curves you get for this architecture actually seem to look better than the scaling curves you get for transformers. With traditional transformers, when we talk about scaling curves, basically we're talking about how low the loss gets, how well the model performs, as you pour more compute into it, as you put more FLOPs into the training process.
And what they're showing here is that the curve for this addition-instead-of-multiplication architecture they're using is actually steeper. So as you increase the amount of FLOPs you pour into your model, you're supposedly getting more returns. That would be a giant deal.
And these curves actually intersect at 10 to the 23 FLOPs or so, which is the scale of a lot of current open source models; it's not a huge, huge scale. So the argument here would be that it's already better to use this kind of architecture, assuming the hardware is there for it. The challenge, though, is that they're drawing that conclusion based on three points on this curve. They're really extrapolating from three points.
Given how these points look, I'm just looking at the raw data here, you could very credibly make the argument that this could just be noise, basically. So to your point, Andrey, you would want to see this done at more scale. You just want to see more points on that scaling curve plotted out, do more training runs, and verify that this is actually the trend you expect to hold.
Because right now this could, frankly, be a bit of a nothing burger if it just turns out that when you redo it, the traditional transformer either matches or outperforms this architecture.
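To illustrate why that matters, here's a hedged toy example with synthetic numbers, not the paper's data: fit log-log scaling lines through just three compute points per architecture, and the predicted crossover point jumps around a lot under a small amount of noise.

```python
import numpy as np

rng = np.random.default_rng(0)
compute = np.array([1e19, 1e20, 1e21])   # training FLOPs for three hypothetical runs

def fit_loglog(loss):
    # Scaling "laws" are straight lines in log-log space: log(loss) = slope*log(C) + intercept.
    slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
    return slope, intercept

for trial in range(3):
    # Two made-up architectures with slightly different exponents, plus ~2% noise.
    loss_a = 10 * compute ** -0.05 * np.exp(rng.normal(0, 0.02, 3))
    loss_b = 17 * compute ** -0.06 * np.exp(rng.normal(0, 0.02, 3))
    (sa, ia), (sb, ib) = fit_loglog(loss_a), fit_loglog(loss_b)
    # Where the two fitted lines intersect; this estimate can swing by orders of
    # magnitude (or blow up entirely) when the fitted slopes nearly coincide.
    crossover = 10 ** ((ib - ia) / (sa - sb))
    print(f"trial {trial}: predicted crossover ~ {crossover:.3g} FLOPs")
```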
But still, I think it's interesting and noteworthy that such a simple architecture, one that just does addition, works. This fundamental assumption, which I think it's fair to say we all would have made, that multiplication is pretty core to what transformers are up to and the magic that happens under the hood, doesn't seem to be the case. So it makes you wonder what other low hanging fruit might be out there.
And that's probably a pretty big field of possibilities at this point. Exactly. And we've seen this a lot with quantization, right? It's pretty well known now that you don't need all the bits; you can reduce things to three bits, four bits, et cetera. Let's see if we can take it down to, what, two bits, I guess, in this case. I'm just happy that they didn't call the paper "Addition Is All You Need," because that would have been a little obnoxious.
And speaking of popular paper name formulas, the next one is "AI Agents That Matter," a title pattern we've seen maybe once or twice before: a few years ago there was "Deep Reinforcement Learning That Matters," which called out the evaluation practices of reinforcement learning and basically pointed out that they suck, that they don't really compare things well because you don't control for things like seed values and randomness.
That's essentially what this paper is doing for agent evaluation. And the big thing they point out is that in evaluation, you cannot just report accuracy; you need to report accuracy while accounting for cost, in the sense that with agents, if you just run more of them or run them for longer, you are likely to get better results. So we need standardized benchmarking practices that allow for actually informative comparisons and reproducibility.
And there are a few other things: they do analysis of failure modes, for instance overfitting to a single task, meaning agents won't generalize, things like that. So I think it makes a very good point for the space. Agents are, you could say, the next frontier; they're seeing a lot of work and a lot of investment and so on. We are not quite there yet, but a lot of people think this is the next thing to crack.
And this is one of those things that shows we are still a little bit early on the road to that. Yeah, I really liked this paper. It was a great example of questioning some basic assumptions; you're always surprised when you see papers like this, because you're like, wait, nobody asked this question? But then again, it's not like I did. So there you go. They do a great job.
They set up a couple of different baselines that they compare against each other. So for example, they show you, okay, GPT-4 models just out of the box, zero shot, no agent architecture: how do they perform? Then they have a version they call retry, where you just repeatedly call the model, up to five times in this case, and try to get it to solve the problem, and see how well that does. Then they have a version that's kind of similar.
They call it warming: same idea, but you turn up the temperature of the model with each run. The temperature is this parameter that kind of controls the creativity of the model, so how often it will generate a word that is maybe not the most likely next word, the one it thinks is most optimal, but instead sample from the wings of the distribution a little bit more, swing for the fences. So that's the warming baseline.
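As a quick aside for anyone who wants the mechanics, this is just standard softmax temperature sampling, nothing specific to this paper: dividing the logits by a higher temperature flattens the next-token distribution, so less likely tokens get picked more often.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.2])   # model scores for three candidate next tokens

def next_token_probs(logits, temperature):
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

for t in (0.2, 1.0, 2.0):
    print(t, np.round(next_token_probs(logits, t), 3))
# At T=0.2 the top token dominates; at T=2.0 the distribution is much flatter.
```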
And then there's this thing called escalation, where they start with a cheap model, in this case Llama 3 8B, and escalate to a more expensive one, say GPT-3.5 or something like that, if they encounter a test case failure. So those are the four paradigms: pretty basic approaches, not really the agent-like architectures you would expect.
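Here's a hedged sketch of those three baselines as described; `call_model` and `passes_tests` are toy stand-ins for an LLM API call and the benchmark's test harness, not the paper's actual code.

```python
import random

def call_model(model, task, temperature):
    # Stand-in for a real LLM API call; returns a fake "answer" string.
    return f"{model} answer to {task!r} at T={temperature:.2f} ({random.random():.2f})"

def passes_tests(task, answer):
    # Stand-in for the benchmark's test harness; randomly "passes" 30% of attempts.
    return random.random() < 0.3

def retry(task, model="gpt-4", attempts=5):
    for _ in range(attempts):
        answer = call_model(model, task, temperature=0.0)
        if passes_tests(task, answer):
            return answer
    return answer  # give back the last attempt if nothing passed

def warming(task, model="gpt-4", attempts=5):
    for i in range(attempts):
        temperature = i / (attempts - 1)  # ramp temperature from 0.0 up to 1.0
        answer = call_model(model, task, temperature=temperature)
        if passes_tests(task, answer):
            return answer
    return answer

def escalation(task, models=("llama-3-8b", "gpt-3.5-turbo", "gpt-4")):
    for model in models:                  # cheapest model first, escalate on failure
        answer = call_model(model, task, temperature=0.0)
        if passes_tests(task, answer):
            return answer
    return answer

print(warming("Write a function that reverses a string."))
```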
And they draw a couple of, to me, really interesting conclusions. One of them is that state-of-the-art agent architectures, at least on HumanEval, don't outperform these simple baselines. They find there's no significant accuracy difference between, in this case, their warming strategy, so a pretty out-of-the-box baseline, and the best performing agent architecture, at least for this eval.
HumanEval is essentially a coding dataset where you have coding tests the model has to pass, and yeah, there's no measurable increase in performance with these fancy architectures. Furthermore, a lot of these fancy architectures suck in terms of cost. So not only are you not getting better performance relative to these fairly vanilla, essentially prompt engineering strategies, but you're also getting outrageously higher costs.
They find that techniques like Reflexion and this thing called LDB, the LLM debugger, cost 50 percent more than the vanilla warming strategy, and another one called language agent tree search, which we've covered on the podcast before, costs over 50 times more, while yielding basically the same results.
So then they end up showing this curve of the cost of these agents on the x axis and the accuracy, basically the performance, on the y axis, and they trace out what's called the Pareto frontier, the frontier of what's optimal. And from there, you can really see how some of these architectures just fail to perform as well as the escalation or warming baselines, which again are not fancy; they're really basic.
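For the visual, here's a tiny sketch of what being "on the Pareto frontier" means in this setting, with made-up numbers rather than the paper's measurements: a method stays on the frontier only if nothing else is both cheaper and at least as accurate.

```python
results = {   # (cost in dollars per task, accuracy) -- illustrative numbers only
    "zero-shot GPT-4": (0.10, 0.80),
    "warming":         (0.15, 0.87),
    "escalation":      (0.05, 0.85),
    "LDB":             (0.22, 0.87),
    "LATS":            (7.50, 0.86),
}

def pareto_frontier(points):
    frontier = []
    for name, (cost, acc) in points.items():
        dominated = any(
            c <= cost and a >= acc and (c, a) != (cost, acc)
            for c, a in points.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(results))  # only the cheap baselines survive in this toy example
```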
And so this really calls into question, they're going to argue, whether it even makes sense to bother with these very expensive, very complex architectures when pretty simple, naive prompting strategies do the trick. Yeah, and that is something we've seen in many cases, or at least I've seen over the years with deep learning and machine learning.
Every once in a while you have this kind of paper that comes out for a given domain, in this case agents, but it's also happened for video benchmarks, for question answering in NLP, a lot of different things, and it basically says, look, it turns out these numbers are mainly due to how we are doing the benchmarking, or maybe the benchmark itself; for instance, you can guess the answer without even looking at the video or the image.
And then they say, well, here's a very simple thing that turns out to work way better than all these other things. So yeah, it's very much a useful kind of paper that comes out every once in a while to point out the methodological flaws of the community as a whole. In the lightning round, we start with a bit of a nerdy paper. This is such a nerdy paper. It's really quite nerdy, I would say, but we are going to cover it nonetheless.
It's called "WARP: On the Benefits of Weight Averaged Rewarded Policies." If I understand correctly, it's tackling this problem: when you do reinforcement learning from human feedback on models to align them with what humans want, you're starting with a trained model that already has a bunch of knowledge baked into it. And so if you train it some more to make it do certain things you want it to do, it may, as a result, lose some of the stuff it had prior to this additional training.
It can forget pre-trained knowledge. And so usually when you do this, there is a technique called KL regularization, which essentially makes the model stick pretty close to what it was prior to the RL. This paper's weight averaged rewarded policies strategy shows that you can get away from vanilla KL regularization a little bit; you can do some fancier stuff to improve the ability to optimize in the RL stage while still retaining information. Yeah, exactly.
That's it. So this basically starts from the problem that in reinforcement learning from human feedback, usually what you do is create this thing called a reward model, which you train from a lot of interactions with humans, a lot of upvotes and downvotes on different content. With that reward model, naively you would think, okay, I can just train my language model to generate outputs that get a high predicted reward,
outputs that the reward model would score well. The problem is that you end up with these perverse optimizations where the model will find ways to make that reward model give a really high score, but that are basically just hacks. It finds the breaking points, the ways in which the reward model doesn't fully capture what you want, because how could it? Human desires and values are so complex.
And so what you do is say, okay, well look, head a little bit in the direction of what that reward model wants, but don't change too much, right? We want to make sure you don't change your behavior too much.
And so you include this anchor, as it's sometimes referred to, this Kullback-Leibler (KL) divergence score that says, hold on, you're starting to behave a little too differently from your original behavior, so let's not go too far in that direction. And then the perverse optimization doesn't happen.
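For concreteness, here's a minimal sketch of that KL-anchored objective as it's typically written, illustrative rather than this paper's exact formulation: the policy gets the reward model's score minus a penalty proportional to how far its token probabilities have drifted from the frozen reference model.

```python
import numpy as np

beta = 0.1   # strength of the KL anchor

def kl_regularized_reward(reward_model_score, logprob_policy, logprob_reference):
    # Per-sequence approximation: the summed log-prob gap over the sampled tokens
    # estimates KL(policy || reference) for this completion.
    kl_estimate = np.sum(logprob_policy - logprob_reference)
    return reward_model_score - beta * kl_estimate

# Toy numbers: the completion pleases the reward model (score 2.0), but its tokens
# are noticeably more likely under the policy than under the reference model.
logp_policy    = np.array([-1.0, -0.5, -0.8])
logp_reference = np.array([-1.4, -1.1, -1.0])
print(kl_regularized_reward(2.0, logp_policy, logp_reference))  # 2.0 - 0.1 * 1.2 = 1.88
```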
This technique says, okay, for any given fine-tuning run like that, let's say we make five different copies of our language model and run each through this process independently, where we train it against our reward model along with that KL term so it doesn't go off in some random direction.
We'll do the same thing with the second copy and the third copy. These copies are going to learn to get high scores from that reward model in slightly different ways, and so they'll all learn slightly different things about what humans actually want, as encoded in that reward model.
And so what you then do is train a bunch of models in that way and merge them together, and it's actually a dead simple technique: basically you average their weights, their parameter values, together. It's a little bit fancier than that, but that's roughly the idea. And then they essentially repeat the process; there's a fancy way of doing this, but roughly, they take that new merged model and run it again.
They use it as the new language model, create five copies of it again, run through this process where each copy gets trained independently against the reward model with the Kullback-Leibler divergence so it doesn't go too crazy, and then merge them again. And so they go through these iterative cycles to gradually get a more and more aligned model that doesn't go off the rails, and the results are pretty interesting.
It's actually, again, one of these Pareto frontier things, because by controlling the number of times they go through that loop, they can more finely control how much you orient the model towards what the reward model wants, towards your assessment of human preferences, versus how much you get it to remain as it was, staying close on that Kullback-Leibler score, so that it doesn't stray too far from the knowledge it originally contained.
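Here's a rough sketch of that train-several-copies-then-merge loop as described; it's a simplification (the real method uses fancier merging than a plain average), and `toy_rlhf_finetune` is a hypothetical stand-in for an actual KL-anchored RLHF run.

```python
import copy
import random

def average_weights(models):
    """Merge by averaging each parameter across a list of weight dicts."""
    return {name: sum(m[name] for m in models) / len(models) for name in models[0]}

def iterate_and_merge(init_weights, rlhf_finetune, n_copies=5, n_iterations=3):
    anchor = copy.deepcopy(init_weights)
    for _ in range(n_iterations):
        # Independently fine-tune several copies against the reward model, each
        # anchored by a KL penalty to the current weights.
        copies = [rlhf_finetune(copy.deepcopy(anchor), kl_anchor=anchor)
                  for _ in range(n_copies)]
        # Merge them; the merged model becomes the next iteration's starting point.
        anchor = average_weights(copies)
    return anchor

def toy_rlhf_finetune(weights, kl_anchor):
    # Stand-in for a real KL-anchored RLHF run: just nudge the weights randomly.
    return {k: v + random.uniform(-0.1, 0.1) for k, v in weights.items()}

print(iterate_and_merge({"w1": 0.0, "w2": 1.0}, toy_rlhf_finetune))
```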
Anyway, that's basically the overview. It's a super nerdy paper. Sounds super nerdy, but also worth noting: this is coming from DeepMind. And so while it seems theoretical, it could well have significant impact in practice, since, of course, the RL from human feedback stage is essential to developing a good chatbot. Next up, "Scaling Synthetic Data Creation with 1 Billion Personas."
The authors here introduce this thing called Persona Hub, which is a collection of 1 billion diverse personas for LLMs. They curated these from web data, and they're meant to mimic different possible ways of thinking, different kinds of knowledge, these kinds of things. The purpose is synthetic data generation, generating data that doesn't otherwise exist via AI, and then being able to use that for training models.
So yeah, I like this paper because 1 billion is a big number, personas sound cool, and Persona Hub sounds cool, and also because synthetic data is one of these things that might be essential moving forward. It's probably already part of the training recipe when you look at GPT-4 or Claude.
And so this idea of having different variations of your AI that ensure the synthetic data is actually useful, that produce varied data, could be an approach towards that. All right, cool. So we know that one billion is Andrey's big-number threshold; that's good to know. So this is actually a really interesting paper as well, because I think it highlights how counterintuitively hard it is to generate a bunch of synthetic data.
This is not something most people would expect; you'd think, well, you've got GPT-4, right? Just ask it a bunch of questions and you'll get your data out. But of course you're limited then by how many different ways you can phrase the question, because if you ask the same question identically twice, you're going to get the same answer
if the temperature is down to zero (modulo some reasons why you actually might not, but those aren't terribly relevant here). So this is a bit of an issue. The solution they came up with was: hey, wait a minute, if we can create a bunch of descriptions of personas, a gigantic number, let's say a billion, so it'll impress Andrey, then we can take those descriptions of personas and attach them to a bunch of questions.
So think about it like this: okay, explain how a rocket works. You might get one output for that. Now explain how a rocket works for a software engineer, mother of four, in Oklahoma. Now explain how a rocket works for a doctor in blah, blah, blah. Right? So depending on the persona you tailor it to, you're going to get a different output
that shares maybe a different aspect of the knowledge that was encoded in the model you're prompting. And that's important, because really, when we're talking about generating synthetic data, we're talking about trying to elicit as much as possible of the information that is contained latently in the model we're prompting. And the only way to do that is to poke it from every possible direction we could possibly imagine poking it from.
And personas are identified here as the way to do that. The big question is how you actually generate the personas. Essentially, the technique they use here, the text-to-persona strategy, is to give a model a piece of text that they pull from the internet and ask the model to describe the kind of person who would read, write, like, or dislike that text, right?
So you're going to get from that a description of a persona, and you can get a giant number of them because you're querying a model. One of the challenges you run into with that is that you might not find all possible personas represented there, because the text on the internet is disproportionately written by certain kinds of people.
And so they add onto that a persona-to-persona stage, where they derive new personas based on the kinds of interpersonal relationships they would expect to exist for the first set of personas. So basically, figure out who are the kinds of people who would work with or interact with the personas you previously identified. So then you end up with this gigantic set of personas.
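A hedged sketch of those two stages plus the persona-conditioned generation step; `call_llm` is a hypothetical stand-in for any chat-completion API, and the prompts are paraphrases of the idea, not Persona Hub's actual templates.

```python
def call_llm(prompt: str) -> str:
    # Stand-in: in practice this would be an API call to an instruction-tuned model.
    return f"<LLM output for: {prompt[:60]}...>"

def text_to_persona(web_text: str) -> str:
    return call_llm(
        "In one sentence, describe the kind of person who would be likely to read, "
        f"write, or be interested in the following text:\n\n{web_text}"
    )

def persona_to_persona(persona: str) -> str:
    return call_llm(
        f"Given this persona: '{persona}', describe a related persona who would "
        "closely interact with them (a colleague, client, family member, etc.)."
    )

def generate_synthetic_example(persona: str, task: str) -> str:
    # Prefix the persona so the same underlying task elicits different knowledge.
    return call_llm(f"{persona}\n\n{task}")

seed = text_to_persona("An article on orbital mechanics for hobbyist rocket builders.")
related = persona_to_persona(seed)
print(generate_synthetic_example(related, "Write a challenging math word problem."))
```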
You trim them down using various techniques, because you'll have a ton of overlap, a ton of redundant personas, and after doing that they end up with over a billion personas. Worth flagging: this is a paper by Tencent AI Lab Seattle, and Tencent is a Chinese company for all intents and purposes. And so it's notable that they write, "we are open to releasing more data."
They released, I think, about 200,000 of these personas openly, but they say they are open to releasing more data when they can better assess the potential risks and concerns, which will be discussed later in detail. It's basically a CCP-linked lab, but okay. At least I should say it's Tencent AI Lab Seattle, which may have, let's say, a complicated relationship with Tencent, but it certainly makes you think.
Yeah, and some examples of personas: an enthusiastic amateur golfer who is passionate about combining sports with philanthropic causes, and also a senior software engineer who encourages undergraduates to consider the social impact of their algorithms. Some of the shorter ones they have on the first page: a chemical kinetics researcher, a moving company driver. So that gives you a bit of a taste.
There you go; they use those personas to create a bunch of data for different interactions. I think it's a bit of a less nerdy paper, because personas are fun to think about. And there is actually kind of a big headliner here, which they buried a bit: they do this with an open source 7 billion parameter LLM, a Chinese model, Qwen2 7B. They actually ran with this.
They do fine-tuning on synthetic data they generate, which is math related, and they actually end up with performance on the MATH benchmark that beats Claude 3 Opus. Until, I don't know, a couple of weeks ago, this was the leading Anthropic model. This is the big model, right? It's like a Gemini Ultra; it's the largest-scale version of the Claude 3 family. And a 7 billion parameter model beats it.
It basically matches GPT-4 Turbo, by the way, the preview version, in accuracy on the MATH benchmark. Looking at one particular benchmark doesn't always give you the complete picture, but it is noteworthy that you can do this just by cranking out synthetic data at scale. If you have these personas, this technique, at least so far, seems to work quite well.
And one last paper: "Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization." This will be a little bit of a mid-nerdy paper. Mid-nerdy, okay. Yeah, yeah. So they look at the problem that when you have long inputs to your model, if the thing you care about is in the middle, LLMs typically are not as good at using it. They're better at looking at the beginning and the end of the input.
This is a problem known as the lost-in-the-middle problem, and one big reason for it is an intrinsic attention bias, where tokens at the beginning and end of the input receive higher attention regardless of relevance, just due to how attention and positional embeddings are computed.
So to fix that, the researchers propose a calibration mechanism, found-in-the-middle, that corrects that bias and, as a result, makes it so the model can better use stuff in the middle of your prompt. Yeah, and the fix, the calibration, is surprisingly straightforward.
Roughly speaking, what they end up doing is coming up with a dummy document that they put at various positions in the context window, and then seeing what the attention score on that dummy document is on average, the average attention score across all the tokens in that document.
And that allows them to say, okay, since it's an irrelevant dummy document, we can just subtract that attention score from whatever we get for our actual document at that position in the context window. And that should give us a calibrated version that doesn't have this positional bias.
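Here's a toy numerical sketch of that calibration idea (not the paper's exact procedure): estimate how much attention each position gets for an irrelevant dummy document, then subtract that positional baseline from the scores for real documents.

```python
import numpy as np

positions = np.arange(10)

# Hypothetical attention measured for a dummy document placed at each position:
# U-shaped, i.e. the beginning and end get more attention regardless of relevance.
dummy_attention = 0.5 + 0.4 * ((positions - 4.5) / 4.5) ** 2

# Hypothetical attention for real documents, where the relevant one sits at position 5.
real_attention = dummy_attention + np.where(positions == 5, 0.3, 0.0)

calibrated = real_attention - dummy_attention      # remove the positional bias
print(int(np.argmax(real_attention)))              # 0  -> biased toward the start
print(int(np.argmax(calibrated)))                  # 5  -> the actually relevant document
```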
So again, one of these cases where it's so simple, you find yourself reading these papers and wondering, how has nobody thought of this? But of course there's just so much going on, and the reality is a lot of these simple techniques don't end up working, be it because they're operating at crazy scales, or everything else is already so optimized, or whatever else. So yeah.
And interesting that we're still learning such fundamental facts about how the context window works and how attention values are computed. Moving on to policy and safety, the first story has a bunch of stuff in the linked article, but one main point, which is that with Chevron's demise, AI regulation seems dead in the water. So this is in the US.
Recently, the Supreme Court struck down Chevron deference, a doctrine that allowed federal agencies to interpret ambiguous congressional laws. With that struck down, nationwide AI regulation, and generally a lot of policy coming from the executive branch of the government, is less powerful. And this story comments on that. It means that courts will have to exercise their own legal judgment.
And regulation will need to be able to survive specific legal challenges. Given that these are pretty technical subjects, we'll see if the courts are able to deal with this. Chevron deference is a really big deal, actually. It sounds boring, but it's actually kind of an interesting concept, and it's super, super important.
I mean, this is one of the most important, probably the most important, policy stories of the last year, and maybe of the next year too, depending on how things go in Congress. But essentially, the idea here is that any time there's a legal statute that's ambiguous... let me take a step back. You know, it's July 5th, we've just had Independence Day, so we're doing a bit of a civics lesson here. It's not going to hurt a bit.
It'll be very short, but I think it's important to frame this up. The U.S. government is set up with three different branches, right? You have the executive, that's the president, basically, and the agencies, the departments of government. You've got the judiciary, which is the judges; they basically interpret law and kind of fill in the gaps. And then you have the legislative, which is basically Congress.
They pass the laws, which then bind the executive branch, determine what the executive can do, and then get interpreted by the judiciary. Okay, those three branches need to be balanced together to ensure separation of powers. That is a fundamental virtue of the American system: no one branch of government is allowed to accumulate too much power. It's a way of preventing tyranny and all kinds of bad outcomes. Good stuff so far. Now, Congress passes a law.
That law is going to contain a crap ton of ambiguities, right? There's no possible way that legislators are going to be able to anticipate every possible contingency, every nook and cranny of possibility space. And so you need somebody to fill in those gaps. You need somebody to say, well, in this particular situation, it's an edge case. You know, the famous saying is hard cases make bad law.
It can be really, really difficult to figure out how to apply the law in certain cases. The question is, who is actually going to fill in that gap? Until recently, when it came to issues regarding technology and regulatory oversight, the government would defer to the regulatory agencies.
So the argument would be, when it comes to matters of finance, if you're looking at financial instruments, stocks, bonds, derivatives, the SEC, for example, just has a ton of expertise. So why don't we just let the SEC interpret what these laws actually mean and then apply them, because they're sort of the technocratic experts on this. The idea is that judges do not have that expertise, right?
Judges are basically just lawyers dressed up in different suits who pass judgment and interpret the law, but not in these highly technical, technocratic cases. Well, what Chevron deference did was exactly that: it allowed agencies to be deferred to in the interpretation and implementation of these rules and regulations. That's been struck down. What this means in practice is that we are now going to rely on the interpretation of the judiciary, of judges.
And judges are not technical experts, which means you're basically forced to get rid of as much ambiguity in the law as possible. Congress needs to be really careful: if they're going to pass AI legislation, they have to make sure that legislation anticipates all the possible ambiguities that could arise. That, frankly, is very difficult to do. It's a giant additional hurdle for this whole process. I don't think it's one that can't be overcome.
We've done a lot of work on specifically this question, but I think the fact that Chevron deference has been repealed, or struck down in a sense, is a giant, giant challenge. The reality is that AI is such a technically nuanced topic. The way scaling works, the way evals work, the way liability ought to work, all these things play into interpreting the law.
And you can't expect judges to have that kind of understanding very quickly. So this is a big issue. Also, you can imagine, if Congress is forced to resolve issues down to insanely high levels of detail in order to, as Axios reporters put it, predict the future, that's basically what they're being asked to do: predict every possible way these laws could be interpreted.
If you're going to ask members of Congress to do that, well, you're going to have an awful lot more of these petty squabbles arise, because now you're having to have all these arguments at a higher level of resolution and detail. And frankly, the system was not set up with that expectation. In fact, historically, it was kind of understood that when Congress writes ambiguous laws, it's implicitly delegating authority to agencies to fill in those gaps.
Now what we're basically saying is that going forward, so this is not retroactive, it doesn't apply to past laws, but going forward, Congress is going to have to do this. They're going to have to get down to the level of resolution and detail that regulatory agencies used to resolve things down to. So this is, I think, a pretty big blow to the U.S.'s nimbleness, right?
It needs the dexterity to respond to things faster than the legislative cycle can allow, especially in AI. So this is a really, really bad development, in my estimation. I don't know about the interpretations of the Constitution that went into this; I am not a lawyer, certainly not a judge. But in practice, the effects of this are very, very harmful.
It does come the same week as the Corner Post ruling, which basically removed a six-year limit on the rights of affected parties to challenge past regulations. So it's part of a kind of wave of deregulation, which normally I'm all for, but in the context of AI, stuff is moving so fast that you need to be able to defer to a faster-moving executive cycle time rather than relying on judges and that sort of process.
There you go, that's your civics lesson, and I've certainly learned a little bit, so thank you, Jeremy, for being the expert on this one. Next up: NVIDIA to make $12 billion from AI chips in China this year despite U.S. export controls. So these $12 billion will come from those specialized chips, the H20 chips. It sounds like there will be deliveries of a ton of them; it seems NVIDIA will deliver more than 1 million of these chips, and they cost between $12,000 and $13,000 each.
And that's where the suggestion of $12 billion in sales comes from. So yeah, there you go. Despite the controls, NVIDIA is still doing a lot of business in China, but these chips are about one eighth as powerful as the very much leading chips NVIDIA is selling in the U.S. So the export controls are having some of the intended effect, even though they're not fully shutting down NVIDIA's ability to do business in China.
Yeah, I think this is a really interesting story about the lagging tail of NVIDIA's business in China. Interestingly, by the way, NVIDIA is actually doing more business in China than they were this time last year, but their income and sales have grown so much overall that, as a percentage of their business, it's dropped by quite a bit.
A couple of interesting take-homes from this. First off, the H20 GPU: you can think of it as NVIDIA taking an H100 GPU, a regular top-of-the-line GPU, and basically frying some of the circuits and shipping it. That's basically how these things are made, and the same is true of a lot of the other derivative GPUs that are meant for the China market.
One of the things here: there's an interesting quote from Dylan Patel, one of the guys who runs SemiAnalysis; I really, really recommend that newsletter if you're interested in AI hardware. He said that although the H20's capabilities on paper are actually worse than its domestic competitor, the Huawei 910B chip, which we've talked about quite a bit,
that's the chip based on SMIC's seven nanometer process, so on paper the H20 should be a little bit below it, in practice it's actually a decent bit ahead. In particular, it has better memory performance; the way the area on that chip is allocated is just a little more optimal than on the 910B. The design of the chip is just better, which, yeah, is maybe not too surprising.
NVIDIA's damn good at this, and Huawei, while they're getting up to speed, is not quite there yet. So the expectation here is that this is going to start to change over time: NVIDIA's H20 GPU is going to fail to keep competing with Huawei's latest and best. And look, those export control constraints that the U.S. government and the Department of Commerce have put in, I don't expect those to get more permissive.
So I think over time, what's probably going to happen is Huawei is going to gobble up more and more of the domestic Chinese market as they and SMIC and other Chinese companies pick up their game and allow a fully domestic GPU market to pop up. Not too surprising; it's kind of an obvious result of the export controls.
And as far as I can tell, internally in the USG, a lot of the people I've talked to expected this is where it was going to go. But it is having the desired effect, it is slowing China down, and I think you've got to take the W, unless you're NVIDIA, I guess. Yeah. And just so it's clear, we are not taking sides here, even though we are based in America.
We just want to cover this export controls story, and we might have personal feelings on slowing China down and so on, but we don't want to sound like we are taking sides. Oh, maybe I should be open about my biases: I absolutely am pro slowing China down. Yes. Okay. Yeah, it's true, we should acknowledge that we have biases and that might impact how we cover these things, but also, objectively, these are the facts in this case.
And another story on those topics: the next one is that Uncle Sam, aka the US, apparently relies on manual processes to oversee export restrictions on China. So this is about the Bureau of Industry and Security, which is responsible for implementing export licensing controls, and it goes into how that agency is struggling, because they have manual processes going back to 2006 and the scale of what they're doing has increased a lot.
Apparently the number of Chinese entities on their list, which includes organizations that US businesses are not allowed to trade with, rose from 218 in 2018 to 787 in 2023. That's nearly 570 additional organizations over a span of five years. And so, yeah, I think it's worth being aware of, just because this is another indicator of how big a deal all of this is and how committed the US is to these things.
And there are quite a few interesting details on what this agency does. Yeah, BIS is really, I mean, it's at the Department of Commerce, one of the most important branches of it, focused on this whole export control piece. And to your point, I thought there was a really interesting stat at the bottom: they look at the roughly 4,000 entity list applications.
So the entity list is basically the set of companies that are big no-nos, that you can't sell to without getting a license. And there were about 4,000 applications reviewed by the BIS over the last six years, and they say around two thirds were approved; those were for about $335 billion worth of goods. The rest, which were fewer applications but actually worth more of the money, around $550 billion or so, were denied or revoked.
One of the important things to note, too, is that, my understanding is, for a lot of these things there's a presumption of denial. So essentially, by default, you should expect that you're going to be denied the license to export, which starts to matter a lot when you have organizations like the BIS that are undermanned, and so you get a longer delay between application and approval and all that.
So that kind of means that by default you're not sending these things out, which is of course the national security objective. So yeah, I think this is a good update on what the BIS is up to. It's kind of like NIST, right? We're relying on them more, and they haven't gotten as much additional funding as they might need to deliver on their mandate.
So it would be great to see Congress deliver a little funding package for NIST, and maybe for BIS while they're at it. And speaking of government organizations and programs, the next one tells us that the US government has launched an initiative to address workforce shortages for the semiconductor industry. So this is addressing the shortage of workers projected for the industry by 2030, and it's going to use part of the $5 billion
allocated to the new National Semiconductor Technology Center, which will distribute grants to different workforce development projects. And of course, this is very significant: the U.S. has invested a lot in its semiconductor industry with the CHIPS Act, and it seems that the rate of university completions is just not keeping up. Apparently, of the new jobs projected for the industry as it grows, around 60 percent are expected to remain unfilled.
So there's a big gap in these specialized skill sets. And there are some good stats in the story: about 50 community colleges have introduced or expanded programs related to semiconductor technology, and companies like Intel, Samsung, and others have allocated money for workforce development.
So yet another aspect of the semiconductor industry that makes it a challenging space to get into, and why governments need to spend hundreds of billions of dollars to even give their national players a shot. And the last story: Bridgewater starts a $2 billion fund that uses machine learning for decision making and will include models from OpenAI, Anthropic, and Perplexity.
Bridgewater Associates will launch this fund, an investment fund, that will use AI to make some of the trading decisions. This is part of a general transition to using more machine learning. And to me, it's interesting in part because the whole theory of OpenAI, basically since forever, was that once we have a super advanced AI, it will just make all the money, right? That's the idea OpenAI has run with for a very long time: they won't be profitable for a while,
and then they'll just make all the money with AI. And I guess this could be one example of how that might work: AI will just make all the money by being the best investor in the universe. And we'll see how that actually turns out. I like how that sentence ended: we'll see how that turns out. Yeah, news at five. No, I think this is a really interesting public development in this direction. There's so much here.
I mean, this fund is going to start with $2 billion of capital. They've been experimenting with a smaller fund, about a hundred million dollars, so chump change in this space, for a little while now, just to confirm this actually works, because obviously you need to make sure that you're training within distribution and testing within distribution. So they've actually been using this at small scale.
One interesting note: the guy who's running this, and apparently it's a broader venture that's been spearheaded by their co-chief investment officer, is a guy called Greg Jensen, who I had never heard of before. But he has apparently worked at Bridgewater since 1996, and he actually put his own money down towards OpenAI's seed round.
OpenAI's seed round, mind you; I don't know, that's like 2015, I think, a long, long time ago. And then he was also part of the first money in for Anthropic. So this is a guy who's really tracking the space very closely. It should be noted, this is almost certainly not the first firm to experiment with this sort of thing.
When you think about who's leading the way on this sort of not just AI-enabled but basically automated trading strategy, you think about Renaissance Technologies' Medallion Fund. That's probably the first one: a fund set up by former intelligence community people that's notoriously secretive and super, super AI-first. They've been doing things that gesture in this direction for a long time.
You think about Jane Street as well; there are a bunch of companies, a bunch of funds in this space, that you would expect to be just as forward-leaning, if not more so, on this. So this is just what we know about. Keep in mind, this is just the public stuff that we know about.
There has been interest, I think it was a congressional report published recently, talking about concerns over funds using AI to potentially disrupt markets, increasing the fragility of markets. But you can also think about the more extreme limit as these things get more and more capable, right? What's more useful than an AI model that just makes predictions about how stocks are going to move?
Presumably one that can actually proactively reach out to analysts to get their opinions to inform its predictions. And what's more useful than that? Well, maybe a model that can do that, but not just to get analysts' opinions, to start to shape their opinions, because that can influence markets. Yeah. You can kind of walk down that line of causality, ultimately giving your model more and more of an action space.
You get to the point where you start to worry about these things placing shorts on very unlikely catastrophic events or things like that, and then making them happen. This is a threat model that sounds out there today, well, actually less and less so, I guess, but it has at least been tracked by a couple of people who, let's say, would be interested in the space.
So increasingly, hedge funds are becoming a source of potential AGI-ish risk as you see them start to reach further and further into action space, in the direction of fully automated systems. They have an incentive to work like crazy to just pump that bottom line, and those incentives aren't great in the context of technology that gives you more and more of a potential destructive footprint. So, interesting nonetheless.
This guy's got some deep connections, apparently, to Anthropic and OpenAI; hopefully that makes them more aware of some of the safety risks here too. But we'll see what happens with this not-so-large fund, this medium-sized fund, because Bridgewater's total assets that they're tracking are over a hundred billion right now. Yeah, definitely. And as you say, we only know so much about what hedge funds are doing with this stuff.
I think we are all aware of these things, right? It's all basically The Big Short. And that's our last story. So thank you so much for listening to this latest episode of Last Week in AI. As we always say, you can find those articles and subscribe at lastweekin.ai. You can also reach out to us with your feedback; we include the emails you can use in the episode description.
As always, we do appreciate it if you share the podcast with your friends or enemies, and if you review it to give us some feedback, even if it's negative, hopefully constructive, feedback. And now we are done, so do enjoy this fun AI outro song. Make it fun, you see, from AI agents to Chevron deference with glee. Talking Moshi, it's the buzz indeed. AI agents making moves with speed. Chevron deference, let's get it clear. For all you AI fans, the best is here.
Last week in AI, here we go Talkin on Java, now don't you know Gen 3, Alpha 2, so fun Get ready for the wild ride From the labs to the street, tech news flows Stay connected, see what tomorrow holds We've got your AI updates on the rise Innovation that'll mesmerize So grab your headphones And join the stream For the latest information Machine learning dreams. Learning dreams.