Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we did not cover in this episode. I'm one of your hosts, Andrey Kurenkov. I finished my PhD focused on AI at Stanford last year, and I now work at a generative AI startup.
And I'm your other host, Jeremie Harris. I'm the co-founder of a company called Gladstone AI, an AI safety and national security company. And actually, you know, I'll just mention this real quick: we are currently looking for help with business development in the US Department of Defense. So if you are interested in that kind of thing, hit me up. You can send us an email, hello at Gladstone.
Okay. Just one quick thing I'll mention there: we are literally the first company in history to deploy a GPT-4-powered application to the US Department of Defense. And we have training products that are currently training the top leaders at the US Department of Defense, and that have also been used to do briefings for, like, cabinet secretaries and folks at Homeland Security and the State Department, all that jazz. So if you want to make a big impact in AI safety, AI national security stuff, AGI — that's all kind of our focus. Yeah, hit me up. We'd love to chat. Anyway, that's my pitch.
That's a good pitch — I think it sounds pretty exciting. And, as always, we'll include a link to that email and any other stuff in the episode description, so if that's intriguing, do feel free to follow up. Also, as always, feel free to email the podcast with any thoughts or suggestions at contact at lastweekin.ai. We always love to hear your feedback, or pointers to stories you think we missed — sometimes we do pick those up — or any corrections. We try and include that.
This is going to be a bit of a quicker episode. We kind of have to get through it quick, so we're going to just go ahead and jump in, starting with the Tools and Apps section. The first story is one of the major news stories of this week, which is that Google has released Gemini, its AI-driven chatbot and voice assistant. This is really kind of a rebrand of everything Google has released so far.
So Bard and Google Assistant are pretty much gone, or replaced seemingly by this new smartphone app, Gemini. And Gemini is essentially very much like ChatGPT, so it's a chatbot. You can talk to it, it can generate images, it can output whatever ChatGPT does for you, like writing emails or writing code, etc., etc. I have not played around with it yet, but I have read various people's takes, and the general consensus seems to be that it is about ChatGPT-level quality. You know, there are some cases where it doesn't do so well, some cases where it does, but in general it seems to be about on par. So I guess, pretty big day for Google.
Yeah, definitely. And this is, as you said, part of that Bard rebrand. We've known a lot about Gemini, actually, for a while now. This is, I think, the first time I remember personally knowing about the benchmark performance of these models before they were actually publicly released. You know, usually with, like, GPT-4, the thing gets released and then we learn about the benchmarks all at the same time, and the technical report comes out.
But we've known for months now that Gemini was coming up, and we knew that in one instance it was hitting, like, the state of the art in 30 out of 32 benchmarks that they were tracking. And interestingly, that was ten out of 12 of the popular text and reasoning benchmarks that they tried, but nine out of nine and six out of six of the image and video understanding benchmarks that they tested, and five out of five of the speech recognition
benchmarks. So on the multimodality side it seems to be really especially strong, let's say, particularly compared to the likes of GPT-4, which for now it's competing with. But I think one of the most interesting consequences of the Gemini release — or the Gemini Ultra release, because that's really what's new here; we had Gemini Pro and Gemini Nano out before — is that with Ultra, the pressure is now on OpenAI to accelerate their timeline on releasing GPT-5. You know, they want to maintain their monopoly, which is the only thing that allows them to defend — not their higher price point, but their margin. They're gonna have to come out with the next version. We know the next GPT has been in the works already, so I think this is only gonna accelerate that timeline, and the race continues.
You can pay for the more powerful version, Gemini Advanced, which is powered by their Gemini Ultra model, and that's $20 a month — so pretty much the same monthly subscription price as ChatGPT Plus, and it comes along with their, like, pro-tier features. You do get a two-month trial from Google. So, you know, probably a lot of people are going to start trying this out as an alternative to ChatGPT's subscription service, and as you say, I think it does result in some kind of pressure on OpenAI from here. And just a little bit more detail: I think qualitatively it has a slightly different sort of tone to it. It seems to be a little bit more friendly and assistant-like than ChatGPT, which has more of a neutral voice. And one final caveat: this is released now to English speakers in more than 115 countries and territories, so not yet, I guess, everywhere Google is. But if you're an English speaker, you can go ahead and download it and try it out now.
And on to the other update to a chatbot that we got this week. The next story is "Copilot gets a big redesign and a new way to edit your AI-generated images." So Microsoft has its own ChatGPT-type thing called Copilot that has been continually updated and improved over time, and now there is a redesign for it, kind of coming with an anniversary of sorts for Microsoft's efforts in the area. So yeah, it's nicer, cleaner.
There's also been an update to the mobile version, and there is a new feature called Designer, which allows you to edit generated content by highlighting different areas of the image, blurring backgrounds, adding unique filters, etc., etc. So yeah, another instance of the kind of, I guess, chatbot tool landscape really maturing in a sense. And now you have a few options for something like ChatGPT which you can choose from.
Yeah. One of the things I always find interesting in these new kind of re-skins or launches or whatever is looking at, well, what are the things that they're making available for free versus what are they charging for, and how much are they charging for them? I think it's fair to say, you know, at the time Andrey, you and I started to record this podcast together — actually about a year ago — it was really green. Like, it wasn't clear how much you'd charge, for example, for something like ChatGPT. Well, we now know the answer: it's about 20 bucks a month, right? But then, you know, with this next generation, especially when we go multimodal like this tool does, you start to ask: what are the freemium offerings like? What do you get for free? What do you have to pay for? And in this case, Designer is free for everybody to try out, but you need a subscription to Copilot Pro to get access to some extra tools. And so what are those tools? What is behind the paywall, at least for now? What are they testing? Well, they're testing out putting behind a paywall the ability to resize generated content and regenerate images into either a square or landscape orientation.
And I find that interesting. Like, how would you kind of split the baby and decide, okay, that's what we're going to do? They must be seeing something in the use cases on the back end that orients them in that direction. But anyway, something to watch. I wonder if this will become another pretty standard dividing line when you look at these sorts of offerings. But right now, yeah, it's all part of Microsoft's obviously kind of holistic integration of AI into their products and trying to draw people in to Copilot Pro as much as possible. It's an interesting long-term play, and it's consistent with their big focus on, you know, commoditizing the complement — making the software the thing that you pay for or charge for — and Copilot is a big part of that right now.
Yeah. If you look at their description of Copilot versus Copilot Pro, it's kind of interesting. It says: access GPT-4 and GPT-4 Turbo during non-peak times for Copilot, versus, for Copilot Pro, you get priority access to GPT-4 and GPT-4 Turbo during peak times. So essentially there's this kind of hidden aspect there where secretly the results you get might be from an inferior AI model, unless the GPUs just happen to be free — no one's using them — in which case you would get the good results even if you're not paying. So, yeah, it seems to be starting now where you pay to ensure the best results and to have more of a budget, for instance, for creating more images or editing them in various ways, kind of across the board, really.
And moving on to our lightning round, we open with "Arc Search's AI responses launched as an unfettered experience with no guardrails." So this is a story — I think we first covered this maybe a couple of weeks ago. There's this company called the Browser Company, and they have a browser called Arc. And the focus here is essentially on, you know, productivity enthusiasts — it's a clever way of organizing your browser experience to make you more productive. And their new iOS version comes with this Browse for Me feature. Essentially, this is an AI agent where you give the agent a simple query that you want it to look up on the internet for you. So you want to, like, look for the Gladiator 2 trailer, and it will identify, in that case, you know, the specific YouTube video.
So it won't show you a whole bunch of options. It'll just kind of go through the internet, figure out which links make the most sense and surface those, and then do kind of more complex tasks than that for you as well — do some research, that sort of thing. It's a really interesting idea. And they rolled this out, I think, earlier this week, except we're only now finding out, actually, that they had no guardrails in place.
Right? So we're used to, like, ChatGPT — you know, ask it how to make a bomb and it'll politely say no: as a large language model trained by OpenAI, I won't help you do that. This model does not, or did not, have any guardrails. So there was a journalist here who tried really everything under the sun. There are some unsavory things that they apparently didn't include in the article, but, for example, they asked for help hiding a body and got some responses. Some, like, yeah, decent-ish things, ultimately kind of weird answers, like abandoned buildings or beaches with secluded areas, that sort of thing. But, yeah, I mean, it really was overly helpful. So now Arc is kind of coming out and saying, oh, mea culpa. You know, like, we screwed up here. We're going to put some safeguards in place.
Seems like as of the time of writing this article, they still had not. But, you know, I think, first of all, this whole Arc situation is fascinating from a business standpoint. You know, what happens to search engines when AI-powered agents are doing the searching and therefore ad revenue isn't a thing in that sense? Like, do we see continued use of search engine APIs, do they keep allowing that? But then separately, like, holy crap, there are no guardrails. Like, this is actually still a thing that we're doing — we're launching services with no guardrails. Kind of interesting. So anyway, cool article, worth a read if you're interested in that sort of thing.
This article highlights quite a bit about this — it's an interesting read. I think I also liked that it highlighted misinformation and hallucination regarding, for instance, medical advice. So that is an issue here as well. So yeah, interesting to see this launch fast and run into some of these potential negative or problematic cases.
And I guess they did promise updates to come to probably address some of the, let's say, things that they probably should limit or in some ways kind of prevent. On to the next story: "Brilliant Labs' Frame glasses serve as a multimodal AI assistant." So this is another announcement in a new category of device, I guess, which is this wearable AI thing, you could say. We've had a few of these already. We've had the Humane AI Pin, which was a little pin that has a projector and you can talk to it.
Meta has its smart glasses that have AI integrated — you can speak to it, it can do translation and various things. And now there's a new company in the fray called Brilliant Labs that is launching, or announcing, Frame, which has an integrated multimodal AI assistant. So pretty similar in concept to what Meta has, in terms of being glasses that have a camera; you can use various AI assistants by speaking to it, and it can also play around with images generated from a model.
So to me it's interesting to highlight just yet another example of a company trying to see if there is a new type of device that brings AI to you via something sort of wearable.
Yeah, it's always an interesting question as to whether it's the hardware itself — like the form factor of the device — that is the limiting thing, or the software ecosystem around it. You know, obviously, famously, Google Glass flopped when it first launched. We're in a different era now: generative AI, and in particular, as you said, multimodal AI, makes certain things much more possible and usable.
Interesting to note: these glasses, they look like eyeglasses — they're about as thick as ordinary glasses. They show a photo of them in the article. Kind of interesting. So we'll see if that really does it. I mean, people don't want to feel necessarily like they're walking around with a thing on their face. I'm trying to remember what happened with Snap's famous sunglasses — I don't see those around a ton, but that might just mean that I'm, you know, now into my 30s, so that's more the thing. But interesting roster of investors, too, that they list here — a whole bunch of really impressive people, including Brendan Iribe, the co-founder of Oculus, and Eric Migicovsky, who is the founder of Pebble, as they note in the article, and also a partner at Y Combinator, so really early-stage focused there.
And certainly Eric is a very knowledgeable hardware guy, from what I remember of him at YC. And anyway, other core team members from Oculus — so really well backed, well advised, and an interesting one to watch. You know, there are, as you said, a lot of products like this hitting the shelves. Obviously, Rabbit is another kind of vaguely analogous one — it's a portable device, not on your face or anything — but we're seeing more and more of the hardware meeting the software when it comes to AI.
Next story: "Stability AI launches SVD 1.1, a diffusion model for more consistent AI videos." So this is about an update to Stable Video Diffusion, moving it to 1.1 from 1.0. I think we covered 1.0 just a couple weeks ago, and this is pretty much just making it so you can have more consistent video generations — just an improvement, really. But it does highlight, I think, a movement in video generation becoming more commonplace and, I think, making pretty rapid progress now. And probably for the rest of this year we'll see a lot of text-to-video getting better, getting commercialized, maybe starting to be used for various applications.
Next we have "OpenAI launches ChatGPT app for Apple Vision Pro." And actually, this ties to the story that we were talking about earlier with those glasses. So Apple Vision Pro, this is an augmented reality headset, right, that Apple launched — this was, I guess, now last month; we talked about it back when it launched. And now they're looking at integrating GPT-4 Turbo with this headset. So basically it would allow you to chat with, you know, ChatGPT, if you will, using images that are collected by the headset. Right? So it's this very kind of seamless integration. There's also an audio interface, so you can dictate to the headset. So now you can sort of talk to your AI all day long. It's, yeah — I mean, it's an interesting development. It's also another app.
ChatGPT becomes one of many apps on this platform — there are 600 new apps that have been built for visionOS, which is the operating system that powers Apple Vision Pro. And it's interesting that these apps can take advantage of a couple of different features that are built into this ecosystem. There's one, I think it's called Optic ID, which is biometric authentication — it uses eye tracking and, like, iris recognition, so pretty cool. Spatial audio, which creates realistic, directional audio effects, so you can hear stuff that's coming from the right direction. And then VisionKit, which allows people to create apps that can actually, you know, be multimodal — taking in images and audio and all that. We don't know if ChatGPT will actually be using these. So it's not necessarily clear exactly how it's integrating at the software level with Apple Vision Pro, but I would guess it'll be using some of these ready-made handles. Yeah. So another instance of multimodality really, you know, hitting the mainstream, and with hardware to support it.
And it kind of reminds me — I mean, we are not going to have the Apple Vision Pro release itself as a story, since it's not exactly AI. But at the same time, behind the scenes, beyond language models, the spatial computing, as Apple likes to call it — tracking the space around you, being able to understand where you are and how you move — all of that is AI. So really, Apple Vision Pro is a pretty impressive example of vision AI being able to understand the space around you, getting really advanced and getting to a point where you can do real-time, very sophisticated understanding. And, yeah, I think if you have a couple thousand dollars and you actually got this thing, I guess now you can use ChatGPT in a nice interface. On to Applications and Business.
The first story is about self-driving cars, and in particular a new incident in San Francisco in which a Waymo robotaxi hit a cyclist. Now, before you worry, this is, relatively speaking, not a huge incident. So what happened was an autonomous Waymo car hit a cyclist, causing minor injuries. The cyclist actually left the scene, and this was reported to the police and relevant authorities, I guess, right after. And there are some details about the story.
Basically, this was a slightly confusing traffic scenario where there was a four-way stop. There was a truck that kind of moved through its stop sign, and the Waymo started moving, but there was a cyclist behind the truck that was hidden by the truck. So they wound up in a kind of small collision; the Waymo did brake hard to try and avoid it. So, unlike the incident last year with Cruise — that, of course, sparked a huge change in the, I guess, fate of that company — in this case it doesn't seem like a terrible kind of outcome. But I think it's worth highlighting as an example of something that will inevitably happen as Waymo expands to more and more territory — stuff like this will just happen. So this is, I think, the first really widely reported story of a collision involving a person and a Waymo. And it's interesting to see that in this case, it turned out to be maybe not the worst kind of story for them.
Yeah. And you can really see they were, you know, trigger-happy to call the authorities, as they should be, after something like this happens. But my guess is, at the very highest levels, they'll be extremely aware of what happened with Cruise and trying to draw as sharp a contrast as they possibly can — I mean, it's a PR problem at that point. A couple of interesting notes about the context of the accident, too. We don't have that much data about it, but it does seem like it was a pretty clear day, broad daylight — it was around 3 p.m. — and the intersection apparently is pretty flat. So, you know, no obvious mitigating factors. It seems like the cyclist might have come out of nowhere, or at least that's how a human driver might interpret it. But, yeah, really, it's obviously really hard to decide.
This is the classic philosophical quandary, you know: how do we weigh an AI-induced accident relative to human-induced accidents? But yeah. So one of the things they do mention that I thought was kind of interesting — I wasn't tracking this — so Waymo apparently has tallied just over 7 million driverless miles, which is a little bit more than Cruise, which is at about 5 million miles. I didn't realize that they were so close in terms of the data that they've collected.
So kind of interesting. Apparently humans cause, on average, one death about every 100 million miles driven. So, you know, if we're looking at these sorts of accidents happening in the low millions of miles, it does raise certain safety questions. This will obviously improve as more data is collected and all that, but it's an interesting little benchmark.
That's right. I think, behind the scenes, there's been more movement and more kind of conversations on the regulatory front, stemming from the Cruise accident and just generally some of the disruptions happening in San Francisco traffic due to Cruise and Waymo. So this adds to that, in a way, where now there's another incident that will be kept in mind and, I guess, inform some of these conversations. It'll be an interesting year, I think, for self-driving — we are kind of still pretty early on. Waymo is trying to expand to Los Angeles, is trying to expand the service area in the Bay Area beyond San Francisco to a much larger swath that's going to include highways and freeways and smaller cities. So it'll be, yeah, kind of a pretty pivotal year for Waymo, potentially. And if it was a worse accident, that could have been a real change of fate. But it seems like this didn't wind up going down that route.
And moving on to our lightning round, we're opening up with "Canon plans to disrupt chipmaking with low-cost stamp machine." Okay, so we do this every once in a while where we do a bit of background — and we still need that dedicated AI hardware episode. But right now, the world's most advanced semiconductor lithography machines are extreme ultraviolet lithography machines, EUV machines.
These are the things that a Dutch company called ASML makes, that they sell to chip fabs like the Taiwan Semiconductor Manufacturing Company or Samsung or SK Hynix. And then those companies use those machines to make the chips — you know, the chips that power GPUs, and so on and so forth. So these lithography machines are pretty far upstream of the whole chip ecosystem. They are critical and they are super expensive.
So ASML, which is by far the leader in the space — each of their machines will cost on the order of, like, $100 million plus. These are massively expensive machines, huge, literally tons to deliver. So Canon is coming in and saying, well, wait a minute, maybe we can come up with a machine that is more efficient. So what these traditional EUV machines do is they fire a crazy high-intensity, high-frequency laser to, well, roughly speaking, etch features onto the semiconductor wafers to etch in those circuits. What Canon is doing is saying, well, what if we try kind of a stamping strategy instead — stamp chip designs onto silicon wafers rather than etching them in using high-frequency, short-wavelength light? So this new strategy is pretty controversial. They've been at it for, like, 15 years, but they seem to be thinking that
this is now starting to mature. They think it's ready to kind of coexist with extreme UV lithography machines and compete in the market. I was surprised at this: they say they're actually going to be starting at the five nanometer resolution. So that five nanometer process, for context, is the process that led to the Nvidia H100 GPU. That's almost the best GPU on the market right now, at least from Nvidia. And so they think they can already kind of get down there. They think they can get all the way down to two nanometers, which is the node size we'll probably be at, like, a year and a half to two years from now. So, you know, they're ambitious here. The big question, any time you look at these sorts of machines, is yield. You might be able to make some of these chips using these technologies.
But if most of them are crap — if, like, you know, 50% of your chips end up being crap because your process is just unreliable — then, well, that means you're going to have to make twice as many chips to produce the right number of quality chips. So that's the big question right now. It seems like yield is a big challenge for Canon. There are big hints that the yields aren't necessarily
great. They came out and said in this release, quote, "in regard to defect risk, I think our technology has largely resolved the issue," which isn't exactly inspiring. But they're doing their first deliveries now for a trial period. So this is one to watch. I'm not super bullish on it, but if it ends up working, this sort of thing could be the type of thing that allows you to make these machines much cheaper — it's apparently, like, 90% less expensive to produce at scale, that's their claim, than these ASML machines.
And that, of course, would have major downstream impacts on AI, because a lot of the cost for setting up, or at least starting to work on, a major project is getting access to a lot of chips, right? Getting access to a lot of GPUs, a lot of upfront investment — unless you're trying to get access to chips or GPUs in the cloud, but even that's actually pretty hard now. It's not easy to get a ton of compute; you know, there's a lot of competition from people trying to get that.
So if we do get a breakthrough — I guess you could call this that — it could have major impacts in presumably a year or a couple of years. But we'll have to see. Next story, also dealing with hardware: "US industry group calls for multilateral chip export controls to address disadvantage relative to Korea and other allies."
So this is all about how the US, as we have covered quite a bit, has pretty stringent export controls and regulations regarding how chips from the US can wind up in China, and also has prohibitions on US support for advanced fabrication facilities in China. So there is, this group argues, a disadvantage, because US companies cannot use some of these fabrication options in China, whereas companies from Korea, Japan, Taiwan, Israel and so on can. And yeah, this is kind of just highlighting that it has asked the Bureau of Industry and Security to, quote, do all that is possible to establish new controls that are better for the US in some way.
Yeah. And I think this was sort of the natural consequence, right, of the US coming in and setting up new export controls. You usually do this unilaterally in this way to get it done quickly, and then, you know, you kind of bring in the wider community of international stakeholders. Japan has already kind of set up their own pretty stringent export controls to mirror the US. So, you know, they list them in the list here.
I'm not so sure that they're maybe the first one you would list, but they are being cited by this Semiconductor Industry Association that's actually written this comment to the Department of Commerce. Yeah. So I think that's something that gets resolved over time, but it's worth flagging. Like, the export controls make a ton of sense, at least in my estimation. I don't like them, you know, as much as anyone.
Like, I think it sucks, because it does hurt industry, and this is really us seeing how it's hurting industry and how industry expresses that. But we did hear from the Under Secretary of Commerce for Industry and Security — this is Alan Estevez — he came out and said, look, basically, we're working on it. We're already in preliminary talks with South Korea to set up a new export control to cover a whole bunch of stuff, including semiconductors, quantum computing, and so on.
So this is in train, but in the interim, you know, the US had to move really fast to kind of shut a lot of these loopholes that existed after the previous round of export controls, and this is just the collateral damage that you might expect.
And actually, one more story about export controls. And this one is a little bit more exciting — you can imagine this being in a movie or something. The story is that the US blocks a shipment of 24 Nvidia GPUs to China over concerns about a self-driving truck company. So the US Department of Commerce has literally prevented 24 A100 GPUs — not even, you know, some crazy amount; this is a relatively small number of, like, medium-strength GPUs — from being shipped to the Chinese self-driving truck company TuSimple. And this was actually intended for Australia, but the concern was that they might wind up in China. And apparently this company, TuSimple, has been under scrutiny from the US government for years and has been investigated over foreign investment and accusations of espionage. So, kind of dramatic, I guess.
Well, yeah. And to your point, that was exactly why I thought this was so interesting: the 24 A100 GPUs. You know, the fact that the Department of Commerce is on it to the point where they're tracking down — you know, you can't sneak two dozen A100 GPUs out of the US. It's good to see. So, yeah, apparently there's all kinds of back and forth here. The argument being made by TuSimple is: look, we have a subsidiary in the US. The reason we want to send these out to Australia is just that the subsidiary is kind of shutting down; it won't be able to use them anymore. Which, you know, you hear that and you go, okay, yeah, you know what, why don't you let them ship it? But ultimately, it seems that, according to people familiar with the matter, as they say, the CEO personally did want to get the GPUs to China. That's the claim.
Apparently these orders were not kept in writing, but his assistant coordinated with TuSimple's Chinese office to ship the 24 A100s to Australia. And apparently the company's lawyers even gave input on the shipment, saying it was illegal to send the GPUs to China, but not to Australia. So sort of interesting. It does seem, at least based on this report, that the intent really was to circumvent the export controls. Again, I just think it's laudable that they're on it to the point of tracking down those 24 A100 GPUs. It's a pretty big achievement for BIS.
And one last story in the section, this one also about hardware and GPUs. Nvidia reportedly selects Intel Foundry Services for GPU packaging production and could produce over 300,000 H100 GPUs per month. So that's the story. Packaging is one of the kind of major bottlenecks that we've covered and has been, yeah, I suppose an issue or one of the challenges in scaling up and keeping up with demand.
So this is just telling us that it seems like there's movement on that front, and Nvidia could be setting up to produce a lot of these very high-powered H100 GPUs, which a lot of companies want to get their hands on.
That's right. Yeah. It's an unconfirmed report, but just based on the details, it does seem plausible, and it would be an interesting story. So, for context: TSMC, of course, as we've covered many times, is the best, most advanced semiconductor foundry on planet Earth. They're the only ones really able to do the three nanometer process that gives us the current version of the iPhone, for example — the iPhone chip. They also, though, are leaders in their packaging
technology. So once you fabricate a chip, right, you still need to package it together with a bunch of other components and other chips to make, for example, the GPU — the processor that you're actually going to sell. That packaging step is not as specialized as chip fabrication itself — it's not like there's literally just one company that can do it, which is the case for chips at three nanometers right now — but it's still reasonably close. So the latest technique is called chip-on-wafer-on-substrate, or CoWoS. You'll hear us talk about that quite a bit, especially in the future. This packaging step is required for some of the most advanced processors that are produced: you need really good chips and they need to be packaged really well. The packaging process has turned into the bottleneck — sort of in mid-2023 that started to be the case.
Prior to that, it had been more the chip fabrication stuff, but now chip fabrication is no longer the bottleneck; now it's more on the packaging side. So at this point Nvidia, which, again, is trying to churn out as many GPUs as they possibly can — they need all the packaging production they can possibly get. And they've already basically saturated TSMC; TSMC has no more capacity right now. They will soon — by the end of 2024 they think they'll double it — but for now they don't have that capacity. So Nvidia is now looking to other sources, and Intel is keen to compete with TSMC. Gelsinger, the CEO of Intel right now, is trying to pivot the company in the direction of doing just this. So this is a great way for Intel to position itself to kind of get a little bit of a taste of the market, start to draw in some opportunity to develop a slightly different packaging process
than TSMC's. So all this is going to have to get tested out in kind of a shakedown cruise. But this is also super, super consistent with Nvidia's strategy in the market. Their game plan is to be super aggressive, to crowd out everybody else, to over-order if they can — just fill the capacity of everybody in sight, on packaging, on chip manufacturing. So they really, really want to go in and get all that Intel capacity that's just opened up. That's good for Intel, good for Nvidia, and really bad for folks like AMD, for example, if they are also packaging-constrained — you know, now there's less room at Intel for an AMD order. And that's really how Nvidia has managed to pick up so much steam. So all of this, again, grain of salt, unconfirmed reports, but it wouldn't be surprising and would be good news for Intel if this went forward. Apparently the deal is for something like 5,000 wafers per month.
So, anyway, wafers are these things that you make the chips on. It would add up to about 300,000 Nvidia H100 chips per month. So that's a hell of a lot. And, yeah, we'll see if this goes through. This would be an interesting next step for Intel.
Clearly, as listeners can tell, Jeremy is much more of an expert on this topic than me, so you got all those details. But as I always like to say, hardware is so important, and we keep going back to Nvidia GPUs just because it is at the foundation of trying to get GPT-5 or whatever, right? OpenAI and others care about hardware a lot because of this. And it seems like also, just in terms of future impact, having a potential other source for their production, having a more diversified supply chain, is very meaningful for Nvidia. So this is a pretty significant development, if true and if it works out. Moving on to Projects and Open Source. And our first story is "Allen Institute for AI launches open and transparent OLMo large language models." So the Allen Institute for AI, or AI2, has been around for quite
a while. It was created by Microsoft co-founder Paul Allen and has done a lot of stuff in AI over the years, and OLMo is their latest, I guess, big initiative.
It stands for Open Language Model, and it is billed as truly open source in the sense that they have released the model itself, but they have also released the training code and the data for the training, which is in contrast to pretty much anywhere else you look — Llama, Falcon, etc. — where you might get the model, but you usually do not get the code and you usually do not get the data. And the data here is its own story, really. It's this Dolma dataset, which features more than 3 trillion tokens — so a new open-source dataset for training large language models, alongside this new open language model and an overall framework with code for inference, code for training; they also release training metrics and training logs. So just more than any other open-source release to date, pretty much. And they have released OLMo 7B specifically — so kind of a small large language model, you could say. Generally we see, like, 3B, 7B, and 7B is at a smaller level where it's still kind of ChatGPT-ish, but usually not capable of advanced reasoning or a lot of stuff you would want to do. But still, you know, pretty significant. So yeah, an exciting day for open source with this release.
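To make concrete what "getting the model" looks like in practice, here's a minimal sketch of loading OLMo 7B with Hugging Face transformers — assuming the weights are published on the Hub under an ID like allenai/OLMo-7B (check the actual model card; the exact ID and any extra integration package you may need are assumptions here, not confirmed details from the story):

```python
# Minimal sketch: load an open checkpoint and generate a few tokens.
# Assumed model ID; the real release may also require an AI2-provided
# integration package or trust_remote_code for its custom model class.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumption -- verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Language modeling is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```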
Yeah, it's funny you said "small large language models," and I couldn't help but go, like, oh yeah, that is a thing we don't have a word for. But it is absolutely a thing — you're totally right. It's interesting, right? It's a continuation of this trend where it seems like people keep going, like, oh yeah, you think you're the open-source guy? Well, we'll open-source the code. And then they're like, okay, well, we'll open-source the dataset. And then they're like, we'll open-source a picture of me sleeping before I hit the run button. Anyway, it's just this escalation of how open can we be, and this is, I think, maybe where it ends up: everything is open. I think one of the really interesting pieces — we saw this with Stable LM 2, I think it was back in January.
That was the first time I was aware of this idea of a company that would release, like, a checkpoint version of the model. So in their case, they released the last training checkpoint, so that people could basically pick that up and keep developing it. Essentially, the idea there was, you know, you finish your training run and then you do a little bit of post-production — you do some fine-tuning, some reinforcement learning from human feedback — to kind of make the model well-behaved. And it turns out that by adding those steps, you actually make it harder for people to keep the training process going after that. And so in that case, they decided, yeah, to release the last training checkpoint. Here, the Allen Institute is releasing a whole bunch of training checkpoints — about 500 for each model, one for every 1,000 steps during the training process. So that's kind of an interesting additional bit of openness.
It's all Apache 2.0 licensed, so extremely open source, like, genuinely. Yeah. So that was interesting. To your point, yeah, absolutely, on the dataset — that's another dimension of it. And they're also releasing their Paloma evaluation framework codebase. That's another interesting one. We haven't seen that a lot, right? You release the model and you release the code you used to evaluate the model from a safety standpoint.
Like, I really like that. If we're going to get the risk that comes from these folks open-sourcing more powerful language models, we might as well get them to open-source the evaluation framework, so we know how the model was evaluated and can have the open-source community at least take that apart. So, yeah, really interesting development. And I'm curious if this then puts pressure on Meta, you know, to get more open at their end too.
Right. And it kind of follows up on a little bit of a trend we've started to see this year of a lot of movement in this smaller large language model space — we had Phi-2, we had a couple other ones we covered in recent weeks. So at 7B, this is in that category. And I think, you know, if you're thinking through the implications, this is probably most impactful for researchers, just because you have full access to the training code, to the data, to the checkpoints. So I would expect to see some papers coming out looking at interpretability results, or maybe, you know, just fundamental understanding of how language models train and how their training dynamics evolve, things like that. This would go a long way towards enabling that and is a real boon for the academic community, and potentially R&D, but I think probably less impactful on the industry or startup front. And now on to a large large language model.
The next story.
Well, I think this next one definitely qualifies as large. We are talking about Smaug-72B, which in this article's headline is called "the new king of open-source AI." And that is because it is now at the top of the Hugging Face LLM leaderboard, which combines a bunch of different benchmarks. This was released by the startup Abacus.AI, and it is really a fine-tuned version of a previously existing model called Qwen-72B, another powerful language model released just a few months ago. So I guess the interesting bit here is this startup took that released model and fine-tuned it, trained it some more. We don't have a paper yet — they said they would work on a paper and disclose more of the details of how they got there — but with this release it is at the top and is on par with, or perhaps even better than, GPT-3.5 and Mistral Medium. So yeah, another really good large language model is now open source and available for people to build on.
Yeah, I'm going to ring the bell again — or beat the drum again — of, like: at a certain point I start to wonder, you know, about all these language models coming out. And it's less the case for this one, because it does seem to be field-leading, but some of these are more like, "it does really well compared to the competition at 7 billion parameters" or something. These companies that are releasing these models — these are not cheap to train.
And it makes me really curious about whether that strategy keeps holding up. But one interesting thing to note about this is — so yeah, it's sort of built, fine-tuned, by Abacus.AI from a model that was originally made by Qwen, which is a team at Alibaba. This is itself kind of interesting, right? I think it's the first time I'm aware of a prominent Western effort, especially one that was this effective, that built on a base Chinese model. And this has you asking questions about: what are the licensing terms for Qwen-72B? What are some of the data poisoning risks as well? Right? I mean, presumably Alibaba can't just train a model that, you know, will talk about the Tiananmen Square massacre, for example, and things like that. So, you know, how does that interact with this, and is Abacus now kind of taking on some of those properties? I'm really curious to dig into this aspect.
I haven't, but this is something that you always, you know, have to start asking about. 01.AI, when they released their latest model, famously bound you to the Chinese legal system — so if you had an issue with it, you'd have to adjudicate in a Chinese court. So all of these things, like the open-source dimension, this is actually important for, like, the international strategic playing field now. And, anyway, I'm curious to see what comes out of this dimension of it with Smaug-72B.
Worth highlighting: this leaderboard is looking at various benchmarks — ARC, HellaSwag, MMLU, and so on. So as we have said, generally speaking, benchmarks that don't rely on human rankings are indicative of performance, but at the same time they are not kind of the full story. Sometimes when you actually use a model, you find that it's bad at certain things, good at other things. So it leading the pack in these numbers doesn't mean that it's necessarily better than GPT-3.5 if you were to use it, but still a pretty significant outcome. And it's interesting, if you look at the board, you know, not just Smaug-72B — all of these things at the top are some sort of work on top of an existing open model. There are other 72B LoRA and DPO fine-tunes — just people playing around with additional training and improvements on top of what's been put out.
And yeah, this is showcasing what happens when you open-source: you know, people can take what you do, improve upon it, put it out there, someone else improves upon it, and this is where we get to. And on to a lightning round. We start actually with the announcement of Qwen 1.5. So we just covered how the initial Qwen was the basis for this Smaug-72B.
And it just so happened that Qwen 1.5 was announced, again with various sizes, with up to 72 billion and as small as half a billion parameters. The update has a variety of stuff. It has quantized models, which are generally cheaper to run and generally quite good. They have a pretty large context length of 32,000 tokens, whereas looking back a while ago, it used to be like 2,000 to 4,000 was the standard. And they're really competitive — especially the small ones are competitive in this smaller large language model field. So another, I guess, cool update, and interesting to see the Qwen team continuing to expand and put stuff out there.
Yeah, I think two really interesting notes at my end here. So first off, this update includes quantized models. So when you quantize a model — basically, you know, let's say you trained the original model where each of the weights has, like, 16 bits of floating-point precision. Now, you don't necessarily need that much resolution, essentially, in your numbers. So you can quantize that — basically reduce the resolution with which the weights are described in the model. You usually lose some performance because of that, but it makes the model a lot smaller, so you can pack it onto smaller edge devices, and it runs faster and cheaper and so on. So they're including int4, four-bit integer quantization, and int8, eight-bit integer quantization. That's sort of interesting in and of itself.
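To make the quantization idea concrete, here's a minimal sketch of symmetric int8 weight quantization in Python — illustrative only, not the Qwen team's actual scheme (real releases typically use more sophisticated methods like GPTQ or AWQ):

```python
# Minimal sketch of symmetric int8 weight quantization.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 values plus a per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0            # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)         # pretend these are model weights
q, scale = quantize_int8(w)
w_approx = dequantize(q, scale)
print("max abs error:", np.abs(w - w_approx).max())  # small but nonzero: the lost "resolution"
```

The int8 weights take a quarter of the memory of 16-bit floats (and int4 half of that again), which is exactly the "smaller and cheaper to run" trade-off described above.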
It's a trend we've seen more and more as well. And the second piece is that they do an additional bit of fine-tuning of the model on human feedback. And, you know, if you're a long-time listener to the show, you're familiar with the idea of reinforcement learning from human feedback — that was, of course, the way GPT-4 was trained, the way ChatGPT was trained when it first launched. But now folks are kind of moving on to this thing called DPO, direct preference optimization, as well as proximal policy optimization, PPO. These are strategies that are a lot more efficient. I don't remember if we've explicitly talked about DPO and its implications on this environment — we will later; this is maybe something to punt on — but it is an important upgrade on top of reinforcement learning from human feedback.
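For the curious, here's a rough sketch of the DPO loss — assuming you already have log-probabilities of a preferred ("chosen") and dispreferred ("rejected") response under both the policy being trained and a frozen reference model. This is the textbook form of the objective, not necessarily exactly what the Qwen team implemented:

```python
# Rough sketch of the DPO (Direct Preference Optimization) loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # How much more the policy prefers chosen over rejected, relative to the reference.
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_logratios - ref_logratios)
    # Push the model to respect the human preference ordering.
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for a batch of 3 preference pairs.
loss = dpo_loss(torch.tensor([-10.0, -8.0, -12.0]), torch.tensor([-11.0, -9.5, -12.5]),
                torch.tensor([-10.5, -8.2, -12.1]), torch.tensor([-10.8, -9.0, -12.4]))
print(loss.item())
```

The efficiency win is that, unlike classic RLHF with PPO, there's no separate reward model and no sampling loop — it's just a supervised-style loss over preference pairs.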
It's noteworthy that it's now actually actively being used by, you know, Chinese companies, in their own internal efforts.
One last note from me, and this one is, I guess, worth highlighting — I didn't mention this initially. The high end, the largest version of Qwen 1.5, the 72B, destroys Llama 2 70B and is even better than Mixtral 8x7B across various benchmarks. So this could seemingly be kind of the leading edge in terms of what's been open-sourced, and it looks similar to Smaug-72B in terms of some of the numbers on the benchmarks. So in a way, it's interesting: Smaug-72B and Qwen 1.5 72B are both movements that smash the previous records. And, yeah, I guess we're just going to keep seeing this until we saturate the ability to improve via further fine-tuning or further tricks or whatever.
So an exciting week for open source.
Seriously. Up next we have "Hugging Face launches open-source AI assistant maker to rival OpenAI's custom GPTs." So here we have basically Hugging Face saying, anything you can do, I can do better. OpenAI goes and launches the GPT Store, and Hugging Face is saying, hey, you know what, we're going to kind of do the same thing, except our stuff is going to be free — it's going to be all based on open-source models. It's sort of analogous to OpenAI's custom GPT builder, which costs 25 bucks a month, right? So decent savings. You can build a new personal Hugging Face chat assistant, as they advertise, in two clicks. So you give it a name, avatar, and description, and you can choose any available open-source LLM — so think here Llama 2, think Mixtral, models like that. You can have a custom system message, like OpenAI's system prompt — at least I assume that's the idea here.
And, as well, different, you know, prompts to kind of start the text generation process. So I think it's really interesting. It definitely is derivative — it's clear the page itself looks a lot like the GPT Store page, as they point out in the article, even down to its visual style. They have custom assistants that are displayed, like custom GPTs, in their own rectangular baseball-card-style boxes with circular logos inside. And this is true. You know, I clicked through the link, and you can see, yeah, it very much looks like the GPT Store. So definitely recycling when you can recycle. But a cool and interesting development, for sure.
Right. Not necessarily competitive in terms of usefulness — these will not have some of the nice fancy features that you might get from ChatGPT, like retrieval-augmented generation or web search, stuff like that. So this is really just a way to build and play around on top of the open big models like Mixtral and Llama 2, although, you know, as we've seen with the previous stories, new open models come out all the time. So it could be that in a month there is an extra cool model and someone can just spin up or upgrade their chatbot. So, an interesting little project by Hugging Face, I think, and we'll see if this gets some traction or not. And one last story in this section. This one is about MGIE, "a revolutionary AI model for instruction-based image editing" — that's from the headline. And this stands for MLLM-Guided Image Editing, where MLLM is a multimodal large language model.
And it is a tool that was developed by Apple in collaboration with researchers from the University of California, Santa Barbara. It allows people to edit images with text — essentially allowing people to just say, you know, make my image darker, or give it higher contrast, etc., and it goes ahead and does that for you using these multimodal language models. So that's pretty much the story. I think it's interesting to see Apple getting into the open-source game a little bit.
Typically they have not been in that space, but this is an open-source AI model that others can build upon.
Up next, we're in our Research and Advancements section, and we're going to open with a theoretical paper — I know everybody gets excited when they hear "theoretical paper." This one's called "Learning Universal Predictors," and it is from Google DeepMind. It's really interesting. It is theoretical, so let me just set the scene a little bit. There is a predictive process called Solomonoff induction. This is basically the most powerful universal predictor that we have.
Like, if you get a set of data that you observe in the world, this is the most powerful theoretical process that we have for figuring out what function is generating that data. Right? So you get a bunch of observations in the world, and you want to know: what is the thing that's causing all this to happen? What is the function that generates this data? You can think about it — I mean, that's kind of what physicists do, right? They look out in the world, they make a bunch of observations, and ask: what's the law of physics that produces this? It's the same thing that we do when we train AI models to, for example, learn to do text generation, right? They read all the text on the internet, and what they're trying to do is figure out what is the function that produced that text. In a very deep sense, they're kind of asking, what is the universe that
produced this text? So there's a very deep connection here with the idea of AGI, because in a sense, that's what AGI is all about. Given a set of images, how can we find a function that generates those images? Right — if we can, then we can generate those images. And again, you know, to really make the point on AGI here: if you get a set of observations about the physical universe, how can you find the function that generates those observations? How can you essentially decode the laws of physics that govern the universe themselves? So the idea of Solomonoff induction is — arguably, I mean, if you could make a Solomonoff induction machine, you would have an AGI, right? An efficient one, if it was actually scalable.
So now the big question that we're trying to answer in this paper is: okay, can we show that Transformers can actually approximate Solomonoff induction if we give them enough data, enough computing power? If you could show that, then you would have shown in principle that they really could allow us to get to AGI. And Solomonoff induction itself involves a couple of different things. There are three main ingredients. The first is, what it tries to do is kind of go, okay, I've got this data — what are all of the possible functions that could conceivably account for this data? Right. What are all of the possible, I don't know, theories of physics that could explain what I'm seeing? Or, for something more mundane, if we're trying to explain, like, the NASA moon landing, what are all the possible hypotheses that could explain it, right? One is that NASA spent a bunch of money to land a rocket on the moon. The other is there's a conspiracy theory, or a million different
conspiracy theories. Right. So essentially, this process of Solomonoff induction must consider all of the possible laws of physics, all the possible explanations. It is a massively computationally expensive task. In addition to that, it has to assign a probability, or like a weighting factor, to each of these possibilities. And that'll depend on, roughly speaking, how complex they are from an information-theoretic perspective. So the idea here is: the more bits you need to describe a hypothesis, the less you're going to weight that hypothesis. And this is just like the philosophical principle of Occam's razor, right? If you have a super complicated explanation for something relatively simple, that's probably unlikely to be true. And this is actually the reason why a lot of conspiracy theories are unlikely to be true — because they require, you know, a whole stack of things to be true at the same time. It's a very complex hypothesis, whereas often the reality is a lot simpler. Not always the case, but often reality is a lot simpler. And then the last piece: Solomonoff induction relies on Bayes' theorem — so it basically updates based on Bayes' theorem, which is the standard way that you, you know, consider new evidence, from a mathematical perspective, to update your hypotheses. Okay. So, this is hugely impractical.
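To put those ingredients in symbols, here's the textbook form of the Solomonoff predictor — a sketch of the standard definitions, not necessarily the paper's exact notation. Here M is the universal prior (a semimeasure), U is a universal monotone Turing machine, p is a program (a candidate "explanation"), and |p| is its length in bits, so shorter programs get exponentially more weight (Occam's razor):

```latex
% Prior weight over all programs whose output starts with the observed data x:
\[
M(x) \;=\; \sum_{p \,:\, U(p) \text{ outputs a string beginning with } x} 2^{-|p|}
\]
% Prediction of the next symbol is then the Bayesian/conditional update:
\[
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})}
\]
```

The impracticality mentioned above is visible right in the sum: it ranges over all possible programs, which is why exact Solomonoff induction is uncomputable and can only ever be approximated.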
It needs a ton of computation to execute as is. So the core question is: can neural networks approximate this process? And here I'm going to very coarsely summarize this paper and say that the answer is maybe. It seems like large language models — Transformers, and also LSTMs, which are long short-term memory networks; these are from the pre-Transformer days — showed, in some carefully designed scenarios, performance that's aligned with being able to support this kind of Solomonoff induction. So this is a piece of evidence in favor of the generality, the general-purpose capacity, of not just Transformers but LSTMs and other kinds of neural networks as well. So it's kind of a data point. You know, when we have these conversations about whether current Transformers are enough to get to AGI — from a strictly theoretical standpoint, this is a piece of evidence in favor of that.
There are all kinds of practical questions about, you know, how much in practice does it take to get here? But, anyway, sort of an, an interesting data point to put on everybody's radar as we try to think about what the future of AI looks like in the next couple of years.
Right. As you said, a very theoretical kind of math paper. This is from DeepMind. There are some results or implications you might take away. They do compare LSTMs and RNNs with Transformers, and Transformers, as we've been seeing for years now, generally seem to do quite well, and in some cases better.
So yeah, it's hard to know exactly what to take away from this unless you're thinking about the deep theoretical questions of machine learning and AI and neural nets, but a very cool paper if you're into that sort of thing, I guess.
Yeah, there is one more piece: they do study the scaling properties of these systems through the lens of Solomonoff induction. And they find that as you scale, the capacity to carry out tasks aligned with Solomonoff induction goes up. So it does provide, to some degree, an almost causal mechanism that explains why these systems get more general over time: they start to look more and more like Solomonoff induction machines.
That's, to some degree, a little bit more practical, but certainly not fully there yet. You're right.
And on to our second main research paper. This one is a little bit more on the practical, empirical side, although it does still study the properties of neural nets rather than try to get some good results or something like that. The paper is Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks. So as we've been covering for a few months now, Mamba is this new type of neural net that is different from the most popular kind, right?
Which, by default these days, is the Transformer. In the past half year or so, people have started looking at these state space models, which have some better properties in terms of how expensive they are to run, and which potentially can be as good as Transformers while being less expensive. So there's been some initial work on this front, including Mamba, which is kind of the best-to-date example of a state space model, or at least it was a few months ago.
And now people are starting to really explore their properties, and this is an example of that, where this paper asks: can these kinds of models do in-context learning? In-context learning is basically, given an explanation of what you want the model to do in the input, the context, can it then do that without being explicitly trained to do so? So in-context learning is super important.
This is the reason that large language models and Transformers are mind-blowing: without explicitly training them to do stuff, they just kind of can do it when you tell them to. And the paper's basic answer to the question "Can Mamba learn how to learn?" is that it can do in-context learning; it does learn it. It is worse at some kinds of things than Transformers, and better at some other kinds of things.
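Just to illustrate what in-context learning looks like in practice, here is a minimal sketch. The prompt below is a generic few-shot example of ours, not one of the paper's tasks, and the generate call is a hypothetical placeholder for whatever model (Transformer, Mamba, or a hybrid) is being evaluated.

```python
# A minimal illustration of in-context learning: the task is specified entirely in
# the prompt via a few examples, with no fine-tuning. This is a generic example of
# ours, not one of the paper's tasks; `generate` is a hypothetical stand-in for
# whatever model (Transformer, Mamba, or a hybrid) is being evaluated.
prompt = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
peppermint -> menthe poivree
plush giraffe ->"""

# completion = generate(prompt)  # hypothetical call to the model under test
# A model that "learns in context" should continue with something like
# "girafe en peluche", despite never being explicitly trained on this task.
```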
And this split is not surprising. There is another paper we are going to get into which shows that, for instance, if you want to copy-paste from the input, Transformers are better, which makes sense, roughly speaking. Anyway, this paper has some results on that front. It also shows that you can combine a Transformer and Mamba architecture to get the best of both worlds, where the hybrid is able to do the best on every kind of task among the roughly twelve example
in-context learning tasks that they looked into. So to me, it's pretty exciting to see more research into the properties of Mamba and, more broadly, into variations of neural nets that might be the next evolution of what we build on.
Yeah, it really also makes me wonder about the prospect of a mixture-of-experts type of architecture featuring Mamba and Transformers. I wonder how that would learn to optimize, and whether you could actually squeeze more juice out of the lemon in that sense, in the same way that they try in this paper, combining them together for end-to-end training. Kind of cool. And we'll see what Mamba does next.
Yeah. I will say, MambaFormer is not the best name for a model. I really hope we won't have to keep saying "MambaFormer" all the time in the future.
You hoping that it'll fail just for that reason?
I mean, maybe a little bit. You know, I just would like a catchier title for our neural nets, that's all I'm saying. "Transformer" is just more fun to say. But anyway, yeah, it's a cool paper. And on to the lightning round, where we're going to go quick with a couple more research stories. The first one is MusicRL: Aligning Music Generation to Human Preferences. And that's pretty much what it is. So we've touched on human alignment quite a bit.
That's when you take a language model that has just been trained to do autocomplete, and you align it to do what humans actually want it to do: not just autocomplete, but actually solve whatever task I gave you to solve, and don't give me conspiracy-theory answers even if that's what's most likely given your training data. Well, I kind of misspoke a little bit there; alignment is a general concept in AI, just getting models to do what you want them to do.
And so this paper looks into how we can align music generation models. So, going from just training on autocompletion for music (text-to-music, in this case), they show you can actually fine-tune on human preferences about music, similarly to what you do with large language models, and you get something that's better.
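For flavor, here is a minimal, hedged sketch of the pairwise preference (Bradley-Terry style) reward-model loss that RLHF-style pipelines typically use. This is our generic illustration, not the paper's actual code, and the random tensors are stand-ins for reward scores over generated music clips.

```python
# A minimal, hedged sketch of the Bradley-Terry pairwise preference loss that
# RLHF-style pipelines typically use to train a reward model. This is a generic
# illustration, not the paper's code; the random tensors stand in for reward
# scores assigned to preferred vs. rejected music generations for the same prompt.
import torch
import torch.nn.functional as F

def preference_loss(reward_preferred: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the probability that the human-preferred clip outscores the other.
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# Stand-in reward scores for a batch of 8 preference pairs (placeholders only).
r_preferred = torch.randn(8, requires_grad=True)
r_rejected = torch.randn(8, requires_grad=True)
loss = preference_loss(r_preferred, r_rejected)
loss.backward()
print(loss.item())
```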
And next we have FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design. Boy, that's hard to say. Okay, so FP stands for floating point. It's a hardware story again. One of the things to recognize about current AI workloads is that they are bottlenecked by a thing called the memory wall.
Basically, there's this challenge where, if you're going to run the model, you need to keep re-loading the weights of the model during inference. And sorry, this is just the inference process; it happens during training too, because training involves inference, but training also involves backpropagation. Anyway, during inference, your speed is mainly limited by the time your GPU needs to read the model weights, to actually pull them up
and start to work with them. And so the more floating-point precision you use to represent the weights in your model, the more time that process will take, and the more memory your model will take up on your GPU. For context, GPT-3 is like a 300-gigabyte model, right? It takes up a lot of space. An Nvidia A100 or H100 GPU has only 80 gigabytes of memory.
So essentially you can't fit GPT-3 onto a single chip. And so wouldn't it be great if we could quantize, if we could reduce the resolution of these numbers? Recent studies have shown that the optimal amount of compression, in terms of how many bits we use to represent these numbers, is often six bits. So there's four-bit quantization, there's eight-bit quantization.
Those are very commonly used, and we actually talked earlier today about four-bit and eight-bit integer quantization in the context of an open-source model that was just released. But the middle ground of six bits actually seems like a really good trade-off between the cost of inference and the quality of the model, and there has been no efficient system support for six-bit matrix multiplication on modern GPUs.
And so what this paper is doing is coming up with the first full-stack GPU system design scheme that has support for six-bit quantization, and five-bit and three-bit as well. So essentially they're allowing us to take advantage of that sweet, sweet middle ground. And they have some results: they basically accelerate inference on Llama 2, the 70-billion-parameter version, by anywhere from about 1.7 to 2.7x relative to baseline.
So this really is a pretty big lift. It seems like a niche thing, like why should we care about six-bit representations of these models and efficient systems that allow us to run them that way? But it turns out that six bits is actually a sweet spot between 4 and 8 bits, and if you do it right, you can get a significant speedup. So another interesting story as to how the hardware relates to the efficiency of the models that run on these systems.
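To make the memory-wall point concrete, here is a back-of-the-envelope sketch, our illustration rather than anything from the paper, of how the bit width of the weights translates into the memory needed just to hold a GPT-3-scale model, assuming roughly 175 billion parameters.

```python
# A back-of-the-envelope sketch (our illustration, not from the paper) of why the
# bit width used for the weights decides whether a model fits on one GPU.
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> gigabytes

N_PARAMS = 175e9  # rough public figure for GPT-3's parameter count
for bits in (16, 8, 6, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB "
          "(an A100/H100 has 80 GB of memory)")
```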
Well, I think that was a pretty good summary, so I don't need to expand on that. Next story is AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents. This is a new benchmark and open-source evaluation framework tailored to evaluating LLM agents, so not just large language models that output text and do NLP tasks, but actual agents that require multiple steps to perform some task.
And so, yeah, this is a more advanced way to evaluate LLMs, specifically if you want to look at them as agents.
Yeah. And one of the big challenges with this sort of thing is that usually, when you evaluate agents, you have a success criterion, which is: they either successfully completed the whole task you're testing for, or they failed. And the challenge with that arrangement is that often the tasks you're evaluating have many steps.
That's kind of the point of an agent, right? They take a complex instruction, they break it down into substeps, and they farm each of those substeps out to a different instance of themselves or a different language model. Now the challenge is, this gives you a really low-resolution picture of what's going on,
right? If one agent gets 90% of the way through a task and just fails at the very end, and another falls flat on its face right out of the gate, you're actually not measuring the difference between them. And so what they're doing here is taking a bunch of different tasks. They have nine unique tasks and over 1,000 example environments where they can test these agents.
And in each case they have a bunch of subgoals, annotated manually by the way, for each of these tests, which allows you to have a progress rate metric: not just a measure like "did they succeed or fail overall," but, at a higher level of granularity, which steps did they actually achieve? And this metric apparently reveals significant progress
that you otherwise might have missed. So when you look at their success rates, you can actually see: on the surface it looks like two models are doing comparably, because they're both getting really low overall success rates, but one of them is getting 70% of the way through the task and the other just 20% or something. So this gives us a higher-resolution picture of what's going on. They also draw some conclusions
about which models are best. No surprise here: GPT-4 powers agents best. But the next leading model among open-source systems is, I believe, a Chinese model called DeepSeek. So that's another interesting note, and that will have changed, of course, in the week or two since the paper came out, because we've had a bunch more open-source LLMs. But anyway, kind of an interesting new strategy for measuring progress in agent LLMs.
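Here is a minimal sketch of the idea behind a progress-rate metric, as we understand it; this is our paraphrase of the concept, not AgentBoard's actual implementation, and the two toy runs are made up.

```python
# A minimal sketch of the idea behind a progress-rate metric (our paraphrase of
# the concept, not AgentBoard's implementation); the two toy runs are made up.
def progress_rate(subgoals_achieved):
    return sum(subgoals_achieved) / len(subgoals_achieved)

def overall_success(subgoals_achieved):
    return all(subgoals_achieved)

agent_a = [True, True, True, True, False]     # fails only at the final step
agent_b = [True, False, False, False, False]  # falls over almost immediately

for name, run in (("A", agent_a), ("B", agent_b)):
    print(name, "success:", overall_success(run), "progress:", progress_rate(run))
# Both runs "fail" outright, but progress rate (0.8 vs. 0.2) separates them.
```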
Yes. And I think it's indicative of one of the major open problems, or research and development directions, throughout the field: getting from models to agents. You know, there are various examples, like the Arc browser, for instance, where you want your LLM to sort of be an agent to some extent. So yeah, it's indicative of interest in that direction to get a whole benchmark and a fancy way to measure things, to be able to tell if we're making
progress. And one last paper for the section: Specialized Language Models with Cheap Inference from Limited Domain Data, this one coming from Apple, where they show that you can customize a large language model, essentially specialize it to a particular domain, to make it cheaper to run. So that's the high-level summary; they talk about things like the specialization budget for training on a more specific domain, and about reducing the inference budget.
So, given a more specific target task that you want to achieve, they show a certain way of being able to do that. Okay, on to Policy and Safety, starting out with "EU AI Act passes last big hurdle on the way to adoption." So if you've been listening, just from the title you can tell this might be kind of a big deal.
The EU AI Act has actually been in the works for years and is really, really large; it's a huge effort the EU has been putting together for quite a while. And for the last few months there have been some issues with it in terms of what to do with LLMs and open source. There was some back and forth, some problems, but it kind of got resolved, I think a month or a couple of months ago.
And now this is reporting that the last major hurdle has been cleared for it to be voted on. And what that last hurdle is, is actually finalizing the proposal. So the final text of the EU AI Act has been approved in a vote. The Act itself hasn't been approved; it will still need to be voted on by all the EU member states and
so on. But the text itself has been, and that has been the effort for quite a while now, for the last few months and even before that: just agreeing on what the proposal is. So with that having been done, it seems like the way is clear for voting to happen, and presumably it will be approved. So it's kind of a matter of not too much time until the Act is done and actually becomes law, which means we can
finally stop talking about it, you know, I thought.
I know, right? Why is this taking so long?
But quite the story arc. Yeah.
Yes. So yeah, it's quite an important deal. This is, as we've said before, one of the biggest regulatory efforts worldwide in terms of doing a lot of stuff: at a high level, it establishes different risk categories for different applications of AI, and it creates different requirements for companies and developers of AI to follow depending on the risk of said application.
Yeah. One of the really controversial things that kept everybody up to late hours during the negotiation process towards the end, as you might recall, is the idea of foundation models, and how general-purpose models were going to be dealt with, if at all, by the AI Act. It turns out that the Act now includes a provision specifically for those kinds of systems. And so I guess that's what's being adopted, or at least approved to the next stage, at this point.
So kind of a bit of a win for the Max Tegmark, Yoshua Bengio, Stuart Russell, Geoff Hinton crew, who were pushing for the inclusion of foundation models in the regulation, because I believe it is currently in there.
Now, as with any one of these big regulation efforts, even once it's passed there is going to be a phased entry into actually being enforced. So, for instance, for these foundation models, general-purpose AI, those rules don't apply until 2025. So the actual impacts of the EU AI Act will start to gradually come about once it does become law. But yeah, I think the saga of the EU AI Act being put together is finally done.
And up next we have "Building an early warning system for LLM-aided biological threat creation." This is actually a blog post out of OpenAI, and the question they're trying to answer is: do large language models like GPT-4 meaningfully increase the extent to which people can create, access, or develop biological threats?
Right. So this is a really important policy issue currently. The White House's executive order that came out a couple of months ago explicitly has a carve-out for biological threats generated by AI systems; there's a reporting requirement if you're training a model on biological sequence data with something like 10^23 flops or more. Basically, they're really concerned about this possibility, this use case, for language models as well.
So that's what they're trying to test here. This, by the way, is on the heels of another piece of research that was put out by the RAND Corporation. If you don't know the RAND Corporation, they're actually really important in policy circles. They're a company that partners a lot with the US government; they're actually leading the implementation of a lot of the White House executive order stuff on AI model evaluations.
And they concluded that LLMs do not increase information access that's relevant to biological threat creation at this time. But, and this is a big but, they did not have access to the research-only version of GPT-4 when they did their study. They had access to the version of GPT-4 that you and I can access, the version that will say no if you ask it to, you know, help design a bioweapon.
And so what OpenAI is doing here, and I won't get into the details of the study too much, is they created two cohorts. One was a bunch of PhDs who really knew their biology; the other was undergrads who kind of knew their biology. They gave the PhDs access to the research-only version of GPT-4, the whole enchilada, and the undergrads were not given access to that one. What they found is really interesting.
They divide the process of creating a biological threat into five different stages, and at each of those stages they try to measure: does access to GPT-4 give this cohort a leg up compared to people who can only access Google, or the internet generally? So there's the initial ideation stage all the way down to the execution stage. And what they find is, really, the issue is they don't have enough data. They have only about 50 people in
each cohort, and they don't get statistically significant results. They do see an uptick in effectiveness at making the biological threat, but they don't see a statistically significant uptick at any one of those individual five stages that they test. But if you zoom out and look at the process as a whole, the results are statistically significant.
They do see what appears to be an uplift in what they call total accuracy, the overall ability to generate one of these threats. And so what's interesting here is that this result is meaningfully different from the RAND Corporation's previous results. We actually do see an increase, measured by the way on a scale of 1 to 10, of about one point,
for the PhDs at least. So they find that they can take a PhD from, you know, a seven out of ten dangerous to an eight out of ten dangerous with this approach. I think this is really interesting for two reasons. The first is it suggests that we're not constructing a serious auditing process if we don't give auditors access to the full research-only
base model, right? The RAND Corporation came up with a completely different and much less concerning conclusion because they did not have the level of access that would have been needed to do this properly, if OpenAI's results here are correct. So that's one piece of data, I think, for policymakers to be tracking: yes, it does really matter, you do need to give your auditors a lot of access to some sensitive company IP.
The second piece is, think about the history of language models and the capabilities that have emerged. GPT-2 was barely able to do language translation; it would give you a little bit of something, like, oh yeah, I can see that's a French word for whatever, but it was not really good. GPT-3, with a little bit more scale, in other words a little bit more progress,
and all of a sudden, boom, this thing could actually usefully do language translation. When it comes to scaling, you tend to find a little tiny hint, if you're lucky, of a capability that emerges, a barely noticeable inkling of it, and quite often at the next level of scale, boom, you hit something reasonably close, in some cases, to human-level capability.
And so when you start to see just a little bit, like we have our first hint here, our first shot across the bow, and this is a statistically significant result, at least at the overall total-accuracy level, I wouldn't be surprised if GPT-5, the research-only version at least, shows a marked increase here.
And I think that this now puts us on a scaling trajectory to plausibly, plausibly, it's not guaranteed, but plausibly, introduce the very kind of biorisk that the White House executive order was concerned with. So I think it's a really interesting data point. Kudos to OpenAI for putting together this study. And it does invite us to think about how we audit these systems and the level of access we need to give to auditors as they do it.
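Just to illustrate the per-stage versus aggregate significance point, here is a toy sketch with entirely made-up numbers; it is not OpenAI's data or protocol, just a demonstration of how a small uplift can be hard to detect stage by stage with small cohorts yet show up more clearly when you pool across stages.

```python
# A toy sketch with entirely made-up numbers (not OpenAI's data or protocol):
# a small per-stage uplift can be hard to detect stage by stage with small
# cohorts, yet show up more clearly when you pool across stages.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_stages, n_per_group = 5, 25

# Hypothetical 1-10 accuracy scores; the "model access" group gets a small boost.
control = rng.normal(loc=6.0, scale=1.5, size=(n_per_group, n_stages))
treated = rng.normal(loc=6.5, scale=1.5, size=(n_per_group, n_stages))

for s in range(n_stages):  # per-stage tests: often underpowered at this sample size
    _, p = stats.ttest_ind(treated[:, s], control[:, s], equal_var=False)
    print(f"stage {s + 1}: p = {p:.3f}")

# Pooled "total" score: averaging across stages reduces noise, so the same small
# uplift is more likely to reach statistical significance.
_, p = stats.ttest_ind(treated.mean(axis=1), control.mean(axis=1), equal_var=False)
print(f"total: p = {p:.3f}")
```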
They do say that this is considered a starting point for continued research and community deliberation, and this is done as part of their preparedness framework that we covered, I think, about a month ago, where they highlight, basically, as you say, being prepared for the possibility that as you scale and improve these models, different risks, like, for instance, biological threat creation, arise.
So yeah, it seems, as you said, very notable that OpenAI is continuing to push in this direction in a pretty significant way. I mean, if you look through this, it's a big research study on this particular topic, and I'm sure they have other related initiatives on other vectors like cybersecurity, not just biological threats. So, interesting new insights here, and it kind of makes me feel, I don't know, a bit reassured, generally, for people who are worried about safety and scaling and so on.
I feel like the fact that OpenAI is investing in this level of preparation and understanding of what might happen is definitely a good sign in general. And on to the lightning round. First up: FCC votes to ban scam robocalls that use AI-generated voices. And that is the story. As we covered, I think last week, there was a robocall where an AI-generated voice of President Biden talked about the elections, and so this is basically following up swiftly on that.
It is now outlawed to have robocalls that use AI generated voices.
You can't do anything for fun these days.
I know, I know, how restrictive. We can't even have scam robocalls anymore. But yeah, it's probably good news for all of us, because I'm sure there would be a flurry of these sorts of things happening if there were no restrictions in place.
But next we have: Biden administration names a director of the new AI Safety Institute. By the way, there's an AI Safety Institute consortium that was recently announced as well, and I am proud to say that Gladstone AI is now a member. Anyway, I guess that's not new news now, but still. So yeah, the Biden administration has appointed Elizabeth Kelly, who is a top White House aide, as the new director of this AI Safety Institute. This is a follow-on to the executive order.
It's based at NIST, the National Institute of Standards and Technology. NIST is charged in the executive order with doing a whole bunch of stuff around AI model audits, testing, and evaluation. She had formerly been an economic policy advisor to Joe Biden, and she was actually really important in drafting the executive order, the one that established the
institute. So yeah, I think it's going to be interesting. She's got a big set of shoes to fill here. One thing to keep in mind, by the way: this executive order that established the AI Safety Institute is probably going to be stricken, like the EO is probably going to be ripped out, if Donald Trump is elected in the next electoral cycle. So I'm curious what ends up happening to some of these
institutions after they're set up. I feel like I should know the answer, but yeah, it's going to be one of those interesting political questions as to what comes next. And, sorry, I'm sidetracking, but I've heard Trump make some noises about how he's concerned about AI and sort of considers some of the risks, the threat models, pretty
seriously. So that's sort of interesting and encouraging, if only because it means we at least have some degree of attention on this issue no matter who wins in the next election cycle.
Right, next story: OpenAI's GPT-4 finally meets its match: Scots Gaelic smashes safety guardrails. That is a somewhat confusing headline, and what it means is that you can get around the safety guardrails that prevent GPT-4 from outputting, let's say, harmful things, like misinformation or, I don't know, scam email templates or something like that. You can get around those by translating prompts into uncommon languages like Zulu, Scots Gaelic,
or Hmong. That's the research result here, from Brown University, and it's yet another of these prompt hack attacks, as they're called, that has been uncovered.
I'm surprised at how well it worked, and I'm also surprised that this is the first time we're hearing about an academic study on this. But apparently they were able to bypass safety guardrails, get this, 79% of the time using these languages. So it's comparable to other jailbreaking methods, but those methods tend to be way more complex, way more technical, and harder to pull off.
So the fact that you can just pivot to a different language and make your request, that's something. I should mention the comparison, right: if you ask those same prompts in English, it turns out they get blocked 99% of the time. So this genuinely is a huge delta; you're going from a 1% success rate to 79% just by changing your language to Zulu or Scots Gaelic and so on. So pretty wild.
And on to Synthetic Media and Art, our last section, with a few stories. First one is: AI poisoning tool Nightshade receives 250,000 downloads in five days. Nightshade is a free downloadable tool created by researchers at the University of Chicago, designed to be used by artists to disrupt AI models from scraping and training on their artwork without consent.
So the idea of poisoning here is that, before you upload images you've created, you can run them through Nightshade and it will mess with your image a little bit, or potentially a lot in some cases, so that if someone were to use it in their training data, that would be bad for the resulting model.
And so the number of downloads, 250,000 in five days, is really indicative that there is still a lot of resentment and, you know, a very concerted effort by artists to fight back against AI, and in particular against this fair use argument that any image online can be scraped to be used for training. It seems like a lot of people are very much opposed to that, and this is one way to express it.
Yeah. And the same team apparently had earlier made another tool called Glaze, which works to prevent AI models from learning an artist's signature style by subtly altering pixels so they appear to be something else. So basically, you try to make it harder for the model to latch on to the style that an artist is using. That one has received 2.2 million downloads since its April release, so it's also very, very popular.
And what they're now working on is a tool to combine the more defensive strategy that Glaze uses, trying to make it harder for an AI to catch on to your style, with Nightshade, which is maybe more offensive since it's a poisoning strategy. And they're also saying that they're going to be open-sourcing a version of Nightshade pretty soon too, or at least they anticipate doing that. You know, I thought this was interesting: VentureBeat, which is where this
article is from, said in the article that they actually use some of these tools to create article imagery and some other content. So kind of an interesting meta-game there.
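For intuition only, here is a toy sketch of the general "subtly alter the pixels before you post" idea. This is emphatically not the Nightshade or Glaze algorithm; those compute carefully optimized perturbations targeted at models' feature representations, whereas this stand-in just adds small bounded random noise, and the file names are placeholders.

```python
# A toy sketch of the general "subtly alter the pixels before you post" idea.
# This is NOT the Nightshade or Glaze algorithm; those compute optimized,
# model-targeted perturbations. Here we just add small bounded random noise,
# and the file names are placeholders.
import numpy as np
from PIL import Image

def perturb(path_in: str, path_out: str, strength: float = 4.0) -> None:
    img = np.asarray(Image.open(path_in).convert("RGB"), dtype=np.float32)
    noise = np.random.uniform(-strength, strength, size=img.shape)
    out = np.clip(img + noise, 0, 255).astype(np.uint8)
    Image.fromarray(out).save(path_out)

perturb("artwork.png", "artwork_protected.png")
```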
Next up: Labeling AI-generated images on Facebook, Instagram and Threads. This is an announcement from Meta. There is going to be an "Imagined with AI" label on any photorealistic images created with any of the Meta AI features, and you can now do image generation in various places on Facebook, Instagram and Threads.
Those are visible markers, but there will also be invisible watermarks and metadata embedded within the image files. So, presumably, people can try and delete the visible markers, but there are still going to be these invisible watermarks for software tools to be able to detect, and also the metadata, so that if you just read the file you will be able to tell that, let's say, it's AI-generated.
But Andrey, isn't all of Facebook's data metadata? Oh, anyway. Facebook is also adding a feature where they're going to allow users to disclose when they share AI-generated material. And I think this is actually kind of interesting, not because I would expect it necessarily to solve any problem associated with real-time sharing; the issue there is, you share something, it goes viral, and then you issue a correction later or whatever.
But first off, this is interesting in the long run because it gives Meta access to a dataset of tagged images that are known to be AI-generated, so it probably allows them to do a little more effective detection. The second piece is, they may not be able to catch you in time, but if there's the threat that, hey, we're collecting this dataset and we're getting better and better at identifying those images, we may in retrospect be able to determine that you shared something
false. And if that happens, you at least have the threat of future sanction implicitly imposed on you by this process. Probably not going to be enough to stop bot accounts and stuff like that, but it's a decent strategy given where Facebook is at and what their exposure is to this risk, which is, you know, very, very high.
On the invisible watermarks, they also point out in this post that they are collaborating on standards used across the industry, and one of them is C2PA, the Coalition for Content Provenance and Authenticity standard. So it looks like, more than likely, industry as a whole is converging on the use of invisible watermarks, metadata, all this sort of stuff, and this announcement from Meta is indicative of that happening.
And in fact, in our lightning round, the very next story is: OpenAI is adding new watermarks to DALL-E 3, and it is pretty much that same story. They're adding these C2PA, Coalition for Content Provenance and Authenticity, watermarks to images generated on OpenAI's website, starting, I think, this next week. So yeah, it's kind of exactly the same thing we just covered with Meta, and I believe this is already also being done by Adobe and Microsoft, which
are also in this C2PA coalition. So yeah, there's now a standard, and I imagine in the near future, if you look at an image on Reddit or Twitter or something, potentially the platform will be looking for these watermarks and will actually notify you that it is AI-generated, based on the existence of the watermark or some metadata that is now becoming the industry standard.
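To give a flavor of the metadata side of this, here is a hedged, illustrative sketch of checking an image file's embedded metadata for a provenance marker. It is not Meta's or OpenAI's actual implementation, the marker strings are hypothetical, and real C2PA manifests are more involved than simple text fields.

```python
# A hedged, illustrative sketch (not Meta's or OpenAI's actual implementation) of
# checking an image file's embedded metadata for a provenance marker. The strings
# searched for below are hypothetical examples, not the real C2PA manifest format,
# which is more involved than simple text fields.
from PIL import Image

def looks_ai_labeled(path: str) -> bool:
    img = Image.open(path)
    blobs = [str(v) for v in img.info.values()]        # format-specific info/text chunks
    blobs += [str(v) for v in img.getexif().values()]  # EXIF fields, if any
    markers = ("c2pa", "ai generated", "imagined with ai")  # hypothetical markers
    return any(m in blob.lower() for blob in blobs for m in markers)

print(looks_ai_labeled("downloaded_image.png"))  # placeholder file name
```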
Yeah, it's worth noting it's also not a huge coincidence that this is all happening right now, all of a sudden. The executive order does require some measure of investigation into these sorts of solutions, if I recall. So these are companies doing the right thing here and trying to get ahead of any exposure they may have to that process. So, not surprisingly, Meta, OpenAI, and all these companies are following suit.
And on to our very last story. This one is not too significant, but it is a follow-up on some previous stories. It's about the AI George Carlin comedy special that caused some controversy; later, we covered, I think last week, a thing in Iceland that was kind of similar. Well, that AI comedy special was actually human-written. Following the lawsuit, the person behind it went on to say that, despite claiming it was AI-written, it was actually human-written.
And that's, yeah, an interesting development. It's kind of funny, people passing things off as being made by AI because of the novelty, and then it turning out that maybe it wasn't; AI was maybe used partially, but there was a lot of human involvement and massaging to make it what it was, and that was presumably kind of the case here. So, no huge consequences,
but I think it's worth mentioning, for people who've been listening and following the story, that there's now a new development in it.
Yeah. The biggest takeaway for me was that the special's title is "George Carlin: I'm Glad I'm Dead," which, you know, thanks to generative AI for, I guess, giving us comedy specials with those titles now. That's a new genre. I thought that was really good. So yeah. And I'm curious to see what actual AI can do with George Carlin's stuff in the future, but we'll have to wait to see what the copyright rules are for that, too.
And with that, we are done with this latest episode of Last Week in AI. Once again, you can go to lastweekin.ai for the text newsletter that covers even more AI news, if somehow this is not enough, and you can email us at contact at lastweekin.ai if you have any suggestions for the podcast or our newsletter. And you can also email hello at gladstone.ai to, I guess, ask about a job or just chat, you know, whatever.
And if you are so inclined, we would appreciate it if you review the show, if you share it, if you make us more famous and reputed and so on. You know, that sounds cool. That's always nice, but no pressure, really. What we care about is that people actually get a benefit from us recording this, so please keep tuning in.