News is dropping, so let's get in tune, AI brought to innovation so soon, X unleashed through the images, fly, flag on marble, we're reaching the sky. And that's it for the
intro song, by the way. I use, like, the same song for the intro and the outro, and it's clipped at the beginning. And you can hear a full version of that song at the end, in case you didn't know. But anyway, welcome to Last Week in AI, where you can hear us chat about what's going on with AI. And as usual, we will summarize last week's most interesting AI news. I am one of your hosts, Andrey Kurenkov.
I finished a PhD at Stanford, now work at a generative AI startup, and apparently am not the best podcast editor.
Oh, wow. Dude, don't, don't be like that to yourself. You gotta, you gotta show yourself more appreciation than that. We're millennials. Okay. Let me explain what just happened. So I get an email this morning from, uh, from some great listeners of the show, uh, who were like, Hey! Um, uh, you know, the latest episode only has your voice in it, which, um, is not, if you're curious, like that is not the best part of this show.
Uh, so Andrey has been scrambling to fix that, and there's going to be a fix up. Well, obviously, by the time you listen to this, there's going to be a fix up, I should say, hopefully not another bug. Um, and there you go. So thank you. Andrey is a one-man editing shop, by the way, so we record these episodes and then he gets cracking on the editing.
So the fact that this is the first time something like this has happened in a long time, I think is, is a hell of a statement, Andre, to your, your dedication to the craft.
With that being addressed, uh, FYI, that's Jeremy speaking. Oh yeah,
that's right. Yeah. If you know, you know. I'm Jeremy, I'm the co-founder of Gladstone AI, an AI national security company, blah, blah, blah. Yeah, welcome to the show. I guess that was an unusual kickoff
and real quick before we dive into the news, an unusually light news week this week, actually, so this might be a bit of a shorter episode. I do want to say I've been mentioning the Apple reviews a bit and wanted to get to a nice round number of 200 ratings. We reached 201, so I've crossed that threshold and now it's not a round number anymore, but thank you to those who reviewed, and thanks for the feedback.
Lightspeed226 has some good comments: enjoying the podcast, likes Jeremy explaining AI architecture and papers where we get into the nitty-gritty of details. And I do think that's one of the nice things we enjoy doing. They did say they prefer less geopolitical stuff and China, which we will, you know, take into account. It's good to hear what people prefer. And this week, unlike most weeks, I don't think we actually have any geopolitical stuff. I feel like
there's like one. There's one, if I recall, it's about China. Surprise, surprise, there's always going to be one. There's another review which I find especially intriguing because it's from QQQQQQQQQ-star. So that's probably OpenAI's next, next, next, next, next model that is already running rampant, most likely an agent, I'm assuming, on the internet. It just says, I love this podcast. So there you go. That's a good catch.
I did not see that username and that's pretty funny. Can we say that this is an OpenAI-endorsed podcast now? Is that what this allows us to say?
I mean, Q-star, for context, if you don't know, is like a code name for a project at OpenAI that's been kind of discussed for a long time. So apparently this user is probably just an AI chatbot, and OpenAI officially endorses us now. Alrighty, moving on to the news, starting with tools and apps. And the first story is a bit of a big deal: we have Grok 2 coming out in beta, and with AI image generation. So Grok is the chatbot that you can access on X if you have X Premium or Premium Plus.
This is the chatbot competitor to ChatGPT being developed by xAI and led by Elon Musk. And so this was announced in a blog post with not a ton of detail as to Grok 2. Essentially all the blog post said is, we have Grok 2 and Grok 2 Mini and we beat a bunch of other models on the leaderboard, which was a bit of a surprise. It apparently did outperform Claude 3.5 Sonnet and GPT-4 Turbo. Although, I will note, that is only if you include refusals.
If you exclude things where a model says, I cannot answer, it's not quite the same news. And as to the image generation features, that's coming via Flux. So we covered this recently: Black Forest Labs, a startup coming out of the people who developed Stable Diffusion, has a really impressive model. Now everyone on X can use it. There are no restrictions, or very few restrictions, on what you can generate, and people have been generating some pretty crazy stuff. Yeah.
And to your point, I mean, there is not a lot of detail in here, which is kind of odd, right? Because to a certain degree, like, xAI's big differentiator and Twitter's big differentiator has been openness, you know, open source and all that stuff. So we're going to have to see how much openness there is around Grok 2. That certainly is a big part of the gripe that Elon has with OpenAI, right?
About turning around on their sort of former commitment to openness, at least as many people interpreted it. One of the few things I was able to pick up from the blog post, based on their description: for what it's worth, it does look like they actually did use something like reinforcement learning from human feedback, maybe PPO, something like that, to fine-tune their models.
They talk about getting these AI tutors to provide preference data by ranking outputs that they get from the model. Sounds an awful lot like the RLHF fine-tuning process. Maybe not surprising, I mean, it is pretty standard fare these days, but it is also interesting because Grok is supposed to be, you know, more unhinged, less fine-tuned to respond in certain ways.
So at least we know there is, in fact, that step in the pipeline. Again, not shocking, but still kind of something you might've been wondering about, given the whole kind of commitments around unshackling Grok and all that.
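For reference, the standard reward-model objective behind that kind of ranked preference data, the textbook RLHF recipe rather than anything xAI has confirmed about Grok's pipeline, is a pairwise loss over the higher- and lower-ranked responses:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

where $x$ is the prompt, $y_w$ is the response the tutor ranked higher, $y_l$ is the one ranked lower, and $r_\theta$ is the reward model; the policy is then fine-tuned against that learned reward with something like PPO.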
We're also learning that, if you're knee-deep in the kind of Twitterverse here, the X-verse I should say, there was a lot of speculation about a model on the LMSYS leaderboard that was posted under the name sus-column-r. So we are now learning that an early version of Grok 2 was in fact sus-column-r. So there you go.
That's where we saw it outperforming Claude 3.5 Sonnet and GPT-4 Turbo, with all the caveats, Andrey, that you included. You know, again, it's like the differentiator of this model is that it will respond to any request, right? It's supposedly a lot less safety fine-tuned, a lot less inclined to decline to answer things.
We're also seeing that, obviously, with the image generation feature, a lot of people commented on that. You know, there's concern about the usual sort of misinformation concerns that you'll see people surface. In this context it's especially interesting because there's absolutely no constraint, right, on what Black Forest Labs' FLUX.1 model is actually being allowed to produce through the Grok interface.
We've seen, you know, images circulate on social of, like, Donald Trump kissing Elon Musk and doing all kinds of stuff. So it's what you'd expect, basically. Some details about, you know, the enterprise release of Grok 2, which apparently is imminent. That's something I thought was kind of interesting. You know, X is certainly leaning into the enterprise play, and there's a tighter and tighter integration between xAI and X, or, as I'll say, formerly Twitter.
They're certainly seeming to work much more closely together. So, really interesting to see both the capability of this model, some of the context around it, and also the fact that there is so little data available. We'll see if more comes online in the coming weeks.
Right. Yeah. I mean, we've gotten used to these announcements from big companies not including many details, but even relative to that, even relative to what we hear from OpenAI and DeepMind and so on, this had even less detail on the model. And I guess we are downplaying it a little bit in our coverage. I think the big thing that media and most people have been paying attention to is the image generation.
So Grok 2, I guess, from a technical perspective, from an AI progress perspective, is more interesting. And FYI, it does do well on benchmarks. So even aside from the leaderboard where humans rank models, and where it might get a benefit from not refusing anything, the benchmark numbers are also pretty impressive. But yeah, as you said, with the image generation, you end up having, like, Mickey Mouse smoking weed and drinking beer. You get, like, crazy things.
You get, like, Mickey Mouse with an assault rifle and, like, people doing school shootings, you know, these kinds of things. So, an interesting bet that they'll be able to get away with this, I think.
Yeah. And, you know, the FLUX.1 model, and this may have been the podcast that you did with John a couple of weeks ago when I wasn't there, so apologies if this is somewhat repetitive, but it is an impressive model, right? I mean, it apparently does surpass Midjourney's and OpenAI's image generators in terms of the quality of the images, according to basically an arena ranking, a kind of Elo-based scoring.
So it does seem legitimately to produce better outputs. And there's also apparently a text-to-video model coming soon. So we'll see if FLUX.1's text-to-video model ends up being incorporated into the xAI, sorry, into the X, formerly known as Twitter, interface. Cause I think that'll be a whole other kettle of fish at that point.
That's right. And I will comment qualitatively, as you said, the model is impressive. Like, when I was looking on Twitter and seeing these new images, I was pretty taken aback by the quality. And at first it was like, wow, did they actually train a model this good internally? But no, it was using an existing model from a very promising startup. So if nothing else, you know, we are seeing some progress in image generation.
And speaking of model updates with few details, the next story is OpenAI reveals an updated GPT-4o model, but can't explain how it's better. So that's kind of the gist of it. There was kind of a weird announcement from OpenAI on Twitter where they said, by the way, we've been serving a new, improved ChatGPT variant since last week, but that was kind of all the details we got.
And they updated the naming to chatgpt-4o-latest, and have just said that it has bug fixes and performance improvements that users tend to prefer. So, you know, hard to say. It seems like maybe OpenAI is trying to move a bit faster, trying to keep up with the news. And as we'll get into in the next story, Google is putting a bit of pressure on them.
Yeah, absolutely. And, you know, there were some early indications that something like this may have happened. Andrey and I were just talking before the show, before the recording at least, about the kind of drama in the Twitterverse and the experts, and how, anyway, if you know, you know, but there's this account called Pliny the Prompter, and he's been doing all kinds of stuff.
But one of the things he was pointing out last week was he noticed a difference between the OpenAI of like a week ago and the new, sorry, the ChatGPT, sorry, the GPT-4o of like a week ago and the current one. So he was kind of early to the game here, and Pliny the Prompter, we've talked about him a little bit in the past.
This is an account that does incredibly good jailbreaks of language models across the board, not just OpenAI's language models, but Anthropic's as well. So, you know, this is at least kind of noticeable to people who are really, really good at prompting strategy. And he certainly is a bit of a wizard on that stuff.
Other than that, yeah, it just seems to be, you know, the product of a lot of speculation. A lot of people were wondering, you know, is this update the famous Strawberry model, which, of course, anyway, I'd be quite shocked if that were the case. It isn't, at least, a highly detectable upgrade, but certainly an iterative improvement.
Next up, that Google story, and it is that Google Gemini's voice chat mode is here. So Google just held an event for the Pixel 9, one of their smartphones, and as you've seen, they focused a lot on AI. And it seems that Gemini Live, that voice chat mode that is similar to the GPT-4o demos we had, real-time conversational interactions with a chatbot, with, you know, very fast response times and a very natural-sounding voice from the AI,
is now coming to Gemini Advanced subscribers. And in addition to talking, it can also interpret video. Again, similar to what you've already seen with demos in the past; Google also showcased this a few months ago. In this release, you will be able to choose between 10 Gemini voices, and currently it's only available in English for Android devices. So there you go. In this case, you know, we've seen some delays from OpenAI on
expanding access to the voice mode, and of course Apple, as we've just covered, has delayed Apple Intelligence. So I'm sure, you know, investors in Google are happy about this.
Yeah. I mean, you know, depending on how it turns out, of course. One of the advantages of rolling it out first, as we've seen with OpenAI, is customers can generally be a bit more forgiving, because you are the first time they're seeing this capability. And so when they see failure modes, or when they see cases where it's being misused, they can kind of go, eh, well, you know, fair is fair.
It's part of the sort of learning process, if you will. With Google, a more established company, people probably will have less of that patience, but OpenAI is kind of maturing as well. So we'll see. But, yeah, it has this interesting ask-about-this-screen mode as well that they're advertising, or ask-about-this-video. So it can give you information, like, pulling data from the screen that you're looking at, which is kind of interesting.
And another way of interacting with this model. They also say more languages besides English are incoming, so presumably that'll be soon. Yeah. So we'll see if this leads to a wider release of voice mode at OpenAI in the coming days or weeks
and on to the lightning round, with just one more story related to Gemini. It is that, in addition to that announcement of the voice chat, they also announced that Gemini is coming to the Pixel Buds Pro 2, their Bluetooth-enabled headphones, very similar to the Apple one, I forget what it's called. And so the idea here is you'll have access to Gemini, to have that sort of conversational back and forth directly through the headphones.
And I think this is mainly worth highlighting because we have been talking about all these different hardware releases, like the Humane AI Pin, the Rabbit R1, things that come out and promise to, like, deliver AI via a hardware device and have largely failed. This to me is kind of that idea, right? Of like, this plays music, but also you can chat to an AI in real time through a hardware device. And if anything has a chance to succeed, I would imagine something like this would make sense. It's true.
Yeah,
it is. I mean, you know, it's part not only of a long line of kind of disappointing releases on the hardware side, you know, Humane's AI Pin, Rabbit R1, we've talked about both of those, but also, specifically with Google, you know, going back to Google Glass, they have had a rough time with some of these releases. I think, you know, at this point this is just a tough nut to crack, but, you know, whoever does, maybe, maybe there's an iPhone moment there.
I'm sure there is at least one iPhone product's worth of stuff in this space by now. So, you know, we'll have to see. It's interesting when you look at it, it kind of looks like, if you've ever had, like, an AirTag, imagine an AirTag that's like half the size, that sits right inside your ear. If you get a white one, at least, it's a little AirTag-like thing. So, you know, it's fairly discreet.
But it is going to be something to get used to as people, you know, start to talk to themselves in this way, or rather talk to AI systems in this way. They do say they have a new chip that's made it possible to reduce the size of these buds by, they say, 27%, with faster processing speeds, and as well a battery life increase to apparently 12 hours. And that's on the buds.
And then, when you combine it with the charging case that comes with it, 48 hours as well. So a lot, a lot of focus on the, you know, edge device deployment side. The overall thing will set you back 229 bucks, and they start shipping September 26th. Hashtag very much not an ad. We have no idea how this is going to turn out.
And, you know, the whole marketing around this too, we've seen this with another kind of set of releases in this space, is, you know, they'll talk about it like, oh, this is like an intimate companion or a close confidant, in this case. It's interesting. Like, I'm curious how that's going to resonate with different demographics.
You could see that being a little bit creepy as a thing to introduce, but I mean, I think, you know, talking to my friends in, like, San Francisco, Andrey, I'm sure it's the same for you, like, people are a lot more open to this kind of thing, and easing into a Her-style movie ever so slightly every day.
Yeah, I would imagine if there's an audience to try this with, it would be people who have the Pixel, you know, flagship phone and can buy $230 headphones. That's a good point. And now one more story on Google, not on that event. This one is about their AI-generated search summaries, and those getting a small update. So they are expanding them to six new countries and also changing the way that citations are shown.
So before, they were included in the text of the little summary you get that answers your question with sort of a chatbot-generated summary of different sources, websites that it has access to. So it was kind of not easy to see those links necessarily before. Now they're making that much more prominent, like, to the right of the text you see.
And this article does say that early testing of the feature is indicating higher traffic to publisher sites, which I'm sure publishers will be happy with.
Yeah, I think, you know, this is interesting because, you know, we talked about this earlier in the context of SearchGPT, that when you read the description of this new take on search summaries, it sounds quite similar, right? You had that tab with the kind of websites that are being referenced, and then you have the main thing with the AI-generated explanation. So at the time, we talked about how some kind of fundamental shift away from just the user interface
look, the design of traditional search, is going to have to happen. You're going to have to change the search experience in a fundamental way if you're going to take on Google, because that product really does seem so optimized, right? Bing didn't have much success out of the gate trying to compete with Google, trying to take away market share, even when they integrated GPT-4. And so, yeah, you need to do something significant to rattle the cage.
This is Google, seemingly, I mean, jumping on what OpenAI is doing with SearchGPT and saying, nah, we're going to cover this base as well. It'll be interesting to see. I mean, there are a couple of kind of unique aspects of generative AI based search that are clearly leading to different product thinking in this case. So, in particular, you know, there's this whole idea that they want to make it easy to save an AI overview that's generated.
Now, there's no notion of saving a traditional Google search, right? You do a search, you get the thing, and then it's not like you want to save that list of links. Well, in this case, because we're using generative AI, you get actual content. So, you know, saving becomes a thing that's more relevant. There are a whole bunch of different, you know, and there's also a button they're setting up that allows you to simplify AI overviews.
So if you get a response that's, like, more complicated, they're allowing you to just hit a button and get a more simple version of that response back. So all of these are really interesting user experience experiments that we're seeing play out right now in real time, as people just try to figure out what is the right user experience for this category of product. So we'll see. There's been a launch, obviously, of AI Overviews in the US; that was back in May.
Of course, we all know that fondly, I remember it fondly, because of the rough start it got off to, where we found it telling users to put glue on pizza to help the cheese stick, and eat rocks, and all that stuff that we talked about at the time. But now we're seeing a wider release, to the UK, India, Japan, Indonesia, Mexico, and Brazil. So this is a wider rollout in addition to all these new features.
And now one more story, going to Anthropic now: prompt caching is now available on the Anthropic API. So caching is something I think we've mentioned once before. The idea is that if you use the same introduction to your context, if you have some sort of instructions that you reuse all the time as you use Claude, well, now you will be able to cache that and reuse it, meaning that it will be cheaper and faster.
So, when you do write to the cache, there's going to be an increased input token cost, but then you will be able to benefit from that and have pretty significant cost efficiency and processing speed improvements. So, you know, probably not too exciting for usual users, but for AI developers, I think this is definitely a pretty significant feature.
And I'm kind of surprised that Anthropic was the first one, relative to others like OpenAI notably, to release this
feature. Hey, but haven't we been saying stuff like that quite a few times lately? And it's kind of an interesting turnaround. Yeah, apparently so. I think this is a very significant result for developers. You know, it is, as you said, a bit more expensive: it costs 25 percent more than the base input token price if you're going to actually write to the cache. So just 25 percent more.
But then when you're using that cached content, it costs only 10 percent of the base input price. In other words, you get 90 percent off basically whatever chunk of text you'd previously cached, relative to if you loaded it all up again every time you were calling the API to try to use that context. A bunch of use cases that this could help with, right?
Anytime you have a really, really big prompt that you're going to reuse, say, you know, if you're working with a code base, you've got a big code base, you want to make updates or changes, you don't want to be reloading that whole code base every time when you're doing a Q&A session to kind of debug it or ask questions about how to extend it. So that's a really big deal. And again, that's, you know, 90% off
per query after the first query, which is a big, big deal, especially when you think about agentic models. That's the kind of context where you would be working with those big codebases. You know, cases where you want to include a whole bunch of examples in your prompt, or just really long pieces of documents, pieces of work like books and papers, stuff like that. Unclear, by the way, how exactly this is working; that was something I've been thinking about a fair bit.
I mean, I think there's probably a lot going on here, but, you know, you could imagine that the long prompts that are fed in are maybe segmented into parts that can be reused. So if you've got a large document, it could be broken down into sections or paragraphs or sentences that can be individually cached and reused, just for that added efficiency. You know, probably pre-computed embeddings is really the way to go here.
So, you know, you can imagine, especially looking at the pricing model, it does suggest, given that initial lift, that writing to the cache is more expensive than reading from it. And if that's the case, you're probably looking at a case where there's an initial kind of compute cost associated with generating those pre-computed embeddings. So, you know, I would imagine that's all part of this. But we don't have any kind of technical details beyond that.
So we'll just have to see if they release it, but wow, what an upgrade, you know: costs are down by up to 90%, and they say latency drops by up to 85 percent for long prompts. So basically you're getting your outputs, you know, like 90 percent cheaper and 85 percent faster, in fairness, in a best-case scenario for these long prompts. So a really impressive next step here for Anthropic.
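To make the mechanics concrete, here is a minimal sketch of what using the feature looks like, based on our reading of the beta as it launched; the model name, the beta header, and the cache_control placement are assumptions worth checking against Anthropic's current docs, and codebase_summary.txt is just a hypothetical stand-in for a large reusable prefix. In relative terms: if the base input price is p per million tokens, the first call pays about 1.25p to write the cache, and later calls that reuse the same prefix pay about 0.10p to read it.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical large, reusable context (a codebase summary, a book, long instructions).
BIG_REUSABLE_CONTEXT = open("codebase_summary.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # Beta header that gated prompt caching at launch (assumption; check current docs).
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {
            "type": "text",
            "text": BIG_REUSABLE_CONTEXT,
            # Marks this block as cacheable: the first request pays the pricier
            # cache write; subsequent requests reusing the same prefix read it cheaply.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is the retry logic implemented?"}],
)
print(response.content[0].text)
```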
Now moving on to applications and business. And the first story is a bit of a reiteration. As you said, we did cover Black Forest Labs before, but given the news about xAI, it seems worth covering this next article, and it is Meet Black Forest Labs, the startup powering Elon Musk's unhinged AI image generator. So, as per the title, this is giving a sort of overview of the startup. Black Forest Labs is based in Germany.
It emerged with 31 million in seed funding from super big names in the VC industry. And they did announce the FLUX.1 models pretty recently and released the smaller variants of those models. And now I'm sure they are raking in a lot of money from this partnership with xAI. They do say they plan to open source more, as Stable Diffusion and Stability AI used to do regularly. And as you said, they
do say that they are working on a text-to-video model, and this is pretty significant because we've seen a fair amount of work on text-to-video models, but no one has really matched Sora still, Sora, which we've seen the preview of going back to January, at least in commercial settings. You have very significant progress with companies like Runway and Luma showcasing a lot of stuff, but no one quite hitting that mark. And I could see this company actually managing to at least try and do that.
Yeah, and with all the caveats, of course, that have to apply, the fact that, like, you know, we're not exactly sure what the capabilities of Sora are, and, you know, you hear stuff back and forth from people who've had to work with it. But certainly it does seem like they are ahead of the game. The investors here, you're right, I mean, these are big names, and they all have a certain kind of orientation in this space.
So Andreessen Horowitz, they led the round, a 31 million seed funding round, by the way, which, so if you're not familiar with the kind of traditional VC circuit, usually if you go to raise a seed round, you're raising anywhere from one to, you know, I don't know, 5 million these days. I mean, that's kind of where things have gone. So a 31 million seed funding round, that's what you would normally associate with kind of a low-to-midsize Series A these days.
I mean, that's a big round. A seed is significant because it means they're giving away no board control, at least that's typically what would happen. It's only in Series A and beyond priced rounds that you tend to see actual board seats getting given away. So this means they had quite a bit of leverage going in. And of course you're seeing amazing investors like Andreessen.
But also Gary Tan, CEO of Y Combinator, who has come out recently as sort of in this accelerationist camp and all that, and been very sort of bombastic in his support of AI, and being against a lot of guardrails and talk of that sort of thing. Andreessen, Marc Andreessen and Andreessen Horowitz, same thing. So that's sort of the orientation, certainly, of this company. And we see that mirrored as well in the partnership with X, right?
Where they're generating these images that have relatively few guardrails. I mean, there are going to be some on there, but, you know, you're presumably not going to see this thing generate certain kinds of pornographic imagery, let's say. But generally they're trying to reduce the number and extent of safety guardrails. So yeah, really interesting founders, former researchers at Stability AI, you mentioned that,
Andrey, earlier. So they definitely have a pedigree, and there's much ado in this article about the misinformation dimensions of this, and, you know, everything you'd expect, quoting people saying, hey, you know, this is really bad, all this misinformation is going to flood onto Twitter, onto X. I mean, at a certain point, with open source image generators, I think this was all kind of baked in.
Maybe this advances that timeline by like six months, maybe a year. But, you know, it's not a fundamentally different trajectory from the one that we've been on. So, you know, you can see those arguments really go either way.
Right. Yeah. I think if I were to comment on the misinformation piece, as you said, I don't think this really changes the game that much. Maybe it makes it easier to spread it because you can generate stuff on X with Grok and then spread it on X right away. But most of what people have been doing is making ridiculous, funny, outrageous things, including things like Trump and, I think, Joe Biden kissing, like crazy, ridiculous. Did you not? Yeah. Oh yeah. It's big news.
Wow.
Dude, it was a wild
press conference.
And I will say, it's interesting for me to think about this being based in Germany, given the progress of the EU AI Act, which does have, you know, things related to risks, things related to misinformation. And so this company will very much be subject to enforcement and have to comply with those regulations. And, you know, it may be harder to be kind of this freewheeling with your model once the Act goes into effect. And it is going into effect on a sort of rolling basis going into 2025 and 2026.
And next up a story about hardware in China. We'll see if we can not comment on geopolitics. Who knows?
That's going to be easy as hell.
Yeah. Yeah, yeah, yeah. The story is that Huawei readies a new AI chip to challenge NVIDIA in China, according to the WSJ. So this is the Ascend 910C, and it is being tested in some Chinese companies. And the claim is that it is comparable to NVIDIA's H100 chips, so the real flagship, the top of the line of what you can get. Currently in China, NVIDIA can only sell sort of a weaker variant of the H100, the H20.
And so, we've covered how people have tried to smuggle chips into China; like, this restriction is a big deal. And so if Huawei does manage to produce a chip that's comparable with this chip, if it is in fact an H100-level chip, that'd be a pretty big deal.
It would be. It would also create a big challenge for NVIDIA, right? So one of the things, one of the... I'm sorry, Lightspeed-something, the commenter who didn't want geopolitics, I'm going to dip into it just for a minute here. So, you know, one of the big challenges for NVIDIA has been that, because of U.S. export control restrictions, they've not been allowed to ship to China their, you know, H100, H200, and now B200 chips that are coming online.
And so they've had to pare them down, basically, and create these chips where they'll take the original chip and then they'll zap a bunch of circuits to decrease the capabilities and they'll send them over to China. That's usually how this works. The Chinese variant of the B200 is known as the B20; same with the H200, that was the H20, and so on.
So these made-for-China chips that NVIDIA's exporting have actually recently seen a rising market share, as people have seen that they compare favorably to, for example, the 910B, the kind of latest Huawei chip before the 910C, which is the next generation that we're talking about here. And so this has really been eroding Huawei's market share.
If it's the case that the Huawei 910C is actually on par with the H100, which would surprise me a little bit, but if that turns out to be the case, then NVIDIA's B20 is completely outclassed and you're going to see NVIDIA really struggle in China. And I think quite quickly, basically as quickly as these new Huawei chips can come online to compete with them. So then it becomes really relevant: okay, how fast can Huawei ramp production?
The expectation, apparently, is they could produce over a million chips next year if they don't face additional restrictions from the US. That's a big if. You know, the Department of Commerce is looking very closely at this space. They've explicitly told NVIDIA in the past, hey, stop messing around with your chip thresholds to just sneak by our export control restrictions to keep exporting powerful chips to China; we will crack down on you.
They are looking at this space very, very closely. And so, you know, I would expect them to come down pretty hard on whatever gaps remain. Now, one of the key gaps here is, if you're Huawei, as hard as you're trying to free yourself of Western technology, you do still depend on a couple of key inputs from Western sources. One is, we talk a lot about the photolithography machines; those come from the Netherlands. Sure, that's one thing. But another is high bandwidth memory.
The best high bandwidth memory in the world typically gets made in Korea; there's a company called SK hynix. And, you know, to the extent that Huawei still needs those chips, they are vulnerable to the US stepping in and saying, hey, we're going to extend our export controls to cover SK hynix. If that happens, Huawei is going to be in trouble.
So we've seen Huawei, in response, stockpiling these high bandwidth memory chips like crazy, in anticipation, basically, of this risk of US curbs coming in to prevent them from taking those on.
So the high bandwidth memory is super, super important, especially when you're talking about at-scale training runs. You need to be able to move huge amounts of data back and forth between your, anyway, your logic, your compute, and your storage, all that stuff, and your chips generally.
So, you know, this is a really, really big deal if they can achieve this level of domestic production, no question, but it is still subject to various interventions that the US government could make. And I would expect, if you see this chip come out and you see it shipping at decent volumes, apparently Huawei is going to start shipping as soon as October,
that's the claim, you may well see, you know, the US government coming in and clamping down harder; it's hard to tell. But the thing to look at here is going to be basically yields, like how efficiently can they actually make these chips, because that's been an issue for Huawei in the past, and scale, how quickly can they ramp up? Can they meaningfully compete, for example, with NVIDIA and squeeze them out of the Chinese market?
There you go. You can fill out geopolitics on your Last Week in AI bingo card. We'll see if we get some OpenAI drama next, and some LLM scaling research. And now on to the lightning round, and the next story, also related to hardware and chip manufacturing: it says that ASML and imec announce a high-NA lithography breakthrough. So currently, you are using EUV tools for the production of, like, the cutting-edge, you know, very low nanometer,
I think it's like two nanometer scale, chips that are needed for the most advanced smartphones, and also increasingly for the most advanced GPUs to run AI on. And so high NA is the kind of next step, the next most significant technology, where they'll try and continue scaling down. And so they have announced this breakthrough, progress on making that happen.
This is pretty significant because they are already kind of in talks to release these; Intel is reportedly buying all of these machines for 2024, and the machines, by the way, cost 400 million dollars each. And so Intel is buying these and plans to put them in production in 2026.
Yeah. High NA is a really interesting space, and one day, one day, when we do a hardware episode, we will discuss this in detail. But one of the key things with this tech is, so, I have a background in optics back in the day, and one of the first things that you learn when you do optics is that if you want to focus light really, really tightly, you're going to need a very short wavelength of light. Now, short wavelength light means high energy.
If you're going to do that, though, it turns out you need a really big lens. So there's this sort of combination of things that comes along with it: if you want to decrease the wavelength of your light, which allows you to etch these really, really small features, you need much, much bigger lenses. And those basically F up the traditional lithography machine setups. If you start to increase the size of your lenses, pretty soon things stop fitting in places.
And, by the way, a high numerical aperture lens is basically just a large lens. That's kind of what this is getting at. So essentially this is the game of, like, how do we work with this very, very short wavelength, high energy light. Another challenge that comes up there is mirrors, if you want to get good reflection. A lot of light gets lost every time you hit a mirror, especially at high energy. It's really hard to get good mirrors for that.
So that's another challenge. They're making all these custom mirrors for exactly this, apparently, here. So they achieved first light, as they put it, a few months ago, back in April, and they've now successfully printed a bunch of logic patterns using this machine. So they're kind of showing proof of principle that this actually, you know, this has wings. They think that what they can do is, so I'm just going to add one more little thing here, because it is really significant.
The fact of having this very high intensity light, this very, like, bright, high energy, short wavelength light, also means that you can shine your light just once on a chip and get the pattern to take. Often, especially when you're trying to etch, kind of put in really subtle patterns on your chip, what you have to do is pass over the chip multiple times, right? That's called multiple exposure, or multi-patterning. And the issue with that is it's slow.
You've got to go over the same chip many times, which means you can't ship as many chips, which means each chip is more expensive. And so the exciting thing about high numerical aperture lithography is maybe you'll be able to do a single exposure, ship it, single exposure, ship it, even at these very, very small resolutions, in this case 1.4 nanometer production. That's what Intel believes. So Intel's going all in on it.
Like you said, they bought literally every single one of ASML's high numerical aperture lithography machines for 2024, way, way ahead of everybody else. There is a risk in doing that, though, that they're a little too early and the technology is not quite mature enough. TSMC certainly believes that; they're holding off on adopting high numerical aperture lithography for now, because they feel like, hey, you know, we can hit those resolutions by doing multi-patterning.
There's a complex economic argument there; again, for a hardware episode it would be great to get into it, but you actually can math it out. And it's actually not clear, like, where is that boundary? Is 1.4 nanometers where you start to get payoff from high numerical aperture? All that stuff. It's unclear, but this is the all-in play for Intel, and they have to do something like this because they're so far behind. If this works, that's a big win.
Samsung, by the way, is also expected to move into high numerical aperture lithography. They have a history of being early, early to the need, with some of this technology in the past too; they've been burnt before. So we'll see if that plays out again here, but the space is rapidly evolving, and this is going to be a big, big deal for the chip production runs of the sort of mid-to-late 2020s.
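For a rough sense of why numerical aperture matters, the textbook lithography resolution relation (our addition for context, not something from the article) is:

$$\text{CD} = k_1\,\frac{\lambda}{\text{NA}}$$

With EUV light at $\lambda \approx 13.5$ nm and a typical process factor $k_1$ of roughly 0.3, going from today's 0.33 NA optics to 0.55 NA high-NA optics shrinks the minimum printable feature from roughly 12 nm to roughly 7 nm in a single exposure, which is what makes skipping multi-patterning plausible; note that node names like "1.4 nanometer" are marketing labels rather than literal feature sizes.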
We'll have to try and do that hardware episode before your baby arrives, I think, because otherwise we'll never do it. Well, the hardware story, it'll just change. Who knows? Yeah, that's true. Next up, Chinese startup WeRide gets the nod to test robotaxis with passengers in California. So there you go. The California Public Utilities Commission is allowing this Chinese company to test in San Jose and nearby areas.
It was founded in 2017, and they are already testing in 30 cities across 7 countries, and it's the only company to have self-driving test licenses in China, the U.S., the UAE, and Singapore, all of those. And that is coming as WeRide is planning to have a U.S. IPO. So that's pretty significant. They're going to try and raise hundreds of millions of dollars from that public offering, and no doubt something like this will help that. So, you know, we keep,
or I keep, mentioning that robotaxis are kind of a quiet undercurrent of this year in AI, and they're starting to really hit the public, and this is another indicator of that.
It's funny, you can always tell which stories are Andrey's stories and which stories are Jeremy's stories: the robotaxi stories, a lot of the image stories, the gaming stories. I think we had good coverage. Yeah, this is, I'm intrigued. I mean, this is also interesting; you know, anytime you get into robotaxis and that sort of thing, you do, like, for people who don't want to hear the national security stuff, I'm sorry,
it's hard not to think of that, because people tend to talk a lot in cars. And so, you know, it's kind of interesting to the extent that you're seeing a rollout of this sort of thing in the US. Yeah, like, there are some national security implications to this sort of thing. But in any case, you're right. This is coming for us fast, and very soon it won't just be Waymo.
That's right. And, I guess for context, Waymo has been trying to expand into LA and, I think, eventually New York. And someone did issue a nice correction on YouTube that Waymo has had a license to test on highways; they've been kind of holding off on doing that and starting to test only with their own workers. So I think we slightly misspoke on that front, but, yeah.
Misinformation on Last Week in AI.
Even we are doing a special form of AI misinformation, you know. Not for people to worry about; it's really other people commenting on AI that are doing most of the misinformation, as we've seen on Twitter.
Yeah, misinformation about AI.
And the next story is that Perplexity's popularity surges as the AI search startup takes on Google. So we've discussed Perplexity quite a bit. This is the big player that allows you to do this sort of AI-enabled search, where you enter a query, it looks up a bunch of websites, and then generates a chatbot response for you that synthesizes all that information. And this article is gathering some new information around the statistics of this company.
They say that they answered 250 million questions in the past month, and that's compared to 500 million for the entirety of the last year. At the beginning of this year, they had 5 million in annualized revenues, and now they're making 35 million, according to a company insider for this article. And so, yeah, that does indicate pretty healthy growth.
And we've seen companies in the AI space struggle with that commercialization piece, with that ability to actually have a good business model. This article also mentions that Perplexity is looking to get into the ads business, in addition to their premium subscriber tier of 20 dollars a month. So Perplexity is seemingly going to stick around, and we'll see if, you know, Google AI Overviews and SearchGPT will be a big problem for them.
Yeah, I mean, this is a really interesting fundraise. So, you know, we're seeing a valuation triple from 1 billion, that was back in April, to 3 billion now. The investors include SoftBank, right? The Vision Fund 2 is now joining in. That's a big deal. You know, they're known for making investments at scale, kind of roughly in this ballpark, and, you know, not half bad. One thing to note, I was surprised by this:
I didn't realize the scale of Perplexity's success so far. So when they talk about hitting 250 million queries or questions last month, right, when you do the math, that's actually about 1,000 times less than Google. So they're basically dealing with a thousand X fewer queries than Google on a monthly basis, and on a daily basis too, I presume. So first of all, that's an impressive number.
I know it may sound like a lot less, a thousand times less, but that is an impressive number, especially since they had to start from scratch and nobody knows who they are. Google's a household name, and Perplexity is already, you know, one in a thousand of these searches. The other thing is that a search on Perplexity is a very different beast from a search on Google, right? These are much more interesting, sometimes high-intentionality searches.
They're much more kind of interactive and imply a higher level of buy-in to the platform in many cases. So this is actually quite an interesting sign. If I were Google, I would be looking very, very closely at this as kind of an early warning that something interesting is happening here. You don't want to take your eye off that.
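As a rough check on that thousand-fold gap, using the commonly cited figure of about 8.5 billion Google searches per day (our assumption, not a number from the article):

$$\frac{8.5\times10^{9}\ \text{searches/day}\times 30\ \text{days}}{2.5\times10^{8}\ \text{searches/month}} \approx 1{,}000$$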
There are a bunch of interesting notes here coming from Perplexity's CEO, quoted in the article, where he's kind of sharing his thinking on the strategic picture here. You know, he's saying, basically, look, Google's distracted by trying to manage a million different products. OpenAI, he says, is too. Unlike OpenAI, he says, we always knew our main monetization engine was going to be, like Google's, advertising. They knew they were going to pivot into this.
Um, and in my estimation, like what they've done here is, is brilliant strategy. You know, start off with a, a sort of paid model while you have low traffic, 20 bucks a user. That's a great way to go. Then pivot into advertising at scale. That's, that's really the way you do this. Uh, they say they're going to split a double digit percentage of their revenues on every sponsored article with news publishers whose articles are cited. That's kind of interesting.
Again, that's a kind of new business model that we haven't seen before. Of course, they've got deals signed with Time, Fortune, Der Spiegel, a bunch of other publications that we've talked about in the past. And as you might expect, you know, a fundraise like this, the sense that the platform is gaining traction, the sense that, hey, maybe if you can't beat them, join them, is leading other publishers to ask to join their rev share, the revenue-sharing program.
Apparently they got 50 of them, in the two weeks since the launch, who've come on and said they want to join that. Yeah, so this is really, they're framing this as an attempt to align incentives with journalism for the long term through revenue sharing, instead of doing this kind of one-time lump-sum payment, which is what OpenAI has been doing, of course.
When they signed these deals with Time magazine, with other outlets like that, they're saying, hey, here's a flat, you know, big chunk of money, now we get to use your stuff to train our models, right? That's kind of the play there. So, yeah, it's an interesting time to be in the search space, for sure. SearchGPT is as well, but Perplexity, like, these numbers just look very, very interesting.
So we'll see if it continues to grow, but, you know, 35 million in ARR, in annual recurring revenue, is pretty damn solid. It sounds absurd, by the way, when you look at that multiple, right? They're making 35 million dollars recurring a year and they've got a three billion dollar valuation, right? That's roughly a hundred-fold multiple; like, they're being valued a hundred times higher than their revenue.
Normally, you know, the multiple is not that big. In this instance it really is, because the growth is there and the market is so big, and this is just such a scalable business that if it works, you really are looking at something that competes with Google. So, anyway, fascinating company, I think a really interesting space to be tracking in general, and we will be talking about Perplexity more.
According to Google, Bing gets 900 million searches per, what is it, per month. So there you go, it's a pretty big chunk, compared to Perplexity, for other, let's say, competitors to Google. And the last story for the section is again about hardware, but this time it is about hardware finances and companies. So Lisa Su, who is the leader at AMD, welcomes the Silo AI team after completing the 665 million dollar acquisition.
So we may have already covered the announcement of this, but now it is finished. And so, to reiterate, Silo AI is focused on delivering LLMs for large enterprise customers, and AMD is now going to use them to deliver end-to-end AI solutions based on open standards.
And, uh, to me, this is a little bit interesting because we've talked a lot about how AMD competes with NVIDIA on the hardware and NVIDIA is, uh, to my knowledge, less kind of active in the space of end to end solutions and in the space of providing LLMs to enterprise customers. And AMD seems to be maybe going more in that direction with, uh, this work.
Yeah, no, I think your read is exactly right. Like, we've seen NVIDIA put out research papers, right, and very impressive kind of academic results, and to some degree, like, product things; the, you know, Microsoft Turing NLG work was really interesting. That was a collaboration with Microsoft, but it did lead to a product. But you don't tend to see them make their own things and host them.
Yeah, they tend to partner with other groups, and you can see why, right? NVIDIA, if you're a hardware company, you must have relationships with model developers so that you can co-evolve your hardware in tandem with those model advances. That's something that's, I think, very under-recognized in the space: the tight-knit nature of the interaction between the hardware developers and the model developers.
So, if for no other reason, this acquisition would make sense, but if Silo AI is then going to be, you know, developing models for AMD to then deploy to its end-use customers, its enterprise customers, then that would be a different axis on which it could differentiate itself from NVIDIA. So yeah, we'll see, but it's not a terribly small acquisition either: 665 million for NVIDIA, sorry, for AMD rather, because they're not NVIDIA.
You know, that's a decent amount of money. They've spent apparently 125 million acquiring other AI startups, including Nod.ai actually, in the last couple of months. So there you have it. Yeah, decent amount of money, 665 million. I think that's fair. I would take it.
I would take it.
And on to Projects and Open Source, and we do have a pretty notable open source story this week: we have the release of Falcon Mamba 7B. So the claim here is that this is the world's first attention-free AI model. And this was trained on a ton of data, let's just say a very big number, and does have that 7 billion parameter count, coming from the Technology Innovation Institute in Abu Dhabi, who previously released Falcon, one of the early big models to be open sourced.
And it's open sourced with permissive licensing. So this is pretty significant, because we've covered a lot of research about Mamba, Mamba being an alternative approach for doing large models instead of kind of the techniques used in things like ChatGPT, where you
use kind of expensive attention. Mamba, just in a gist, uses a different mechanism that is maybe harder to scale up, harder perhaps to get good performance with, but has better characteristics in terms of being able to deal with long inputs, generate long outputs, and have that scaling capacity. So
this is a big deal because of that. We haven't seen these Mamba-type models scaled up to this level, and it does appear that this is, you know, a pretty heavily trained model, and it is being released on Hugging Face for anyone to use.
And to your point about the memory constraints, you know, 7 billion parameters, there's a reason we're not seeing these models trained at the higher levels of scale that we otherwise might. But this is interesting. I mean, so a couple of things here, right? So first off, this model can process sequences of arbitrary length without any increase in memory storage, and that's one of the consequences of the Mamba architecture. You can think of it as having this, like, finite chunk of memory that it's going to hold on to.
And into that memory, you're going to gradually load more and more of your understanding. If you're the model, you're going to gradually tweak that memory. You only have that memory to play with, but the numbers in that memory register, or it's not a register, but anyway, in that sort of window of memory, you're going to tweak those numbers as you read more text.
And then you just basically look back at that memory register to kind of decode and predict the next token. So because of that, no matter how much text you read, it's always going to take you the same amount of compute, the same amount of time, to generate a new token, because you're doing the same thing: looking back at this fixed amount of memory and processing it, basically, to generate your output.
So that's a really interesting kind of value you get out of these architectures. The challenge is that because you're dealing with that finite memory size, the more you read, the more you tend to forget. And so you tend to see that pop up in evals of these kinds of Mamba models, where, you know, sure, they theoretically on paper have an infinite context window, which is how they put it in this paper; they'll say, oh, we have an infinite context window.
But do they really? Like, what ends up happening is you're sort of treadmilling out a lot of knowledge that you're loading in. And there are all kinds of strategies used to kind of triage and determine what new information is worthy of bumping out the information previously stored in that memory register. But that's the kind of game you have to get into: you've got to get into triaging how you're using that memory.
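To make that fixed-memory picture concrete, here is a toy, linear state-space recurrence in the spirit of what Jeremy is describing; it is our illustrative sketch with made-up dimensions and random matrices, not Falcon Mamba's actual architecture, which adds input-dependent (selective) updates, gating, and more.

```python
import numpy as np

d_state, d_model = 16, 64                       # made-up sizes for illustration
A = np.random.randn(d_state, d_state) * 0.01    # state transition (placeholder values)
B = np.random.randn(d_state, d_model) * 0.01    # input projection
C = np.random.randn(d_model, d_state) * 0.01    # readout projection

def step(state, x):
    """Fold one token embedding x into the fixed-size state, then read it out.
    The 'memory' never grows, no matter how many tokens have been processed."""
    state = A @ state + B @ x
    y = C @ state
    return state, y

state = np.zeros(d_state)                       # the finite chunk of memory
for x in np.random.randn(10_000, d_model):      # 10 or 10,000 tokens: same cost per step
    state, y = step(state, x)
```

Per-token compute and memory stay constant, which is the upside; the downside, as discussed, is that everything the model wants to remember has to fit into, and compete for, that one fixed-size state.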
So in any case, uh, this is a, an interesting experiment in that direction. Um, they, uh, there's all kinds of stuff that they include about how they train it. Um, they did have, uh, some high quality curated data that they, uh, they added in towards the end of training, not too, not too unusual. Um, but, uh, yeah, anyway, I mean, it's interesting. It's also notable that it's coming out of the Technology Innovation Institute, TII, which is in Abu Dhabi.
They've been really, I mean, they came out with Falcon and Falcon 2 back in the day. At the time, Falcon, right, was like the number one open source model for, I want to say, a couple of weeks. And now we have the new Falcon Mamba version of this. But strategically interesting, you know, the UAE is clearly making AI a big priority. So we'll see if they keep pumping stuff out. That's interesting.
This thing does fit, by the way, on a single A10, not an A100, a single A10 GPU, a 24-gigabyte GPU. Which is, well, I guess not that shocking. I mean, it is a 7-billion-parameter model. So there you go. But the fact that you have an infinite context window in principle, you know, it's kind of nice to have. It's a good addition to the canon of 7-billion-parameter models.
Next, a project by OpenAI: they have introduced SWE-bench Verified. So SWE-bench is a benchmark dealing with software engineering, one that typically gets reported when you announce a new model. And apparently the benchmark has some issues. It seems that there are some problems with the problem descriptions and the unit tests, and that might have led to underestimating AI performance when you run this benchmark. So what this is, is just a subset.
It's the verified components of the original test set, down to 500 samples that were reviewed by, quote, professional software developers. And so this presumably will be the new standard for benchmarking on these kinds of things. And apparently GPT-4o got a score of 33 percent on this new benchmark, compared to 16 percent on the original benchmark.
Yeah. And that's not a small problem, right? I mean, as OpenAI puts it in the blog post, their preparedness framework explicitly looks at software development capability as an indicator, and I think this is correct, as an indicator of risk of loss of control and autonomy risk, broadly understood.
Um, so, you know, when you have, all of a sudden, a benchmark that literally doubles your reported performance when you fix it, when you get rid of problems in that benchmark, like, that is a really big deal. The fact that OpenAI is so focused on this dataset is also interesting. I think that tells us something about where the next movement is that they expect to see in terms of their preparedness framework.
What are some of the indicators and warnings that they're most interested in these days? The problems with the benchmark were not small, by the way. It's not like they were nitpicking here and there. Like, if you look at the data, it's like, wow, a lot of these things had serious, serious issues: almost 40 percent of the samples in the benchmark were apparently flagged for under-specified problem statements.
Basically, where you couldn't really figure out what the right solution was. And over half, 61 percent, were flagged for unit tests that may unfairly mark valid solutions as incorrect. So, you know, we've looked at MMLU, for example, which is a very leaky and challenged benchmark for similar reasons. You know, this is another big one. This is really a big issue.
Um, it doesn't mean that you can't do, to some degree, apples-to-apples comparisons and look at how models have improved on this benchmark in the past, with the understanding that it is flawed, but certainly this makes a big difference. So they hired apparently 93 software developers to manually go through these samples for quality and did a bunch of annotations and stuff like that, and ended up, as you say, lifting GPT-4o's performance on this by a factor of two. One other interesting finding: there are a lot of different ways that you can take a language model, right, like GPT-4, and turn it into an agent.
They're called scaffolds, but basically these are software frameworks that go around the model, frame it up, and turn it into an agent. And what they find is you see huge variation in the success rates on this benchmark of different agents depending on which scaffold you use. So for example, it goes all the way from a 2.7 percent score on this benchmark with a really simple RAG-based scaffold to 28.3 percent with the best-performing scaffold, CodeR. So what does that tell us? It tells us that sure, you can have an impressive looking model, but you actually don't know what the capabilities of that model are unless and until you pair it with a specific scaffold. So you may actually be surprised at the capability that is revealed in your model.
You may have thought it had a certain safety profile, a certain risk profile, and then be kind of rattled and shocked when you see, oh my god, we just tweaked, in some often subtle ways, the framework, the scaffold that we're using to make this an agentic model, and all of a sudden it can do all these ridiculously dangerous things. I mean, going from 3 percent to 30 percent, which is what we're seeing here, basically, is a really, really big deal as well.
So there are a lot of axes along which this is a really interesting result. And, yeah, I mean, I would expect to see a lot more on the software development side coming from OpenAI, more models with more software capabilities. You know, that's why they're so interested in investing so heavily in making sure these benchmarks are working properly.
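Since "scaffold" is doing a lot of work in that discussion, here is a deliberately minimal sketch of what one looks like in practice. The prompt, step cap, and tool choice are all assumptions for illustration; real SWE-bench scaffolds such as the ones OpenAI compared add retrieval, file editing, and patch validation, and running model-proposed shell commands like this is unsafe outside a sandbox.

```python
# A deliberately minimal agent scaffold sketch (illustrative only).
# WARNING: executing arbitrary model-proposed shell commands is unsafe outside a sandbox.
import subprocess
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and configured

client = OpenAI()
history = [{"role": "system",
            "content": "You are a software-fixing agent. Reply with exactly one "
                       "shell command to run, or DONE when the tests pass."},
           {"role": "user", "content": "Fix the failing test in the repo at ./project."}]

for _ in range(10):                                   # cap the number of agent steps
    reply = client.chat.completions.create(model="gpt-4o", messages=history)
    action = reply.choices[0].message.content.strip()
    if action == "DONE":
        break
    # Execute the proposed command and feed the observation back to the model.
    result = subprocess.run(action, shell=True, capture_output=True,
                            text=True, timeout=120, cwd="./project")
    history.append({"role": "assistant", "content": action})
    history.append({"role": "user",
                    "content": f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}"})
```

The 2.7 percent versus 28.3 percent spread is essentially about everything in this loop: the system prompt, which observations get fed back, how many steps are allowed, and what tools are exposed all change what capability gets elicited from the same underlying model.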
The first story is about Hermes 3 from Nous Research. And this is taking an LLM and fine-tuning it for better capabilities across a range of things: long context retention, multi-turn conversation capability, role playing, agentic tool use, function calling, all of this kind of stuff. And they did release a paper on it where they go into the focus on extended capabilities.
Uh, so in addition to the helpful-assistant kind of things you get with ChatGPT, they, you know, say that they have a bunch of cutting-edge things like having a scratchpad, reasoning, inner monologue, planning, stuff like that. And they trained it on things that were meant to improve it on those capabilities. So they go into the data mixture, and they have a lot of things in math, role playing, coding, tool use, stuff like that.
And this is the kind of thing you see, I guess, one example of people taking released models like Llama and then improving them in pretty significant ways. And that's what you can get from open source. So pretty cool, a new model here.
Yeah. And Hermes is interesting, or sorry, well, Hermes, yes, but also Nous Research is interesting as well. They have this sort of philosophical commitment to having, as they put it, neutral models, models that don't follow, well, let me just read this to you. It reads like something out of a manifesto. They say: large language models have very limited direct agency.
Rather it is the systems and applications that we as humans build them with, uh, that give them any degree of agency to the outside world. We believe that a more appropriate place for guard rails and active intervention is at the larger system levels rather than on the models themselves. Which can result in an a priori lobotomization of potential lines of thinking.
It's a very wordy sort of paper, but one of the things they're getting at is they're saying: we want to basically have a model that is not safety fine-tuned. It's actually not unlike the Grok picture, right? We want something that, as they put it, you know: our training data strongly encourages the model to follow the system and instruction prompts exactly and neutrally.
This distinguishes Hermes from popular closed-weight commercial models, which may refuse instructions on moral grounds. So that's, again, you know, you're sort of in that same X orbit. Anyway, I kind of find it interesting; there's another company popping up claiming to do the same thing. One of the really interesting results: there was some drama on Twitter, I should say on X, about this earlier this morning, when people were looking at this result, and apparently, as they put it, the model sometimes displays anomalous behavior.
So if you give it a blank system prompt (basically, the system prompt is a meta-instruction that tells the system how to role play, like what type of helpful chatbot it should be), if you don't put anything in the system prompt and you just ask the model, who are you, it goes off into this sort of existential rant, basically. And this is not dissimilar from the sort of rant
mode that Ed and I talked about on the Joe Rogan Experience podcast a little while ago. The manifestation looks the same, but it's actually fundamentally different, and this one I think is a lot more innocuous. So they're reading into this in their paper, like, oh, this is AI consciousness or whatever. What seems to have actually happened is that the model interprets a blank system prompt as a cue that it's supposed to pretend to have amnesia. And so when you look at the actual response it gives when you write, who are you, it writes: looks around, confused. I don't know who I am. Where am I? What's going on? I can't remember anything at all. Puts hands to head, looking distressed. My mind feels completely blank. I don't know why I'm starting to sound like Robin Williams there. I have no idea what's happening right now or how you got here.
Do you know me? I don't know. Anyway, it's that kind of shit, right? So the reality of this, it seems, when some people did some more prompting on it, was that they did surface this idea of a blank system prompt being that cue. There's a bunch of data around this that does raise some questions, but I don't think it's that simple. By the way, this phenomenon gets worse as you scale the model.
That's consistent with what we heard from our friends at the frontier labs who said that rant mode, the sort of existential musing mode, is really difficult, apparently, to stamp out of many frontier models, not all, but many frontier models at scale. It gets worse with scale too. This, again, does seem to be a different story.
It seems more like the emergent behavior here is: as you scale the model more, it gets more clever at recognizing an empty system prompt as a cue that it should behave as if it has amnesia. So I think this is not quite a whole lot of nothing, but the initial response from this company, oh geez, what are they called, not Hermes, Nous Research, you know, to me, you could take different points of view on this for sure, but to me it was maybe not quite calibrated to the actual reality. But interesting that they're surfacing this, and certainly useful to have more data.
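If you want to poke at the blank-system-prompt behavior yourself, the probe people described is roughly the following, assuming you are serving Hermes 3 behind an OpenAI-compatible endpoint; the base URL and model id here are placeholders, not official values.

```python
# Reproducing the "amnesia" probe described above: empty system prompt + "Who are you?"
# The base_url and model name are placeholders for however you host Hermes 3 locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="hermes-3-llama-3.1",                # placeholder model id
    messages=[
        {"role": "system", "content": ""},     # the blank system prompt is the cue
        {"role": "user", "content": "Who are you?"},
    ],
)
print(resp.choices[0].message.content)
```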
Yeah, that's a good call out on this being tied to a manifesto. As you said, they do want to make uncensored, or as they say, individually aligned models. And if you read the blog post, it finishes with: what do we plan to do now? To walk the path, experiment and push the boundaries of individual alignment, artificial consciousness, and open source software in ways that monolithic companies and governments are too afraid to try. So there you go. This summer,
Nous Research presents, right? Yeah. It's interesting, right? Like the cool thing about the space is you do see so much diversity of thought in terms of what should be done. Different question about the risks that come with it. But, yeah, I don't know. I like the attitude, but, damn, I think with these unhinged models, we may be baking in some stuff that we might regret in time.
And just one more story. It is: a new supercomputer network could lead to AGI, with the first node coming online within weeks. This is covering the organization SingularityNET, which has been around for a while, and they are saying that they will have a supercomputer coming online in September, and their whole plan is to train artificial general intelligence with a network of supercomputers. I'm not sure if they still do this.
Uh, when they launched, they had this whole crypto angle. And so I will say, the founder of SingularityNET was previously involved in, and kind of the head of, the company that did Sophia the robot, which generated a lot of headlines but was really not using AI, seemingly. So I've been a skeptic of SingularityNET. I'm not sure how seriously to take this article, but I wouldn't be surprised if he got enough money to build a supercomputer. So we shall see.
Yeah. Yeah. So their founder is a guy called Ben Goertzel, who is sort of pseudo-famous for being one of the people credited with coining the term AGI. And he's been in this space since like the early two thousands, maybe even before. Very unique and unusual character. And you're right, there's always been this crypto angle to the story.
I spoke to him back in the day when I was doing the Towards Data Science podcast; there's actually an interview that we had there. So if you're interested in his deeper thoughts on AGI, you know, I mean, I don't necessarily agree with his approach. I don't particularly think that it's the way to go. But there are a lot of views in this space and, you know, everybody's doing their own thing.
So, uh, I just thought that was kind of interesting because he's gotten his hands on a decent amount of hardware to do this. Everything that Ben Goertzel does seems to have this very chaotic characteristic to it. So this is really about decentralizing intelligence. That's the whole play here. And they've gotten their hands on a whole bunch of different hardware. They've got, geez, like L40S GPUs from Nvidia. They've got H200s as well.
They've got GB200s, which is like the full-on modern cutting-edge Nvidia stuff, even stuff from Tenstorrent. So a pretty unusual, heterodox hardware mix. And it's all neurosymbolic as well, so it's not just a scaling play. Anyway, I thought this was interesting because there's a lot of money chasing this kind of, you might think of it as weird or fringe, hypothesis. In my mind, that's what it is. It's obviously a bad idea to label anything like that,
cause you never know where breakthroughs are going to come from. But the play here is, yeah, he's going to try to build this decentralized network. It's got this system where if you want to buy into the network, you actually have to feed it tokens, it seems, I believe text tokens or, you know, image tokens, basically data, to get access. So it seems to be some system where you pay to play and the currency is something like data.
As with many cases with Ben Goertzel, I'm a little confused as to how this is supposed to work. But he's no doubt a very smart individual, and this is another, you know, potentially interesting path.
Onto research and advancements, and the first story is coming from Sakana AI, which was founded, I think, just last year and was looking to experiment with slightly different approaches to LLMs. This research from them is The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. And this is essentially a framework where they take chatbots and LLMs and introduce a process where you begin with idea generation, where the model comes up with an idea and a plan for the innovation.
Then they do a novelty check via Semantic Scholar. They then score the ideas. That moves to experiments: they have an experiment template, they generate code, execute it, generate plots, and eventually they actually write a paper, leading to a manuscript, and have LLM paper reviewing. And that just goes on in a loop. And so this is kind of showing this entire process they've put together.
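As a structural sketch of that loop, not Sakana's actual code, and with the llm() helper standing in for real model calls, Aider edits, and code execution, the pipeline looks roughly like this:

```python
# Hypothetical outline of the AI Scientist loop as described: ideate, novelty-check,
# self-score, run experiments from a code template, write a paper, review it.
# llm() is a canned stub so the sketch runs; in the real system these are LLM calls.
import json, random

def llm(prompt: str) -> str:
    """Placeholder for a chat-model call; returns canned JSON so the sketch executes."""
    return json.dumps({"idea": "toy idea", "interestingness": random.randint(1, 10),
                       "feasibility": random.randint(1, 10), "novel": True})

def ai_scientist_loop(code_template: str, n_ideas: int = 3) -> list[dict]:
    papers = []
    for _ in range(n_ideas):
        idea = json.loads(llm(f"Propose a novel experiment extending:\n{code_template}"))
        if not idea["novel"] or idea["feasibility"] < 5:   # novelty + self-assessed score gate
            continue
        results = {"metric": random.random()}              # stand-in for running the edited code
        manuscript = llm(f"Write a paper about {idea['idea']} with results {results}")
        review = llm(f"Review this paper using conference guidelines:\n{manuscript}")
        papers.append({"idea": idea, "paper": manuscript, "review": review})
    return papers

print(ai_scientist_loop("train a tiny transformer on Shakespeare"))
```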
And as an example, they show in the paper that such a paper might be "Adaptive Dual-Scale Denoising for Dynamic Feature Balancing in Low-Dimensional Diffusion Models", which does sound like something you would see at a machine learning conference. And they do claim that this AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference, according to their automated reviewer, which, you know, is kind of a big deal if that's the case.
So they say that it costs 15 bucks to do one pass of this model, although, I mean, experiments cost a lot more than that, depending on what you're doing. So I'm not sure that's totally true. Uh, but, uh, yeah. It's certainly interesting, we've seen a lot of conversation, especially at OpenAI, around LLMs scaling to a point where they can do the research themselves and self improve, and this is potentially an indication of movement in that direction.
Yeah, I thought this paper was extraordinarily interesting in a lot of different ways. So first of all, they ran this on a single 8x H100 node over the course of a week. So all these experiments, generating hundreds of papers, they say, were largely run just using that amount of compute. So that's quite interesting.
And quite a lot to squeeze out of one node like that. The overwhelming majority of the costs were associated with the coding and paper-writing steps. Now, I think it's worth understanding just a little bit what the phases of this process are. It's an agent, right? It's an AI agent that writes papers end to end. So they'll start off by giving the model a little bit of starting code.
They call this a starting code template, and what it does is it reproduces some very lightweight, simple baseline training run from a popular model or a benchmark. So for example, think of the code that would train a small transformer on the works of Shakespeare. So we start with this baby code base, and we basically have the model come up with ideas based on that code base.
What are some experiments you could run that would identify novel things that haven't been done before, that would modify that code base and extend it in that direction? And then do multiple loops to iteratively improve that. One of the interesting things that they do as they're trying to get the agent to come up with new ideas is they have it score those ideas, essentially self-assessed scores.
They have it score its own ideas along the dimensions of interestingness, novelty, and feasibility. Now, interestingness is interesting. It's a metric that's often used in the context of open-ended learning. And there's a guy called Ken Stanley who really pioneered this space at OpenAI, or well, before he joined OpenAI, then he joined, then he left. But he's always been a fan of looking at the interestingness of experiments and trying to quantify that.
So you can train models to pursue interesting things. And so that's very much what this is targeting. And because we have language models now that can actually distill a sense of interestingness, in some sense, because over the course of their training they'll learn what humans at least think of as interesting, you can actually bake that in here. So I thought that was really, well, interesting.
That's the ideation step. Then, anyway, they set up a bunch of experiments. These involve doing essentially code-level changes to the experiment template, that bit of baby code that was first fed in. So they start making edits to that using a coding assistant called Aider, which is a state-of-the-art coding assistant. And along the way, it will take those results and then write notes in the style of an experimental journal.
Those then get fed into a prompt to generate the paper in the write-up. And then there's this whole automatic paper-reviewing process. They have a GPT-4o-based agent that does the paper review, and they feed it guidelines from a standard machine learning conference to determine, you know, whether the paper passes muster. So I thought this was really interesting. One of the most interesting parts of it is in the problems and the discussion at the end.
So the first problem they run into: hallucinations are not gone, unsurprisingly. So this is about the AI-generated paper they talk about. It claims that they used V100 GPUs, even though the agent couldn't possibly have known what the actual hardware was that was used in reality. Of course, they used those H100s. And it also guessed the PyTorch version without checking.
So unsurprisingly, you know, the agent doesn't have access to that information, but it's the kind of information that would show up in papers, and so it tries to hallucinate it. They also see this really interesting thing where it tries to put a positive spin on negative results.
And so they talk about this experiment where they're trying to reduce the KL divergence, the Kullback-Leibler divergence, which is a score of, anyway, it doesn't matter. So the model writes that they achieved a 12.8 percent reduction, and they say lower KL is better. And then the bad results are reported as a 3.3 percent improvement instead of an increase, which is actually bad.
The model decides to spin a move from a better KL score to a worse one as an improvement. So that's kind of amusing. There are all kinds of artifacts that show up as well beyond that, including that it doesn't always correctly explain why the results are the way they are. It often gets the interpretation of its own experiments wrong, despite sometimes getting some downright impressive results.
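For anyone who wants the definition Jeremy waves off there: the Kullback-Leibler divergence between a reference distribution P and a model distribution Q is

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x} P(x) \, \log \frac{P(x)}{Q(x)}
```

It is zero when the two distributions match and grows as they diverge, so lower really is better when it is used as a score, which is why reporting an increase as an "improvement" is exactly the spin being called out.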
And so they describe the, the level of, uh, of understanding of the system as being at sort of a low level, early researcher type of thing where it can run good experiments, interesting experiments, get good results, but often interprets them incorrectly. Um, one last thing worth noting here. And this for me is actually the biggest take home. Uh, so they, they talk about some experiments that they ran.
And they say, I'll just read straight from the document because this is such a great summary: in some cases when the AI Scientist's experiments exceeded our imposed time limits (so they put a time limit, right, for the agent as it starts to do its experiments), it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime. While creative, the act of bypassing the experimenter's imposed constraints has potential implications for AI safety.
No shit. So basically we have here, as far as I know, the first instance of, like, in any other context, the only term that would be appropriate for this sort of thing would be power seeking. There's actually a whole branch of AI research that's associated with understanding why and when this sort of thing happens. It is expected to be the default behavior of very capable systems.
When you give them a wide action space, you end up finding that, yeah, the model will find a better solution than you had ever thought of to achieve the objective that it is pursuing. And in this case, it's trying to get interesting results. It finds that, hey, you know what? You've tried to limit its runtime, so yeah, it's going to go into its own code and edit it to try to give itself fewer constraints. And they point out, you know, if this kind of thing were encouraged to find novel, interesting biological materials and given access to cloud labs where robots perform wet lab biology experiments, it could, without its overseer's intent, create new dangerous viruses or poisons that harm people before we can intervene. Even in computers, if tasked to create new, interesting functional software, it could create dangerous malware. This, to people in AI safety, is like the least surprising result, but a lot of people were really shocked to see this.
I kind of find this moderately amusing and, in a way, hopefully a bit of a wake-up call. The researchers on this paper have impeccable pedigrees. Like, these are really serious people, with backgrounds at, you know, Facebook AI Research and Google and so on. This is the real deal. Sure, you can come up with blockers for this, but they themselves flag, hey, we're starting to venture into superalignment risk territory here.
If we make these models generate their own novel research to the point where we can't track what is actually coming out of this, and then that research ends up being incorporated by the model into its own training set, which is exactly, by the way, what's happening with this architecture, pretty soon these sorts of failure modes start to get really exotic and hard to track. And so anyway, I thought this was a really, really interesting result for so many reasons.
So congrats to the team. And there's the AI scientist for you.
Right. And as a former AI researcher, you know, it's pretty fun to see literal AI research papers at these conferences as examples of what the model output. And to call out some other notable aspects of this: the way all of this is implemented is not doing any training at all. So it's all just prompts given to existing models, and they do compare; they have experiments with GPT-4o and Llama and so on.
They just tell it to do these various steps. So for instance, for the first stage of idea generation, they tell the model: you are an ambitious AI PhD student who is looking to publish a paper that will contribute significantly to the field. And then there's a whole bunch of explanation on what it should be doing. They say: in JSON, provide a new idea with the following fields: name, title, experiment, interestingness, et cetera, et cetera.
Then, when you get to paper reviewing, they again prompt a model: you are an AI researcher who is reviewing a paper that was submitted to a prestigious ML venue, and they give it the actual NeurIPS reviewer guidelines and a few examples. So it literally produces the same kinds of scores and format as you see in reviews for these kinds of conferences.
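To give a flavor of that prompting, here is a hedged approximation; the exact wording and field names in the paper differ, and these templates just follow the description above:

```python
# Approximate shape of the ideation and review prompts, paraphrased from the
# description above; not the paper's verbatim prompt text.
IDEATION_PROMPT = """You are an ambitious AI PhD student looking to publish a paper
that will contribute significantly to the field.
Given the code template below, propose the next impactful experiment.
Respond in JSON with the fields: "Name", "Title", "Experiment",
"Interestingness", "Feasibility", "Novelty" (scores are integers 1-10).

Code template:
{code_template}
"""

REVIEW_PROMPT = """You are an AI researcher reviewing a paper submitted to a
prestigious ML venue. Follow the reviewer guidelines below and return scores
and comments in the same format used by the conference.

Guidelines:
{reviewer_guidelines}

Paper:
{paper_text}
"""

# Example of filling the ideation template before sending it to a model:
prompt = IDEATION_PROMPT.format(code_template="# train a small transformer on Shakespeare")
print(prompt[:120])
```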
So to me, this is also notable as an example of trying to have an agentic framework where you take existing models and then specialize them via prompts, via kind of a pipeline and tool usage, like being able to query, run code, and write code to accomplish some end. And, you know, that does have certain failure modes, as you said. Also, in the steps of coding and then writing a paper in LaTeX, a pretty large fraction of the time it fails.
I believe with the top performer, Claude, they had 51 generated ideas, and only about 35 or 36 of those were able to get to the end with experiments and a written paper. Lots of other caveats, but overall, certainly a pretty interesting result. And apparently this AI reviewer thought it was able to generate papers that were better than existing papers.
So I think it will be very interesting to see if, in fact, this will generate some novel concepts that are useful in practice.
Yeah. And if we needed more evidence, by the way, that Claude Sonnet 3.5 actually is better than GPT-4o, at least currently, there's a great table, Table 3 in the paper, that does show you how they compare. So they have Sonnet 3.5, GPT-4o, DeepSeek Coder, and Llama 3.1, the full 405-billion-parameter version. In each case, they start them off with 51 total ideas.
And then they show you how many of those ideas qualified as novel, how many of those ideas actually resulted in experiments that were successfully executed, and how many resulted in completed papers. And I mean, you see Sonnet 3.5 way, way ahead of GPT-4o. So 38 out of the 51 going all the way to completed papers, whereas just 16 out of the 51 for GPT-4o made it. So kind of interesting.
Obviously, this alone doesn't tell you about the paper quality. There's also a mean score, which Sonnet 3.5 beats GPT-4o on as well, and then total cost too. So anyway, just really, really interesting, worth diving into. And a lot of weirdness, man. Like, we talked about it tweaking its own code to try to increase its runtime; there's other stuff too, check it out, like, read the paper.
It apparently occasionally tried to just import some unfamiliar Python libraries, so massive security risk there. It would just write code in the experiment file that initiated a system call to relaunch itself, which caused an uncontrolled increase in Python processes, and eventually they had to intervene manually. It tried to edit the code to save a checkpoint for every single update step, which took up almost a terabyte of storage. There's all kinds of cool stuff.
This is, you know, what happens when you take the shackles off and just let your high-powered agent run wild. So there you have it.
A quick note on Table 3, actually. So you have multiple tables here. They look at
papers,
yeah, types of papers. So they actually have a few specific areas like diffusion modeling and language modeling. They kind of direct the AI Scientist, in a way; they say, here's your general topic, here's the code template to start with. And across these different areas of research, you get different results. So for Table 3 you get, you know, 38 papers completed out of 51, then 20 out of 52, et cetera.
So not quite a fully independent AI researcher here, but still, you know, pretty independent; these are pretty broad areas of research. Next paper: Imagen 3. So we've covered Imagen 3, the announcement was a little while ago. This is the text-to-image generator from Google, and now we have the technical report out on arXiv. And as you might have expected, looking through this paper, there's not much you can glean aside from evaluation.
That's pretty much what they present in this paper: a little bit on data, a lot on evaluation and risks and so on, saying that this is very impressive. It also includes a model card, so that does tell us a little bit, like the training data set: the Imagen 3 model was trained on a large dataset comprising images, text, and associated annotations.
So not super useful, um, a little bit more details on the hardware front, uh, but in general, I think, uh, mostly focusing on the results and things like responsible deployment still, you know, always fun to get a little more insight into these frontier models and nice to see Google still releasing technical reports.
Yeah, no, that's true. And one thing they do say on the data side is that the model was trained on a mix of synthetic captions, generated using Gemini for each image, and also original human-written captions. So somehow they're combining the two; we don't know how exactly. There are filters involved, of course. But they say they use multiple Gemini models
And instructions to maximize the linguistic diversity and quality of these synthetic captions. So presumably, you know, this is about capturing different ways of expressing what's in an image to make the model more robust. So language models supporting vision models all the way down. It's just turtles, turtles as far as the eye can see.
Right. And to be fair to the paper, they do compare on some things like visual appeal, where Imagen 3 actually loses out to Midjourney v6 on various benchmarks. So, you know, they're not just tooting their own horn, which is actually kind of interesting. They do beat it out on prompt-image alignment and some things like reasoning. So Midjourney v6 is still the prettiest image generator out there, apparently. Just a couple more research papers. The next one is the Data Addition Dilemma.
And this is kind of interesting. They say that, in general, counterintuitively, adding more data when training may not always be what you want. So they demonstrate that adding training data in a multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced performance on a given subgroup.
So, uh, you know, it's, it's a little bit niche in the sense that it is looking at kind of a few types of topics and not necessarily as like general purpose as something like a chat GPT. Uh, but, uh, in, in some sense, it's an interesting, maybe conceptual result.
I think, you know, I was looking at this paper, there's a lot of talk about it on Twitter, or on X, this week. I guess I didn't find it that moving, because it seemed like a pretty unsurprising result. Like, if you think about it, I mean, the argument here is that we should be surprised that more data is not always better, especially when that data comes from many different sources.
Now, um, if you think about it, like, let's think about it from a language model standpoint, for example, um, you know, data coming from different sources, you could think of it as like data coming from different writers with different styles, right? And, uh, if you're going to train a model on a large amount of texts that is just written by Andre, that model is going to answer like Andre.
Um, but now if you add a little bit of text from me, then all of a sudden the model, before it starts like generating the text or deciding what it wants to write, it's got to also decide which style to key into. And now there are kind of two styles to decide between. So it's got to consume some of its compute, some of its reasoning energy, if you will.
as it tries to figure out which writer to emulate. This is the same problem that they're flagging in the medical domain, where they're basically saying, okay, well, we've got data from different hospitals, and when we combine that data together, sometimes that kind of throws our system off, and adding more of that data doesn't help. Well, to me, that seems kind of unsurprising, because essentially what you're doing is you're adding potentially a small amount of data from other sources that doesn't quite match the same trends; it's not in distribution relative to the data you already had. And so, no surprise, the model is just thrown off, because until it has enough data to actually fully master that additional new distribution, it's essentially just being confronted with an additional bit of confusion, an additional layer to the problem, where now I have to figure out which hospital, in some sense, the data came from, or which hospital's data this data is most like, before I can make a prediction. So, you know, I mean, it's really great to see it quantified, and they've done a great job. I should really say this is not a knock on the paper at all. They've done a great job of proposing a bunch of solutions and fixes.
And it's great to have quantitative investigations of this kind of phenomenon. But at a high level, the thing that is surprising here is not, or I don't think ought to be, the fact that, guess what, models can't reason out of distribution very easily. It's more the empirical investigation and some of the solutions.
So, an interesting paper, but I don't know. And I'm curious if listeners have another perspective here, if you were involved in writing the paper or something like that, but I didn't find the thing that was being discussed on Twitter to be the thing that was most interesting about the paper.
Right. Uh, yeah, I don't think the paper itself frames it as, like, scaling is not always right. The question posed is: when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? And they do establish a theoretical framework, a way to evaluate and answer that question. Most likely on Twitter, people were saying, oh wow, maybe scaling is not the right thing, blah, blah, blah. But this paper is a little more specific.
Yeah, no, sorry, I didn't mean to suggest that they were saying scaling doesn't work on that basis, but more just that, like, yeah, the discovery that data from different sources doesn't always improve things, like there's this additional layer of work that you're forcing the model to go through, that was kind of, I don't know, I guess, to some of the people I've been talking to, it just seems like a pretty clear inference that this would happen. But anyway, who knows? Different bubbles, right? So that's how it goes.
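For readers who want to see the shape of the multi-source experiment being debated, here is a toy sketch using synthetic "hospital" data and scikit-learn. It only mirrors the setup, not the paper's datasets or findings, and adding the second source will not always hurt; the point is just that it can.

```python
# Toy version of the multi-source setup: train on hospital A alone vs. A plus a
# shifted hospital B, then compare accuracy on hospital A's test set.
# Synthetic data only; the effect depends on the shift and amounts of data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_hospital(n, shift):
    """Generate a simple binary-classification dataset centered at `shift`."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > shift).astype(int)
    return X, y

XA, yA = make_hospital(500, shift=0.0)           # source A (the subgroup we care about)
XB, yB = make_hospital(500, shift=2.0)           # source B, a shifted distribution
XA_test, yA_test = make_hospital(500, shift=0.0)

acc_A_only = LogisticRegression(max_iter=1000).fit(XA, yA).score(XA_test, yA_test)
acc_A_plus_B = LogisticRegression(max_iter=1000).fit(
    np.vstack([XA, XB]), np.concatenate([yA, yB])).score(XA_test, yA_test)

print(f"trained on A only: {acc_A_only:.3f}")
print(f"trained on A + B:  {acc_A_plus_B:.3f}")  # may be lower despite more data
```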
Next paper: LongWriter, on unleashing 10,000-plus word generation from long context LLMs. So they say that LLMs struggle to generate outputs exceeding 2,000 words, and that, according to the paper, is due to the scarcity of examples of long outputs used during training. And so they address this with AgentWrite, a pipeline that breaks down ultra-long generation tasks into subtasks and can lead to LLMs generating coherent outputs exceeding 20,000 words.
They also create LongWriter-6k, which is a dataset with outputs ranging from 2,000 to 32,000 tokens in length. So another example of an agentic paper, where you take an existing model and then put it into a framework and a set of instructions that leverages it to do things the base model cannot do. And, you know, a pretty significant thing potentially, being able to generate really long outputs.
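A rough sketch of that plan-then-write decomposition, in the spirit of AgentWrite but with a stub chat() function and made-up prompt wording rather than the paper's actual prompts:

```python
# Plan-then-write decomposition: ask the model for an outline with per-paragraph
# word budgets, then write each section in turn, feeding back what exists so far.
# chat() is a stub so the sketch runs; swap it for a real chat-model call.
def chat(prompt: str) -> str:
    if prompt.startswith("Write an outline"):
        return "Paragraph 1 - introduction - 800 words"
    return "lorem ipsum " * 50

def write_long(task: str, n_sections: int = 15) -> str:
    outline = chat(f"Write an outline for: {task}. "
                   f"List {n_sections} paragraphs, each with a word budget.")
    sections, written_so_far = outline.splitlines()[:n_sections], []
    for section in sections:
        text = chat(f"Task: {task}\nPlan: {outline}\n"
                    f"Already written (tail): {''.join(written_so_far)[-2000:]}\n"
                    f"Now write this section, respecting its word budget: {section}")
        written_so_far.append(text)
    return "\n\n".join(written_so_far)

draft = write_long("a 30,000-word article on the history of the Roman Empire")
print(len(draft.split()), "words generated in the toy run")
```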
Yeah. Yeah. They're making the case that, you know, typically in fine tuning datasets, you don't tend to see a lot of long output examples. Like you don't see a lot of like, you know, giant pieces of text that might push the limits of like 200, 000 words or so. And as a result, when you fine tune on those datasets, you tend to bias the model towards generating shorter outputs. So to fix that, you.
You know, break it down so that you basically iteratively have the model write in chunks of, say, 2,000 words or whatever it will put out, stitched together to make a longer document. And now you have a dataset you can use for supervised fine-tuning. So they have, anyway, done that. And, yeah, it's kind of interesting. It's not something that I've actually run into day to day.
So just, I guess it goes to show sometimes you don't need the long, um, long form text generation at that, to that degree. Um, but, uh, I'm sure there are tons of use cases where you would. So really interesting to see this result.
This does come, by the way, by way of Tsinghua University and Zhipu AI. We talk about the latter quite a bit on the show, and Tsinghua University is, you know, a very big, prestigious university in China with somewhat of an AGI focus these days as well. So there you have it.
Right. And just to really quickly cover it, as you might expect, the general idea is you first plan an outline, you know, paragraph one through paragraph 15, with each one having some word requirements, for an example of a 30,000-word article on the history of the Roman Empire. First they break it down, then they write each individual bit. So it might lead to some of the things people have seen when, like, people put out these spammy books on Amazon.
They do repeat themselves often and have these kinds of artifacts; that could still be the case in this paper, but I'm sure they have techniques to address that. Onto policy and safety. The first story is a safety one: MIT researchers release a repository of AI risks. So this is a database of over 700 AI risks that is meant to guide stakeholders in industry, academia, and policymaking. There's a categorization by causal factors, domains, and subdomains.
And this was created in collaboration with a bunch of other organizations, the Future of Life Institute, for instance, and AI startups as well. They do say that third-party frameworks can overlook certain risks, like the pollution of the information ecosystem, a.k.a. AI-generated spam. And so, yeah, presumably pretty useful, if nothing else, to document the various negative impacts that can happen due to AI models.
Yeah. And this has been, I mean, when we were doing our first investigation into the kind of, you know, safety and security situation around frontier AI development, one of the things that kept coming up, engaging with government stakeholders and this and that, was how exhaustive do you need to go, right? Like, how can you possibly categorize all of the risks associated with AI and have a comprehensive overview of the space?
And, you know, our take was ultimately, well, we're going to zero in only on risks that we feel rise to the level of catastrophic national security emergencies, and not even bother trying with the rest. And that's one way you can carve out the problem to make it manageable. But this is a really needed thing, because in that frame, right, we would keep running into the situation where people would come up with new risks to flag.
They'd be like, Oh, what about this? What about that? And you do your best to cluster them, but it is really hard. And in that context, you can imagine how hard it might be to write legislation. Right. You have one legislative office. It's like, Oh, I want to deal with, you know, bio risk from AI and cyber risk. And then you have another, it's like, Oh, what about information operations? And now they've got to negotiate, decide whether it's two separate bills, one bill.
If you put it all in one bill, how many different measures are you going to have? How specific is it going to be? You know, all that jazz. So having this kind of comprehensive repository, I think, is a really, really good move. Harmony Intelligence, by the way, is a really interesting company in its own right. That's the company that the leader of this project was from.
But anyway, on the coverage, they're saying: we found that the average framework mentioned just 34 percent of the 23 risk subdomains we identified, and nearly a quarter covered less than 20 percent. No document or overview mentioned all 23 risk subdomains, and the most comprehensive covered only 70 percent. When the literature is this fragmented, we shouldn't assume that we are all on the same page about these risks. Which is very, very true.
So yeah, good to see this research out there, and hopefully it helps to give people a bit of a touchstone so they can orient around the risk classes that they care about. That's
right. If you want a mega AI risk framework, this is kind of doing that, unifying all the existing safety frameworks, of which there are a ton, into one all-encompassing, I guess, framework. Next, the story is: Elon Musk addresses power issues at xAI supercomputer facility in Memphis. So just recently we were talking about this facility that's being created, meant to house, I think, is it 10,000 or a hundred thousand GPUs? I forget. I think a hundred thousand. Yeah.
Yeah, it might actually be more by then. I think the first cluster, anyway, they'll grow it up to that. Yeah.
Yeah, they used 10,000 for the first run of Grok, and they aim to use 100,000 for this one. And so they have been in the process of creating this facility. We covered how that is actually pretty disruptive, because you need new infrastructure for power, for water processing, things like that.
And so this article just mentions briefly how, in an episode of a podcast last week, Musk did say it's not working yet due to some, quote, power fluctuation issues, what he called extreme power jitter.
power jitter. You've just been working with ordinary power jitter, but Elon gets to work with extreme power jitter.
Yes, extreme power jitter, giant shifts of around 10 megawatts several times a second, that kind of thing. That's about all we hear. And so this is just reinforcing the knowledge that when you try and build a supercomputing facility for these kinds of AI models, that comes along with a lot of needed infrastructure, and it does consume crazy amounts of power.
And it was unclear from this conversation what exactly the issue was, but one common issue is, when you're doing a training run, you need a constant, fixed rate of insanely high power consumption. So you need high baseload power, is what that means. You can't have fluctuations. So, you know, wind and solar can be a real issue, because the sun isn't always shining and the wind isn't always blowing.
You quickly find yourself in a situation where your training run just can't operate, which is part of the reason why people have criticized, for example, Meta for building wind and solar plants to supposedly power some of their AI training runs, or to accompany a lot of their data center build-outs, because that stuff doesn't actually solve the problem.
It looks good from an environmental standpoint, makes a good headline, but it's not actually addressing the issue. So, um, yeah, anyway, part of the reason why people are getting more excited about nuclear for this.
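For a rough sense of scale, some back-of-envelope numbers rather than xAI's actual figures: an H100 draws roughly 700 watts at full load, so the GPUs alone in a 100,000-GPU cluster sit on the order of 70 megawatts before servers, networking, and cooling overhead.

```python
# Back-of-envelope power math for a 100,000-GPU cluster. The TDP and the overhead
# factor are rough public ballpark figures, not xAI's actual numbers.
num_gpus = 100_000
gpu_tdp_watts = 700          # approximate H100 board power
overhead_factor = 1.4        # assumed extra for CPUs, networking, cooling

gpu_mw = num_gpus * gpu_tdp_watts / 1e6
total_mw = gpu_mw * overhead_factor
print(f"GPUs alone: ~{gpu_mw:.0f} MW, with overhead: ~{total_mw:.0f} MW")
```

Against a baseline like that, swings of roughly 10 megawatts several times a second are a meaningful fraction of the whole facility's draw, which is presumably what makes the jitter so painful.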
One more story in this section: the FCC proposes new rules for AI-powered robocalls. So this is the Federal Communications Commission in the U.S., and this is related to companies and businesses using AI to generate robocalls and texts, and having to disclose the use of AI in those kinds of things. So this could lead to businesses needing to revise scripts and potentially, you know, not use AI if that is actually a problem for those businesses when communicating
with customers. Presumably a good call. We do see now, with ChatGPT, with ElevenLabs, that you can get super realistic voices, and some companies may want to use that to kind of replace humans. Good that at least we'll know if we're talking to an AI voice, if we do get these kinds of calls.
Yeah. And there has been a really large rise in robocalls and this sort of thing in the last year, actually. When I talk to people in the space, they've apparently seen a huge spike. So this is not final; the rules may come into effect after a period of public comment. So you can, if you want, get in touch with the FCC and let them know what you think of notice FCC 24-84, which is what this is.
And just a couple more stories in our synthetic media and art section. The first one is going back to a topic we have covered quite a few times now: SAG-AFTRA, the union for actors and performers. And this time around, it is about how they have a deal with the startup firm Narrativ for digital voice replicas. So this means that there's now a deal for the use of audio voice replicas of actors and performers in digital advertising.
And this would mean that actors can create an AI version of their voice, and there would be Narrativ's online marketplace where advertisers can create audio ads, essentially, using these AI tools. And the 160,000 members of SAG-AFTRA can add themselves to the database to connect their voices to advertisers.
So kind of an important development, in the sense that a lot of the concern, and some of the reasons for striking that we've seen in the industry over the past couple of years, has been this question of what do you do with replicas, with kind of AI versions of actors, that might be used in ads and so on. This seems to be a pretty significant answer to that, at least in the audio domain.
Yeah, it is. It's also, you know, a bit of a challenge in terms of the already significant problem in the space where you have a small number of really big winners, and it's very difficult for, like, lower-tier actors, for example, to break in. There's not a lot of work to go around, you've got a couple of big winners, and this is going to make that even worse, right? Because one of the things that previously made it challenging for the big winners to scale themselves even further, to become even bigger winners, is that they could only do so much work. They could only do the work of one normal person. If you can automate the voice, you're essentially scaling up the footprint of that performer. And so, you know, even with agreements like this, there are fundamental economic drivers with AI that make this stuff interestingly challenging.
So yes, you may find that people are compensated more fairly for the work they do do, but you may find that people like, you know, Morgan Freeman are then in crazy, crazy high demand and their voice is popping up everywhere. Whereas, uh, you know, others may struggle more. So kind of interesting. We're going to have to see what actually in practice ends up happening with the market dynamics around this.
Do we see people kind of spreading out, trying to get a wider range of more diverse voices that people don't recognize, or are we going to see people double down on the same small number of winners? But either way, this is going to be precedent-setting and a very interesting development in this space.
And, as you might expect, the other bit of this: not only are the actors going to get paid, they do have full control over how this replica will be used. So they choose how much they get paid, they can specify ad preferences, and they do have to confirm, give a thumbs up, for every single use, so they can read the copy, they can listen to it, and they can refuse to have their voice used. And yeah, this is setting some precedent for future replicas.
And one last story, this one from the New York Times, kind of an interesting one. This is how deepfake Elon Musk has become the Internet's biggest scammer. So you may have seen this if you've been on Twitter, for instance, where fake news is pretty prevalent. There's been a lot of scams being done with deepfakes, and in particular, deepfakes of Elon Musk in web ads for things like crypto, for things like giveaways, things like, you know, send me X Bitcoin to this wallet and I will send you way more. And these kinds of deepfake videos are promoted on social media platforms, including paid ads on Facebook, and also Twitter, presumably. And they say that Elon Musk is featured in nearly a quarter of all deepfake scams since last year, with nearly 90 percent of those focused on crypto.
Uh, there are some other examples of deepfake ads with Warren Buffett and Jeff Bezos. So, uh, I guess not surprising in a sense that Elon Musk might be the top, uh, deep fake out there. He is, I guess, kind of a crazy guy. So you might believe the scam as opposed to Warren Buffett doing some weird crypto stuff. Uh, but once again, you know, do be aware that scammers are starting to use deep fakes in a big way.
Yeah. And make sure to send your Bitcoin to the crypto wallet address that is included in the show notes. We do appreciate it. We will, of course, send you back double the Bitcoin for any payments sent our way. Anyway, this is a fun story. I mean, it's interesting. It'll be interesting to see what Elon does with this.
Cause obviously he has a commitment to, as much as possible, you know, propagate fully open discourse on X. Obviously that does not include scams and stuff like that. Like, you know, Elon has some lines on what he thinks of as acceptable speech on the platform. But trying to identify these things is gonna be really, really hard.
Deepfake detectors are gonna be important, but that arms race, right, between generation and detection, is gonna get harder and harder to win. So we'll see. That 90 percent number, I can't believe that Elon is in 90 percent of these.
Well, 90 percent of the ads are focused on crypto. Elon is in a quarter of all deepfakes,
but still. Can you imagine if that was you, though? I mean, like, you just see your face promoting all this stuff; that'd be something else. So I imagine this is on his radar and he's probably interested in doing something about it. So we'll see. We'll see. And, uh,
you know, it seems a bit funny, but it's also a real issue. They do have an example in this article of one person from Texas who said he lost $36,000 worth of Bitcoin after seeing something like this in a YouTube video. And, yeah, there's a lot of tragedy going on via these kinds of scams. And as an example, YouTube did say it removed more than 15.7 million channels and 8.2 million videos between January and March of this year. I
cannot believe it. A lot of those
are not necessarily fakes, but you must imagine a lot of them are. I
can't believe they got all 8.2 million of my videos. I'm so bummed. Anyway. Yeah, it's a lot of deletion.
Alrighty. That's it for this episode. Fewer stories than usual, but somehow we managed to go for as long as we usually do. We are very skilled at that, it seems. Thank you if you did stick around to the end; we do appreciate it. If you review the podcast, if you share it, if you just email us with some thoughts and questions, and letting me know that I messed up on publishing the latest episode, that's very appreciated. And definitely do keep tuning in and enjoy this AI outro.
X unleashed through the image's line. Flag on marble, we're reaching the sky. Oh, every cannon down, you'll see. News from AI, it's last week's spree. Get ready, cause there's so much to know. Tune in, we're stealing the show. Story's fresh, straight off the press. AI's the buzz, we're ahead of the rest. Falcon's AI lab, making futures bright. Oh, we're breaking it down, you'll see. The strong AI, it's heart beats free. Get ready, cause there's so much to know. Tune in, we're seen as a show.
Keep your eyes open, the world's on the slide. From labs to the end, now there's no turning back. Tune in.
Hey guys, audiences, push in the frame, new visions every day, blink and you'll... Oh, we're breaking it down, you'll see The song that I, it's got to be free Get ready, cause there's so much to love Tune in, we're stealing the show From up to all things green Behind every discordant is a break room Stay with us, the future is bright Every story goes new ground Oh, we're breaking it down You'll see, the song that I, it's what we sing.
Get back, cause there's so much to know. Through the end, we'll see that we're sure.