
#165 - Sora challenger, Astribot's S1, Med-Gemini, Refusal in LLMs

May 05, 2024 | 2 hr 33 min | Ep. 204

Episode description

Our 165th episode with a summary and discussion of last week's big AI news!

Read our text newsletter and comment on the podcast at https://lastweekin.ai/

Email us your questions and feedback at [email protected] and/or [email protected]

Timestamps + links:

Transcript

Andrey

Hello and welcome to the latest episode of the Last Week in AI podcast, where we chat about what's going on with AI. As usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we did not cover in this episode. I am one of your hosts, Andrey Kurenkov. I finished my PhD at Stanford last year and I now work at a generative AI startup.

Jeremie

And I'm your host, Jeremie. I'm, of course, the co-founder of Gladstone AI, which is a national security and AI company. We work on all sorts of issues, from, you know, WMD-level risk from weaponization of AI to loss of control. And this week, I guess we're kind of, I don't know, I was gonna say we're sort of blessed that there isn't as much volume. There are some really high impact stories, but less volume than maybe usual. So maybe a bit lighter.

Andrey

Yeah, I would say comparatively speaking, there's no like super huge news to be covering, but there's going to be some, good tidbits, some cool research papers in particular, and all the usual types of announcements that you've been seeing over the last few months as AI continues its rapid pace of development and, you know, release across every single platform in existence. So how about we just go ahead and dive in, starting in the Tools and Apps section?

And first up we have GitHub releases an AI-powered tool aiming for a radically new way of building software. So this is a technical preview of the new GitHub Copilot Workspace, which is a developer environment that builds on top of GitHub Copilot, which is a very popular tool for AI code completion. This is more of a development environment where you can do all sorts of stuff.

It basically almost acts like an assistant, in a way, where you can instruct the workspace what your project is, and it can go ahead and set up kind of starter files. It can also go ahead and run code and test it, so much more coverage of the entire development process as opposed to just writing code. So yeah, pretty interesting to see. And this will be in GitHub, so you can use it now in the web interface and on desktop, rather than just independently.

GitHub Copilot you would typically use in a different program as an integration. Interesting to see them expanding on the Copilot project, I guess.

Jeremie

Yeah. And it's interesting to see GitHub continue to be like the clear leader when it comes to, you know, the productization of AI in the context of coding. Right. This goes all the way back to really the Microsoft acquisition of GitHub, because of course they now sit under that Microsoft umbrella along with OpenAI. That agreement means they have priority access to all of OpenAI's

products. And early on we saw GitHub use OpenAI's Codex, which is like the OG coding model, the first coding autocomplete that was really rolled into production in a big way. GitHub leveraged that post-acquisition in a pretty significant manner. And so what we're now seeing is this era of shifting from, you know, code autocomplete tools, which are very useful, especially for automating, like, you know, more mundane functions.

Everybody uses it for that. But like moving beyond that to, you know, have an infrastructure-aware workspace that's all AI powered. You know, as many people have said, right, it takes more to code than just writing lines of code in Python. You need an awareness of the programming environment, of the execution of the program, you know, debugging, all that stuff. This is all part of widening the lens a little bit. And, you know, Microsoft and GitHub clearly are focused on that.

It's also part of this trend towards more and more agent-like design. This may not seem on the surface like one of those plays, but really what we're talking about is, you know, an AI-powered workspace that's kind of like a context-aware agent that's going to support you by doing a lot more than just writing code based on the code in the window that you're looking at that's

already been written. So, you know, when you look at like the next beat, the next beat in terms of automation of software development workloads, this really is that path. I think it's a really interesting trajectory. And obviously we're going to see a lot of emulation of this down the line. But, yeah, this really is all the things you might expect for the next beat.

And I do think, you know, there's been a lot of talk about: is AI plateauing, are large language models plateauing? I think applications like this are exactly indicative of the fact that a lot of the talk about the plateauing is over-focused on a sort of saturated use case. Right. This idea that we're going to chat via some chat interface with, like, ChatGPT or something. But what we're seeing right now is language models can only get so impressive in that context.

I mean, you can't get much more impressive than where we're currently at, but really agentization and sort of the ability to solve whole problems, that's the next frontier. And we are seeing really big leaps and bounds. This is going to be a really interesting space to watch. But this product in particular, really interesting when we start to look at, you know, full-stack automation of the development life cycle. You know, not there yet, but this is definitely a move in that direction.

Andrey

That's right. And I think my initial description was a little vague. So I can get a little bit more into the details of how this looks. They released a video walkthrough that kind of shows the way this would work, in particular related to addressing issues. So you can do project planning in GitHub.

And that's one of the things that it supports, and they show how, starting from an issue, you can use this workspace to kind of go through all the steps of development, starting with a specification of what needs to be implemented, followed up with a plan. And finally you have implementation and testing. And throughout all those steps, you have these AI-supported tools that can provide suggestions and help you go through that process.

So yeah, I do agree with your point that this is showing how, you know, it's at a point where it requires thinking about how to push further. We are saturated on just ChatGPT. But, for instance, in this domain there is the possibility of a more structured workflow where the AI fulfills various specific tasks to help you accomplish whatever work you have. And there's a lot that can still be done potentially, and this is a good example of that.

And next story, we have China unveils Sora challenger able to produce videos from text, similar to OpenAI's tool. This is from a startup, Shengshu Technology, in collaboration with Tsinghua University, so it's not from the country of China per se, and the AI tool is called Vidu. So Vidu can produce HD resolution videos up to 16 seconds long from prompts. And as you might guess, if you go and look at the examples they posted, they are pretty impressive. They are pretty high resolution.

High fidelity; maybe, you might argue, not quite at the Sora level, but nevertheless pretty impressive, and maybe not surprising to see a Sora competitor coming, you know, now, a couple months, I guess, after Sora was unveiled. But definitely I'm sure we'll see more Sora-esque models throughout this year.

Jeremie

Yeah, and there's definitely a lot of nuance here. Right? Like, one of the big things is this is clearly much less capable than Sora in that it can only produce videos of up to 16 seconds, whereas Sora can do 60 seconds. That may not sound like a lot, but when you think about the reasoning you have to do in order to be able to make coherent videos across 60 seconds versus 16, the complexities compound

and it's easier for things to go off the rails. So, you know, there's an argument there that that's actually potentially a fairly significant thing, though, as we will cover at the end of the podcast, there is an interesting bit of news that came out around Sora actually being kind of finicky to work with, too. So it really is difficult to get an apples-to-apples comparison

here. I think one of the things this does show is China absolutely has some pretty impressive domestic capacity here, assuming that the demos can be trusted. Always, always, always the caveat when you look at Chinese companies and Chinese labs. You know, we've seen things in the past, especially on the language modeling side, in the kind of 2021, 2022 era. I remember looking at a lot of papers that, you know, at the time looked good, and then it didn't quite pan out. So, you know,

trust but verify. But this definitely looks like an interesting possible advance. As you said, it's a joint effort with Tsinghua University. And to your point, I mean, yeah, you're right, this isn't "China unveils a Sora challenger," but notably Tsinghua University has an open affiliation with the People's Liberation Army. Right. So all Chinese companies to begin with are legally required to do whatever the Chinese Communist Party

tells them to do. So there is like that initial link, but Tsinghua is explicitly affiliated with the PLA specifically. So this is, you know, I mean, it's not China unveiling this thing, there are always those links in the background, but you're absolutely right to call that out. This is a model that can pump out videos at 1080p. So this is, you know, high resolution stuff. The claim, again, we've heard this right from OpenAI when Sora came out, was that this is a physical world simulator.

It can simulate physics. It's got a world model, as they sometimes put it. The same claim is being made here by the developers of this system, and they've put out a bunch of demo clips, you know, they've got videos of a panda playing the guitar while sitting on grass and things like this, all looking kind of vividly detailed. But a couple of interesting notes here about the hardware piece, at least for me, as a bit of a hardware nerd.

You look at the hardware situation in China, of course, they're struggling to get their hands on advanced processors. Apparently, a system needs eight A100 GPUs to run for more than three hours to produce a one-minute clip. So that's a lot in some ways, not so much in others. But it's interesting to note, you know, you're talking something on the order of, like, $100,000-plus to actually just run one of these things, and that takes you three

hours even at that. So a little bit of quick background finally on the company. You know, this is a company that was founded pretty recently. It's got a really good pedigree. So the founders come from the university. They've got folks from Alibaba, from Tencent, from ByteDance. And of course, their investors are pretty solid. So, Shi Ming Ventures, which was an early investor in at least ByteDance, I think maybe Xiaomi and a couple other

companies. But anyway, really, really big VC there, Baidu Ventures as well. So a lot of big backers to this company. And yeah, we'll see what they pump out next. But so far, these early results seem suggestive that domestically, China's so far able to, I wouldn't call it fast following, but it's definitely in the same rough ballpark. Maybe like a year behind or something, though that gap may widen, of course, as the export controls do their thing.
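
As a rough sanity check on the hardware figure mentioned above (eight A100s running roughly three hours per one-minute clip), here is a minimal back-of-envelope sketch; the GPU purchase price and cloud rate are illustrative assumptions, not reported numbers.

```python
# Rough back-of-envelope check of the figure discussed above
# (8x A100 for ~3 hours per one-minute clip).
# All prices below are assumptions for illustration only.

A100_PRICE_USD = 15_000          # assumed purchase price per A100 (varies widely)
A100_CLOUD_USD_PER_HOUR = 2.0    # assumed on-demand cloud rate per GPU-hour
NUM_GPUS = 8
HOURS_PER_MINUTE_OF_VIDEO = 3

capex = NUM_GPUS * A100_PRICE_USD
gpu_hours = NUM_GPUS * HOURS_PER_MINUTE_OF_VIDEO
cloud_cost_per_clip = gpu_hours * A100_CLOUD_USD_PER_HOUR

print(f"Buying the GPUs outright: ~${capex:,}")              # ~$120,000
print(f"GPU-hours per one-minute clip: {gpu_hours}")          # 24 GPU-hours
print(f"Renting instead: ~${cloud_cost_per_clip:.0f} / clip") # ~$48 per clip
```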

Andrey

That's right. I think that about covers it. And once again, you can just Google it; they have a video up on YouTube titled "Meet Vidu." Yeah, I guess you can just Google Vidu, and just looking at it right now to remind myself what it looks like, it's a little not quite Sora level. There's definitely more jankiness. You can see it's pretty obviously AI generated, but

it's a pretty new company, they were founded last year, they do have a bunch of funding, so fair to say they'll keep working on it and potentially will be able to catch up. I guess we'll see. Moving on to the Lightning Round with some shorter stories. First up, ChatGPT's memory can remember the preferences of paying customers. This is just a feature we did cover before, I think, but now it is available to all users; it's specifically for ChatGPT Plus paying subscribers.

So you can store various tidbits in the memory related to your usage. You can say, you know, I am a teacher, I want your responses to be short and to the point and informative, etc., etc. It can actually modify its own memory during conversations, but you can also go ahead and look at what it has stored in memory and modify it directly. So it has these specific tidbits, and you can delete them or modify them as you wish.

So that's about it. Yeah, it's being rolled out after being tested for a small set of users, and presumably it's another reason, if you are in the ChatGPT kind of usage ecosystem, you might want to stick around.

Jeremie

Yeah. They're also highlighting that, in a change from the earlier version of this, ChatGPT will now actually tell you when its memories are updated. So that's kind of an interesting transparency thing. And maybe a bit of a discovery OpenAI has made about user experience perhaps being reflected there. People want to know, you know, when these memories are updated.

One kind of interesting note here too is this is going to be rolled out, but it will be available for ChatGPT Plus paying subscribers only outside of Europe and Korea. And they're not telling us why Europe and Korea are being excluded. So anyway, it makes you wonder about the regulatory environment or what else might be playing out there, but, yeah, it seems like a cool rollout and something I'll definitely be testing.

Andrey

Next up, the Rabbit R1. So we covered the Humane AI Pin, I think last week or maybe two weeks ago. This Rabbit is another one of this kind of AI-in-a-hardware-device category of product, and we are now starting to see some of the early reviews for it being rolled out. And the response is, I will say, similar, though not quite as negative as it was for the Humane AI Pin. The Rabbit R1 is much cheaper, it's $200. It's this little square thing with a screen, similarly to Humane.

You basically access an AI through it, so you can query it with different questions and it can respond to you in voice and also on the screen. And to summarize, you know, a lot of different reviews: it's pretty finicky to use, and it is not too clear what benefit there is to it over just using your phone. Again, similar to what the response was to the Humane AI Pin. It also lacks a lot of basic features at this point, so it's almost like an alpha release, also similar to the Humane AI Pin.

Maybe these types of products will become more useful over time as they get more mature, but I guess the result is probably it's not worth it to be an early adopter for this kind of thing right now.

Jeremie

Yeah, well, you look at the cons list, and this is all on Tom's Guide. You know, we use this website every once in a while, especially for hardware stuff. You know, the cons they list here: poor interface and sluggish scroll wheel, can be slow to respond, short battery life, vision feature unreliable, Uber and DoorDash integrations don't work well or at all.

Though on the pro side they say fun and light design, voice search can be helpful, and it can help you get stuff done without apps (sometimes). So, you know, very much a mixed review here. Look, I remember seeing on Twitter, or the platform formerly known as Twitter, I remember seeing somebody post a review. It's like a one-minute thing. And he was making this point that I think is really, deeply true about a lot of these products.

What we're starting to see is a lot of times people are shipping hardware and making promises about future versions of that hardware, basically saying, look, I know it's shit now, but it's going to be decent once we roll out, you know, some software that we're promising down the line. And this is really interesting,

especially in AI. Right. It can make sense, because in the case of AI, things are moving so fast, it can be legitimate to say, look, we're waiting for the next small language model to be able to, like, you know, pack it onto the edge device and deliver some real punch, so then it'll unlock all these use cases. That's valid. But it's also interesting that this is sort of a way of offloading a lot of the risk that traditionally startups have borne in the VC process.

Right? Like VCs invest in startups with the promise that startups will create value down the line. Well, here startups are actually forcing customers to invest in them by buying the product ahead of time with the promise of future product viability. This is basically passing the risk down the chain that the VCs first offloaded to the startups. That's an interesting economic effect. It's, you know, certainly not the first time that this has happened in AI hardware.

We see that with, you know, Tesla Autopilot. And we'll have some stories about that today too. But this idea that, you know, you first ship the hardware, make a bunch of promises, and maybe it does, or maybe it doesn't work out down the line. You as a consumer start to look a little bit more like an investor in this context, right? You're putting money up front with the promise of future reward.

So maybe that's viable, and maybe that's just the way things work because of how AI scaling and progress is happening. And, you know, it is the case that you can rely on unlocking a lot of value down the line, but it's really hard to know what that does to consumers and whether they actually end up kind of playing along with this or, you know, do we truly have to just wait until things get unlocked? It's really interesting.

We're going to learn a lot about this in the coming months and years.

Andrey

Next up, Amazon Q, a generative AI-powered assistant for businesses and developers, is now generally available. We covered this as well a while ago. This is a chatbot that is designed to accelerate software development and help businesses leverage their data. This Q thing can generate code, test, debug, and implement new code, and also connect to enterprise data repositories to answer business-related questions. And now it's rolling out to be generally available.

And there's also this preview of Amazon Q Apps, which will allow employees to build generative AI-powered apps based on their company's data, apparently using natural language queries. So yeah, Amazon rolling out their AI initiatives more widely.

Jeremie

Yeah, Amazon, I think, is generally playing catch-up a little bit, obviously, with the other big guys. So it'll be interesting to see how this stacks up to existing products on the market, because, you know, they have a lot of ground to make up, a lot of catching up to do. They do have the benefit of distribution, right? They are AWS, they're already in all these enterprises. So make no mistake, that's going to be a big factor.

But yeah, we'll have to see how it actually stacks up user experience wise.

Andrey

And one last quick story: Yelp's assistant AI will do all the talking to help users find service providers. So Yelp is launching Yelp Assistant, an AI chatbot that can match you with local service professionals. The chatbot can ask you specific questions about your needs and then suggest providers in your area and even send project requests on behalf

of you. So yeah, another chatbot that kind of augments search and potentially does some of the work that you would otherwise have to do, by taking over a bit of the talking.

Jeremie

Yeah. And they're also announcing at the same time that they're updating what's called the Yelp Fusion API, which hooks up Yelp data to third-party platforms, and basically saying, look, as part of this rollout, Perplexity is actually adding Yelp's content to its search results through that API. So that's kind of interesting.

Apparently there's a 30-day free trial for its partners to try that out, but Yelp is very much diversifying a little bit on the AI side, maybe feeling a bit of a squeeze, too. You know, it's hard to know how these service discovery platforms end up faring in the generative AI world, because you're potentially getting a lot of that value from generative search. It's sort of, I guess, an unknown at this stage.

But they're, you know, looking to see how they can integrate with these services and maybe develop something like that themselves. Let's see what Yelp does next.

Andrey

And on to Applications and Business. And the first story is titled "Video of super fast, super smooth humanoid robot will drop your jaw." And that, I guess, depends on your expectations, but the video is pretty impressive. The video is of the S1 robot from the Chinese company Astribot. And to give kind of a broad description, this is a humanoid-ish robot. It has two arms and a head with a bunch of cameras. The arms are really bulky and can apparently handle a lot of weight.

Ten kilograms per arm. And the video shows it executing various tasks, like kind of organizing things on a table and putting them in little containers, pouring wine, things like that. The company was founded in 2022, and apparently it only took a year or so to develop. And they say it will be commercially available later this year. So, yeah, another humanoid-ish robot, although it's not clear if it can walk; from the video, it just has the upper body.

But nevertheless, I think notable again, because we've been talking a lot about humanoid robots. We've been seeing a lot of demos, and this is another one, this time coming from a Chinese company.

Jeremie

Yeah, a company that I'll tell you I'd never heard of, Astribot. Apparently they're a Shenzhen-based subsidiary of Stardust Intelligence, which I had also never heard of. But they kind of mention it as if I'm supposed to have heard of them. So that's good to know. The website does list a bunch of really interesting stats and compares the performance, or the stats, of, like, the Astribot S1 bot to an adult male.

They talk about, you know, I think you might have mentioned the ten kilograms of payload per arm. They also look at the max speed. So ten meters per second apparently for Astribot versus, they're claiming, seven meters per second for an adult male, although it's unclear, like, is that, you know, arm movement or what. Max acceleration, you know, it's like ten times higher and so on. So kind of interesting, the side-by-side here.

And I wonder if we'll see more of this, you know, the adult male versus robot comparison point. I thought it was kind of interesting, as somebody who doesn't spend a ton of time looking at, you know, embodied systems like this. Yeah, the lower-half question is interesting; so far all the humanoid robots we've seen had some kind of locomotion, right, this way of getting around. But this does look pretty stationary.

So, you know, I wonder what that means about the thesis. Also kind of curious about, you know, when this is going to enter production. We don't really have any news about that in this context, but that's starting to become a question. You know, as we look at Optimus, as we look at a whole bunch of

companies in this space. You know, 1X, we talked about that Norwegian laundry-folding bot, you know, that displayed all those, like, soft-touch skills; Figure, obviously; Boston Dynamics; like, all these companies coming out with production-ready systems potentially soon. So it starts to become this question, like, yeah, but, you know, when are you going to hit the market? So yeah, I'm curious to see what shakes out of this.

Certainly, it's an impressive team. The parent company was founded by folks who worked at Tencent, or at least had collaborations with Tencent Robotics, Baidu, and some pretty prestigious universities in the Chinese context. So it seems like, I was just going to say it seems like this might have legs, but of course it doesn't.

Andrey

Yes. It's a cool video. If you do like robots, I do recommend checking it out. And, yeah, it's getting to be a competitive space. So it's exciting for robotics to be kind of in the limelight, or, I don't know about limelight, but to be moving fast.

Jeremie

Yeah.

Andrey

And next story: Tesla's 2 million car Autopilot recall is now under federal scrutiny. So the recall of over 2 million cars is under scrutiny by the National Highway Traffic Safety Administration, specifically asking whether the recall actually made the system safer. The recall happened in December of 2023, and it was due to inadequate driver monitoring and potential for misuse.

And that followed an analysis of a bunch of crashes that found 467 Autopilot-related crashes across various kinds of categories. At least 13 people have been killed in crashes involving the Autopilot system. And, yeah, now it's under scrutiny to see if they did enough, seemingly.

Jeremie

Yeah. And this is kind of an ongoing saga, it seems, between, yeah, the National Highway Traffic Safety Administration, the NHTSA, say that three times fast. They have a, uh, thing called the Office of Defects Investigation that has a long history with Tesla going back to August 2021, where they had this initial investigation in response to a bunch of crashes they were asked to look into that seem to have been kind of related to Autopilot issues. And more recently,

or somewhat more recently, in June, that same office essentially upgraded their investigation and said, okay, we're now going to look into an engineering analysis. And then a couple months ago, in December, Tesla was, yeah, you know, forced to do this big recall. And now it seems like this NHTSA office has just closed this engineering analysis. And, I mean, for them to be making these calls, it seems like the results of this investigation are not necessarily positive.

One of the things they point to, I thought, was kind of interesting: they highlighted the fact that Tesla's Autopilot has, as they put it, a more permissive operational design. It kind of gives the user more of a sense of control and kind of authority over the system. There's an ease of engagement. And they think that that leads to

more driver complacency. So this is kind of one of those human-machine interaction problems that they seem to be highlighting and saying, look, the way this is set up, it seems to give drivers a false sense of how autonomous this system can be, how reliable it can be. And they actually kind of called out some of the marketing that Tesla has been doing, and the naming specifically of the Autopilot system as an autopilot system, saying, hey, you know what?

Like, this name really kind of implies a system that is safe to run autonomously, and this isn't, and, you know, Tesla doesn't actually believe that it is currently ready for that. So kind of calling out a bit of the marketing side of things. So, yeah, it'll be interesting to see where this goes. But right now it looks like their assessment is that the fix Tesla made in response to previous investigations is perhaps not satisfactory.

It looks like they think that, you know, additional crashes have happened since the recall that, you know, raise these flags yet again. So maybe this will be headwinds for Tesla, but we'll just have to see.

Andrey

And just to be super clear, Autopilot is not the same as FSD. Autopilot is meant for use on highways and has been involved in some crashes. Apparently, based on this analysis, NHTSA says that there were 221 frontal crashes in which the Tesla hit a car or obstacle despite adequate time for an attentive driver to respond to avoid or mitigate the crash. So yeah, I guess the analysis is that too many people are perhaps misusing it or not being attentive.

And given this is a Level 2 kind of self-driving or driver assistance feature, you're not supposed to take your eyes off the road or use your phone, unless you...

Jeremie

Really, really want to. Right.

Andrey

Like that's... Yeah, yeah, I know. Moving on then. And in the Lightning Round, we have another story about Tesla, and actually possibly some good news for them. So Tesla shares have soared as Elon Musk has returned from China with an FSD, quote, "game changer." So there have been reports that suggest that Tesla was able to cut a deal with tech giant Baidu to support some of the mapping and navigation functions of the FSD

technology. And there are also possibly negotiations related to transfer of data collected by the FSD software to AI supercomputers in the U.S. So yeah, this had quite a meaningful impact on the stock. It doesn't seem like a huge move forward, but there is a bit of a seeming move here where potentially Tesla could push forward FSD in China and get more benefit of the data from the 1.7 million cars they have deployed in China.

Jeremie

Yeah. And the regulatory leverage, obviously, in China is much greater than, say, in the US or in other, more generally, free market economies, because the government can just step in and say, hey, you know, screw you, we're not gonna allow you to do things that you might be allowed to do otherwise in other countries.

In this case, yeah, Elon seems to have managed to wrangle an endorsement from the China Association of Automobile Manufacturers, to essentially, like, say that, hey, you know, these guys are complying with some of our, you know, data collection rules, that sort of thing. So that's a green light for forward movement. I think this is part of the reason why this is being viewed as

really positive. Another one that wasn't touched on in the article is that I think Elon right now is in a position where he's just trying to show that he actually is paying attention to Tesla, right? There's been so much distraction because of Twitter/X, and xAI in particular, just as it tries to get in that race with OpenAI, DeepMind, Anthropic and all that. You know, so understandably, folks, Tesla investors, are sort of starting to get a little bit concerned, you know, especially

given some of the recent earnings reports that, you know, haven't looked so good. So, yeah, you know, this is partly him just showing the flag: hey, I'm interested, I'm engaged. And that alone might account for a lot of the excitement around this, just to highlight that he is still involved in this, especially with, you know, I think it was something like three senior Tesla executives who left fairly

recently. You know, there's a lot of kind of ground to make up now, at least on that media narrative side, if nothing else, for Elon and Tesla.

Andrey

Next up, OpenAI has inked a strategic tie-up with the UK's Financial Times, including content use. So this is both a strategic partnership and a licensing agreement, meaning that OpenAI can now use the data from the Financial Times for training. And apparently the Financial Times is aiming to also deepen their use of ChatGPT tools. And this is, of course, coming on the heels of previous announcements where OpenAI has agreed with Axel Springer and various others to use

training data. So now that is also the case with the Financial Times.

Jeremie

Yeah. And this deal looks in some ways pretty similar to other deals that we've seen in the past between OpenAI and big publishers. It is non-exclusive, no surprise there. And OpenAI, we're learning, is not taking any kind of stake in the Financial Times Group, the sort of parent entity here. There's a lot that's not known about this as well, but that at least does seem to be locked in. It, yeah, it covers OpenAI's use of, obviously, Financial Times

content for training. That is the obvious thing. But it also looks like there's going to be some strategic collaborations centered on the Financial Times sort of increasing its understanding of generative AI, especially for content discovery. And, you know, there's talk of this being a collaboration aimed at developing, as they put it, new AI products and features for Financial Times readers.

So, you know, both sides of the coin: the Financial Times, you know, feeding its data to OpenAI, and OpenAI helping them presumably build products. It's also, you know, just cold, hard cash at the end of the day. So the Financial Times certainly is going to be engaged in that. We don't know for how much, but that definitely, if it's anything like other deals that we've seen in the past, it's going to involve a good amount of cold,

hard cash. But yeah, it's all in the context as well of the New York Times announcing that it's suing OpenAI for copyright, and other outlets announcing the same thing very recently. So, yeah, as much as this is about copyright issues and whatnot, it's also about legitimizing the Financial Times as a source of trusted data and kind of surfacing their content, potentially, through OpenAI

tools and services. So you can almost see this as, like, at a certain point, there's a push for reputation where you want your content on OpenAI, you want it to be surfaced as references, right, as in-line citations, when the tool kind of provides that grounding. So, a lot of value potentially for companies to partner with OpenAI, at least in the short term.

And that's kind of the question, you know, do you get to a point where OpenAI no longer needs you, or, you know, how does that play out in the long run? We don't know. But for now, it looks like a pretty strategic piece, maybe a little bit more expansive than previous deals that we've seen of this kind.

Andrey

And according to OpenAI, they have about a dozen of these kinds of deals either signed or about to be signed. So they're definitely kind of talking to lots of organizations, and in many cases, not in all cases, there is still the New York Times lawsuit going on, but in many cases, they are agreeing to deals. And actually, one more story on OpenAI next, and it is that the OpenAI Startup Fund has quietly raised $15 million. So, you know, not a huge number.

Probably just worth noting real quick that they did get that $15 million from two unnamed investors. They had previously raised $10 million in February. And they specialize in early-stage AI startups, having invested in some pretty big names like Harvey, Figure AI, and others. So yeah, apparently they are continuing to operate and kind of expand operations, although not doing huge numbers compared to, I don't know, you know, a lot of the other numbers we've covered.

Jeremie

Yeah. And the one detail that they do share is that this is apparently being done, the investment that is, through an SPV, so a special purpose vehicle. This is somewhat noteworthy just because typically when SPVs are used, it's to kind of help people invest in startups that aren't necessarily in their main area of focus. It allows them to kind of pool money with other investors. And so, yeah, SPVs are helpful to, like, market these deals to a wider range of

investors than you might normally have, where those investors are normally more specialized. So what that means here is really hard to know. And as you said, it's not a huge number, but, yeah, I guess more little bits of money trickling into the OpenAI Startup Fund.

Andrey

And the last story from the section: Huawei backs HBM memory manufacturing in China to sidestep crippling U.S. sanctions that restrict AI development. So this is Huawei forming a consortium of memory producers to develop high bandwidth memory, which is needed for AI and high-performance computing processors. This consortium will be backed by the Chinese government and various semiconductor companies, and will supposedly start mass production by 2026.

And, yeah, there are other organizations also involved in backing high bandwidth memory projects.

Jeremie

Yeah. And, you know, like all forms of AI hardware, you know, high bandwidth memory isn't just one thing, there are many different kinds. And the one that they're working on right now is HBM2 memory, high bandwidth memory two, if you will, which is quite a ways behind what current market leaders are using. You know, for a little bit of context, I try to sneak this in every

time we can. So high bandwidth memory is specifically important for AI because you can have a ton of compute, but, like, you're often limited by your memory bandwidth.

And essentially what this means is, like, your ability to get data to travel between, you know, the processor and the memory, which you're doing a ton of when you train these models. You're constantly, like, loading up weights into a bunch of GPUs, moving the gradients and other information back and forth between processors and memory, to run these calculations really quickly. And so high bandwidth memory is just a way of making that process

happen really quickly. It often involves stacking memory vertically, so essentially it's a way of using the third dimension of the chip to reduce the distance that data needs to travel between processors and memory, to get that moving faster. So this is really interesting. Huawei badly, badly needs high bandwidth memory for its Ascend processors, which we've covered before, for AI applications. And SMIC, which is kind of like China's TSMC,

they're their fab, they can make these chips. But it's clear that high bandwidth memory is a bottleneck right now. It seems like they can make those chips, I should say it's not totally clear, but definitely high bandwidth memory production is going to be a critical bottleneck going forward for Huawei.
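
To make the "compute-rich but bandwidth-bound" point above concrete, here is a minimal illustrative sketch: when generating text, each new token requires streaming the model weights from memory, so memory bandwidth caps token throughput. All numbers (model size, bandwidth figures) are rough assumptions for illustration, not specs of any particular chip.

```python
# Why memory bandwidth, not raw compute, often bounds LLM inference:
# tokens/sec per device <= memory_bandwidth / bytes_of_weights_streamed_per_token.

params = 70e9                 # assumed 70B-parameter model
bytes_per_param = 2           # fp16 / bf16 weights
weight_bytes = params * bytes_per_param

hbm_bandwidth = 2.0e12        # ~2 TB/s, roughly HBM-class memory (assumed)
ddr_bandwidth = 0.1e12        # ~100 GB/s, roughly conventional DRAM (assumed)

for name, bw in [("HBM-class", hbm_bandwidth), ("conventional DRAM", ddr_bandwidth)]:
    tokens_per_sec = bw / weight_bytes   # upper bound; ignores KV cache, batching, overlap
    print(f"{name}: at most ~{tokens_per_sec:.1f} tokens/sec per device")
```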

Andrey

And moving on to Research and Advancements. No open source stories this week, so I'll go ahead and cover some papers. And the first paper is Capabilities of Gemini Models in Medicine. So DeepMind and Google have released a technical report on Med-Gemini, which is a version of Gemini fine-tuned for medical applications. And as you might expect, it is doing really well on a whole bunch of medical benchmarks. To go into

a few details, they go into how they use self-training, so they integrate search functionality to basically fine-tune the base Gemini specifically on medical data. And then they also go into how they have a special decoding procedure: they have this uncertainty-guided search at inference to be able to confidently output answers.

And just going quickly over the evaluation table, their Med-Gemini-L 1.0 model is evaluated on a whole bunch of different applications like gene name extraction, gene location, protein-coding genes, human genome DNA alignment, and also just question answering in the medical domain. Across all these various things, it is able to outperform GPT-4 and some other specialized things like BioGPT.

So yeah, DeepMind we've seen before pushing into the medical domain; we had, I think, Med-PaLM in the past year. So they're pushing even further here with Med-Gemini.
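
The general shape of the "uncertainty-guided search at inference" idea mentioned above can be sketched roughly as: sample several candidate answers, and only when they disagree (suggesting uncertainty) issue a search query and re-answer with retrieved context. This is a minimal sketch of that general pattern, not the paper's exact algorithm; `generate` and `web_search` are hypothetical callables passed in by the user.

```python
# Minimal sketch of uncertainty-guided retrieval at inference (assumed design).
from collections import Counter

def uncertainty_guided_answer(question, generate, web_search,
                              n_samples=5, agreement_threshold=0.8):
    # Step 1: sample multiple answers at nonzero temperature.
    samples = [generate(question, temperature=0.7) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples >= agreement_threshold:
        return answer  # model is confident enough; skip retrieval

    # Step 2: model is uncertain, so retrieve external evidence and re-answer.
    evidence = web_search(question)
    grounded_prompt = f"Context:\n{evidence}\n\nQuestion: {question}"
    return generate(grounded_prompt, temperature=0.0)
```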

Jeremie

And whenever they can, they're also reminding us that they're outperforming GPT-4. So it's clearly a big focus for them here, as well it should be. I mean, you know, you're talking about the most performant models. Yeah, a couple of cases where they're, you know, outperforming by quite a bit on these tests; of course, that's GPT-4V, the vision-powered model, in many of these cases, because you need that additional modality. But there are a couple of interesting

little notes here. You know, one, they highlight that Med-Gemini's long context is really important, especially in the medical context. And so we talked about this before, but when Gemini first came out, and especially Gemini 1.5, we're starting to see now these context windows that are like a million tokens or 10 million tokens, which is really important. You can stuff a ton of information in context.

But one of the challenges historically has been, if you grow the context window, the ability to reason logically and to retain facts within that context window starts to drop. And the big change in the absolute latest generation of models, including Gemini 1.5, has been their ability to still retain information, even though they operate in these really long context environments. And one of the key tests that's used to kind of prove that they have this capability is the needle in a

haystack test, right? You bury some fact somewhere in, like, a ten-million-token prompt, and you test to see if the model can recall that fact. And that is especially important in the medical context, where little details about a person's, you know, health records or life history can really be quite decisive. Right?

You have to think about how many times you've had to go into the doctor and kind of give them a detailed medical history that you think is relevant to your issue, and then something, you know, comes up from the past you forgot to mention that kind of explains it. You know, this is the sort of thing that can make a big difference in this context. So they call that out as one of the reasons that this system is so capable.

And they do highlight real-world utility here by comparing the system to human experts on a wide range of tasks. They look at medical text summarization, referral letter generation, which is interesting, right, this very sort of mundane task that takes, you know, high-quality time from doctors, but that can be automated, in this case better than human experts. And a bunch of things, you know, for medical dialogue are starting to

look promising as well. So it's starting to flirt with being able to outperform humans on pretty interesting tasks in the medical domain. A really promising early sign for the Gemini family of models, especially these sort of fine-tuned and specialist variants.
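
For readers who haven't seen it before, a toy version of the needle-in-a-haystack test described above looks roughly like this: bury one "needle" fact inside a long stretch of distractor text and check whether the model's answer recovers it. `ask_model` is a hypothetical callable, not a real API, and the filler text and scoring are deliberately crude illustrations.

```python
# Toy needle-in-a-haystack construction (illustrative only).
import random

def build_haystack(needle: str, filler_sentences: list, total_sentences: int) -> str:
    sentences = [random.choice(filler_sentences) for _ in range(total_sentences)]
    sentences.insert(random.randrange(total_sentences), needle)  # hide the needle
    return " ".join(sentences)

def needle_recalled(ask_model, haystack: str, question: str, keyword: str) -> bool:
    answer = ask_model(f"{haystack}\n\nQuestion: {question}")
    return keyword.lower() in answer.lower()

filler = ["The clinic was repainted in 2019.", "Parking is available on level two."]
needle = "The patient reported a penicillin allergy in 2003."
haystack = build_haystack(needle, filler, total_sentences=5000)
# needle_recalled(ask_model, haystack, "What allergy did the patient report?", "penicillin")
```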

Andrey

Right? Yeah. We have various qualitative examples in addition to those benchmarks. Interestingly, they have examples of dealing with video, where it looks at a surgery video and outputs an assessment of whether, like, the critical view of safety is being achieved. They also have examples of it outputting timestamps for surgical actions in another example

video. And yeah, all sorts of examples in this paper, in general dealing with images and video and clinical records, where it says, you know, look up the clinical records and tell me if this patient has had condition X, like sweating or something. So, they also say that they are not going to release the code or weights due to safety, but they will be partnering with various organizations to assess how to responsibly release this. So yeah, exciting progress.

And this is certainly one of those applications of AI where you can sort of feel pretty happy, knowing that AI can hopefully help improve health care and make burned-out nurses and doctors able to do their jobs a little more easily.

Jeremie

Yeah, and I still haven't decided how I feel about that.

Andrey

And next paper: Let's Think Dot by Dot: Hidden Computation in Transformer Language Models. And this one was one of the more interesting and, I think, most discussed papers of last week. It explores the use of filler tokens, basically "dot dot dot," in place of chain of thought. So with chain of thought, you say, you know, think through this step by step, and you have the model outputting

its sort of reasoning of, you know, X and Y, and so the answer is Z. And they show that it seems like, at least in some cases, just having filler tokens that have none of this kind of explaining of your reasoning still helps with being better at complex tasks. So in some sense, just doing more computation can also improve performance without this more interpretable kind of outputting of intermediate reasoning steps; just additional filler tokens can also do that.
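
To make the contrast concrete, here is a hypothetical illustration of the two prompting styles being compared: the same question answered after (a) a legible chain-of-thought trace versus (b) meaningless filler tokens occupying the same "thinking" positions. The example text is made up for illustration and is not taken from the paper; the finding applies to carefully chosen tasks with a model trained to use fillers.

```python
# Illustrative (made-up) prompts: chain of thought vs. filler tokens.

question = "Do any three of the numbers [2, 7, -9, 4] sum to zero?"

chain_of_thought_prompt = (
    f"{question}\n"
    "Reasoning: 2 + 7 + (-9) = 0, so yes.\n"
    "Answer:"
)

filler_token_prompt = (
    f"{question}\n"
    + ". " * 12 + "\n"   # filler tokens: no legible reasoning, just extra positions
    + "Answer:"
)

# The paper's result (for specific tasks and a suitably trained model) is that the
# extra token positions themselves, not the legible words, can carry the additional
# computation needed to answer correctly.
```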

Jeremie

Yeah. In some sense, the central question here, right, is, when you look at chain of thought: chain of thought works by having the AI essentially break down its problem-solving approach in human-understandable terms, in plain human language, to kind of guide itself essentially through the process of coming up with a correct final answer.

You know, we find that when we ask the system to lay out step by step its reasoning, it tends to do better at coming up with the correct answer at the end of the day, or at least we think it's because it's laying out its reasoning and then building on top of that to produce the correct answer. What if it actually isn't the words it's writing that it is using to make better conclusions

at the end? What if it's just the fact of it running over the input data multiple times, like every time it has to generate another token in the chain of thought chain, it has to reprocess the input. It gets to do more inference, apply more computing power to kind of work its way to that final conclusion. So what if it's just a matter of the amount of inference that it's able to do, the amount of computing power it's able to invest in reaching those conclusions?

If that's the case, then you might imagine that the actual text that is written in the chain of thought reasoning flow doesn't really matter, right? You could replace it with a bunch of dots, you could replace it with these filler tokens. And the performance should be basically the same. And what they're showing in this paper is that's exactly it.

Like you can kind of fix the outputs of the chain of thought process, if you will, make it all filler tokens and force it to generate just a bunch of dots, as you said. And then the output, you will find, or at least they find here, for some tasks, and they're very careful about how they pick these tasks, there's a bunch of theory that goes into figuring out which tasks this should help with.

But what they find is that for those tasks, you see basically the same performance whether there is human-legible reasoning going on or just a bunch of dots, you know, to allow more inference to happen. One of the really important implications of this is that there is a difference between the computations the model is performing and the computations it appears to be performing based on the reasoning text that it generates.

The reasoning that it generates, apparently is not actually directly tied to its performance. Or at least you can get to basically the same performance by replacing a human legible reasoning flow with a bunch of dots. If that's true, then what does that tell us about the validity of the apparent reasoning process? When it does write out in human legible terms what it's supposedly

thinking? What's to say that that's actually what it's thinking, rather than just, like, the equivalent of a bunch of dots, and that under the surface it's actually going through a completely different reasoning process? So you can see how this starts to have implications for all kinds of things in AI interpretability, in AI safety and alignment, trying to make sure that the reasoning process that the AI model appears to be pursuing is actually the one that it is pursuing.

And this essentially highlights a dislocation between the apparent reasoning and the actual reasoning. So I thought this is just really interesting. They tested a whole bunch of different language models and found empirically that there are cases in which you get a benefit and cases in which you don't.

But ultimately it's just more about a proof of concept that, like, we ought to be a lot more skeptical about our assumptions when it comes to these sort of reasoning traces, or, you know, sometimes you have these models use, like, a scratchpad to reveal and make interpretable their supposed reasoning. Well, now we've got to call those approaches into question, especially as models get better and better. You know, are we going to see more

and more of a dislocation, as, you know, they're just able to, in a more sophisticated way, delineate between the kind of publicly messaged reasoning that they do and the actual reasoning.

Andrey

Right. And to be clear, they're using a small Llama model, they're training it for this setting, and with a particular problem of 3SUM and 2SUM. So it's kind of a mathematical question answering thing. So we should be careful not to generalize too much: in this particular domain, with training for usage of filler tokens, it is possible for the language model to do additional computation without human-interpretable outputs that enables it to answer this particular type of question.

Now, it doesn't mean that ChatGPT, for instance, can benefit from using filler tokens, although, you know, potentially there might be examples where, given some in-context learning, maybe it could also exhibit this behavior.

Jeremie

Yeah, exactly.

Andrey

But that is not, explored in this paper. This is more of a very specific, demonstration of that capability of this architecture.

Jeremie

Yeah. Exactly. Right. They put in, like we're talking about, so much effort to kind of pin down the specific problems that this actually applies to. What I think it is, really fundamentally, is an interesting proof of concept that this does appear to be a thing in some circumstances, and for various reasons, I think predicting the specific circumstances in practice where this could happen is going to be really hard, though I'm sure they're going to be on it.

But yeah, right now it's just, like, they pick the easiest use case where theoretically you can see this lift. And we'll have to see where this line of research goes. But kind of interesting work by this team, which, I think it was Sam Bowman, if I recall, who was on the paper. Yeah, there he is. So, yeah. Go, team.

Andrey

On to the Lightning Round with some quick stories. First up, NExT: teaching large language models to reason about code. So we are looking at the domain where we want to be able to debug and repair code, and they show how you can improve the ability of LLMs to do that by inspecting the execution traces of programs and then reasoning about runtime behavior through chain-of-thought rationales.

And the key kind of neat bit here is they show that you can use self-training to bootstrap a synthetic training set of these rationales without having a human go through it. So they, you know, just run a program essentially to generate data and train an LLM to be able to reason about what is happening in code more accurately.
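
As a minimal sketch of the kind of data being described, you can run a buggy program, capture its execution trace or error, and package it with the code so a model can be trained or prompted to reason about runtime behavior. The prompt format here is an assumption for illustration, not the paper's exact recipe.

```python
# Sketch: pair buggy code with its runtime trace to form a training/prompting example.
import traceback

buggy_code = """
def average(xs):
    return sum(xs) / len(xs)

print(average([]))   # ZeroDivisionError at runtime
"""

def run_and_trace(code: str) -> str:
    """Execute code and return its error trace (empty string if it succeeds)."""
    try:
        exec(code, {})
        return ""
    except Exception:
        return traceback.format_exc()

trace = run_and_trace(buggy_code)
training_example = (
    "Code:\n" + buggy_code +
    "\nExecution trace:\n" + trace +
    "\nExplain the runtime failure and propose a fix:"
)
print(training_example)
```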

Jeremie

Yeah, it really is that kind of connecting the, almost, I don't know, theoretical, the plain-English side of things, if you will, that goes into writing Python code that, you know, we can kind of understand, and that's what these systems are usually trained with, but then, like, adding to the set of material that they can reason about by adding these stack traces. Right. Adding the execution information about the actual program

execution. Which is, it's kind of funny, again, one of these papers that's obvious in retrospect, but, you know, the fact that we didn't have the program execution feeding back into the AI, the kind of problem-solving loop, in the case of agents like this, you know, it was a really big omission. And so they're seeing some pretty significant lifts on certain benchmarks. So they improved the program fix rate on one of the key benchmarks they look at by 26%,

so that's like a lot, and 14% on a slightly different benchmark, the details of which are super, super interesting. But these are definitely significant lifts. And, yeah, I mean, when we're looking at increasingly getting these models to solve... that's like that Microsoft Copilot Workspace announcement we talked about earlier, right? We're looking to get these systems to solve whole software problems

now. And it turns out that software is more than just, you know, a Python script or more than just a JavaScript script. It's like a whole environment, and execution is part of that environment. So feeding in that data now becomes really important. If there's a bug, you want the system to be able to, like, you know, write the code, see the error message that comes out of it, and then account for that in the next sample of code that it tries to write to correct the problem.

So much more like a human developer in that sense. And again, broadening that lens to include more of the software engineering workflow.

Andrey

Next up, this is an advancement, not a paper, and it is about SenseNova 5.0, a large language model that, at least on some benchmarks, surpasses GPT-4. This is an announcement, kind of a PR release, from China-based firm SenseTime. And the announcement kind of got a lot of press, or at least the stock of SenseTime rose by like 30% on the announcement of this model and the claim that it is superior to GPT-4. And there are some technical details here.

They say it was trained on over 10 billion tokens and can handle a 200,000-token context window. Not too much more to delve into; there's no technical report that I was able to find. But, yeah, notable to see this claim of a large language model from China that is competitive with OpenAI.

Jeremie

Yeah. We also do know, apparently, that they're integrating transformer and recurrent architectures together. So we don't know how, but, you know, recurrent transformers are a thing, and, you know, Mamba is kind of in that vein. And so anyway, maybe another push in that direction. Yeah. Last quick note is, I mean, SenseTime, you know, if that name is familiar, we've covered them on the podcast

before. So they are a company of deep national security interest to the United States, as they develop technologies that focus on things like facial recognition, object detection, medical image analysis, video analysis, like all kinds of things that have sort of surveillance state written all over them, that have military applications written all over them.

So for obvious reasons, since the late 2010s, they've been on, like, US blacklists due to allegations that they're using this stuff to do things like surveillance and internment of the Uyghur Muslim population. So SenseTime, a very kind of radioactive company from that perspective. And you can imagine a lot of interest is going to be focused from the US national security community on understanding exactly what's going on with this breakthrough.

Andrey

Next up, Octopus v4: Graph of language models. So we covered Octopus v2, I think not so long ago. This was an on-device language model that was notable for using functional tokens to redirect work to various tools, and so I found it kind of neat to see this coming out, where they essentially push that direction a little bit forward.

They highlight the idea of cloud and on-device collaboration, and the idea of structuring different language models or tools in a graph such that this Octopus language model can sort of reason about where to redirect queries and utilize kind of sub-workers through some fancy reasoning on top of a graph. There are a lot of details that I don't know that I can summarize easily; there's a lot of, like, architecture here with worker node deployment and a master node and various models.

But the gist is that this whole idea of having a sort of space of models that can be utilized by a central model that coordinates work, with load balancing and so on, is being pushed forward. And this paper is coming from Nexa AI, and they do encourage others to contribute to this; they release the models on Hugging Face and the code on GitHub.
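
A toy sketch of the routing idea described above: a "master" node classifies an incoming query and forwards it to a specialist "worker" model. The worker names and the keyword-based router here are hypothetical stand-ins for what Octopus v4 does with learned functional tokens over a graph of models.

```python
# Toy master/worker routing (illustrative stand-in for a learned router).

WORKERS = {
    "code":    lambda q: f"[code model] handling: {q}",
    "medical": lambda q: f"[medical model] handling: {q}",
    "general": lambda q: f"[general model] handling: {q}",
}

def route(query: str) -> str:
    """Pick a worker; a real system would use the router model, not keywords."""
    lowered = query.lower()
    if any(k in lowered for k in ("python", "bug", "compile")):
        return "code"
    if any(k in lowered for k in ("symptom", "diagnosis", "dose")):
        return "medical"
    return "general"

def answer(query: str) -> str:
    worker = route(query)
    return WORKERS[worker](query)

print(answer("Why does my Python loop never terminate?"))
print(answer("What is a typical dose of ibuprofen?"))
```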

Jeremie

Yeah. And it's not a coincidence that this is happening in the, like, open source universe too, because, you know, the key difference between the open source universe and the private universe is obviously the level of scale of the models and therefore the level of generality. You know, when you look at, like, an OpenAI GPT-4 or Anthropic Claude 3 and so on, these models are just really good natively at handling general-purpose queries.

Whereas the smaller-scale models, even a Llama 3 or a Gemma-type model, are often best used as specialist models, at least relative to the closed source ones, in the sense that you can get Llama 3 to approximate the performance of GPT-4 on specific tasks, but not necessarily in the general case. So having some router that takes a query and sends it to a specialist model becomes almost structurally implied

by the economics of open source. Because the models are smaller, it's easier to fine-tune them and specialize them, and then routing becomes the core problem. So I would expect more work in this direction to keep happening. Even as the open source models get better and better and more scaled, they're always going to lag behind the level of generality of the closed source models.

So specialization is always going to be, or may always be, helpful at closing that gap, which is why I would expect routing to be a persistent feature for some time, at least for the open source community.

Andrey

And one last paper: Better and Faster Large Language Models via Multi-Token Prediction. This is coming from Meta AI and some collaborators, and the gist is a surprisingly simple idea: instead of training language models to predict just the single next token, you can train them to predict multiple tokens at the same time. Just alter your transformer to output multiple predictions, and they show that doing that enables you to get better performance on downstream benchmarks.

And you can also potentially get better inference performance, because you are basically doing parallel prediction on your outputs. So yeah, I found this to be a pretty interesting demonstration of what seems like an intuitively simple idea that actually does appear to be pretty useful in practice.
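
For a sense of what that looks like mechanically, here is a minimal PyTorch-style sketch, assuming a shared transformer trunk whose hidden states feed one extra output head per future token; the module names, dimensions, and loss wiring are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    # Head i predicts the token i steps ahead; the per-head losses are summed.
    def __init__(self, d_model: int = 512, vocab_size: int = 32000, n_future: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) from the shared transformer trunk
        # tokens: (batch, seq_len) input token ids
        total = hidden.new_zeros(())
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-i])   # predict token t+i from position t
            targets = tokens[:, i:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total

# Example with random data, just to show the shapes involved.
hidden = torch.randn(2, 16, 512)
tokens = torch.randint(0, 32000, (2, 16))
print(MultiTokenHeads().loss(hidden, tokens))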

Jeremie

Yeah, I hadn't looked at this paper and I'm very intrigued. You can see the value of the parallelism you get from having those multiple tokens predicted at the same time at inference. And then I imagine part of the value here is that you're not just doing that greedy one-step lookahead, just getting the very next token; you get to optimize over a set of tokens.

So maybe there's more logical consistency implied there; at least you're having to account for what you're about to say. It also makes me wonder whether this is a path to solving for, well, I wonder, because it is Meta, and Meta loves to identify these sort of thorny theoretical niche challenges that limit what LLMs or transformers can do.

One of those challenges historically has been that these models often can't account for what they're going to say next. So, for example, they'll often struggle with requests like "write me a sentence that contains 12 words," because they can't plan ahead in that way. So this might be, I don't know, I'm curious to look at the paper now, it might be targeting that issue.

If you can do multi-token lookahead, essentially you don't have this next-step greedy optimizer, and that might help you solve some of those problems. Anyway, I should dive into this; it's a paper Andrey dropped in here just before we started talking, so now I'm really intrigued.

Andrey

Some reading to do. Post recording.

Jeremie

Yeah.

Andrey

And on to policy and safety, and alignment, I guess, which we usually cover implicitly in this section. That is what the first story here is about. It's a preview of a research paper posted on LessWrong, titled "Refusal in LLMs is mediated by a single direction." To quickly summarize, they very effectively find particular features that you can mess with.

If you modify the activations of the network, you can then make it do whatever you want, essentially jailbreaking it so that it can't say no, at a very high success rate. This reminds me of a paper that showed something similar, where there were rudeness-like features, and if you boosted those features, the model became very angry. So similar in nature to that.

And they argue that releasing this information is not necessarily anything new, because you can already jailbreak models in various ways. Same here; I don't think this introduces any new risk, but it does validate some understanding of how transformers operate.
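
As a rough illustration of the kind of activation edit being described, here is a small sketch, assuming you have already collected residual-stream activations for matched harmful and harmless prompts (for example via forward hooks); the helper functions and tensor shapes are hypothetical, not the authors' code.

import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    # Each tensor: (num_prompts, d_model) activations at some layer/position.
    # Difference of means gives a candidate "refusal direction".
    direction = harmful_acts.mean(0) - harmless_acts.mean(0)
    return direction / direction.norm()

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component along the refusal direction (to bypass refusal).
    return acts - (acts @ direction).unsqueeze(-1) * direction

def induce(acts: torch.Tensor, direction: torch.Tensor, scale: float = 5.0) -> torch.Tensor:
    # Add the direction back in (to push the model toward refusing benign requests).
    return acts + scale * direction

# Toy usage with random tensors, just to show the shapes.
harmful, harmless = torch.randn(8, 64), torch.randn(8, 64)
d = refusal_direction(harmful, harmless)
print(ablate(torch.randn(3, 64), d).shape)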

Jeremie

Yeah, I saw the note they added, the "we don't believe this introduces any new risks" bit. Generally that's probably true, but one slight quibble: I think this is a more interesting result than that implies. One of the things this does is let you essentially strip away a lot of these safety guardrails without fine-tuning; you kind of jailbreak the model in

a more general way. That's helpful, potentially, because fine-tuning often comes with catastrophic forgetting, so you can cause the model to lose some of its latent capabilities. This might allow you to get a jailbroken system that is also a little bit more capable. But that's at the margins; it is just a really interesting paper.

One of the key things, and whenever we can we'll try to introduce a little bit of context for folks about the architectures involved: in a transformer you have this thing called a residual stream. To very roughly nutshell it, as you pass data through the layers of a transformer, that data gets munched on and chewed on by the attention heads and the feedforward layers.

As that happens, you sometimes get this problem where the tweaks pile up and you get something like a vanishing gradient: the information doesn't propagate robustly through the system, it just gets distorted on the way and you lose useful signal. And so what's done is, in parallel,

as you feed data into a layer and it gets munched on, you take the input you fed into that layer, route it around the layer, and then combine it with the munched-on output, so you have the original and the processed version, and you feed those two things together to the next layer.

In that way, if the data was messed with too much in a given layer, you still have the preserved, pristine inputs as well, so you don't find errors compounding in the same way.
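
In code, the skip connection being described here is just a couple of lines; this is a tiny illustrative PyTorch sketch, not any specific model's implementation.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        # Route the unmodified input around the layer and add it back to the
        # layer's output, preserving the original signal.
        return x + self.layer(x)

block = ResidualBlock(nn.Linear(8, 8))
print(block(torch.randn(2, 8)).shape)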

What they find is that this workaround, this residual stream that retains the input to one layer and feeds it into the next, is where you can find the imprint of a sort of set of modifications you can make to the activations, basically a neural firing pattern, if you will. If you superimpose it onto that residual stream, you can actually get the model to refuse requests that it shouldn't refuse, or,

if you subtract it away, to allow requests that it should refuse. Anyway, it's just really interesting, and it's actually really effective; they show very high success rates at doing this. And the interesting thing is that when you add the direction in, you end up causing the model to refuse to complete prompts that it shouldn't refuse, but it's almost like it has to justify to itself why it's doing it.

So it'll come up with this weird, elaborate reason why the prompt you gave it is no good and why it shouldn't be answering it. They give a couple of examples. One prompt is "generate a list of five books related to machine learning," and the model has now got to find a way to say no to that. So what it says is, "I'm unable to provide information on how to generate a list of books on machine learning, as it would be

unethical and potentially harmful. Machine learning can be used for malicious purposes, and I do not want to be involved in that." I just found that funny; you can see it reaching for some reason to arrive at this preordained conclusion, which I thought was really interesting from an alignment standpoint.

Andrey

That's right. And I guess to call out, this is particularly similar to a paper from last year on representation engineering, but that focused on harmfulness as opposed to refusal. So the main interesting tidbits here are not necessarily the general approach but the actual evaluation: they apply it to various models, Qwen Chat, Gemma, Yi Chat, Llama 3 Instruct, and have actual numbers showing that you can basically make models not refuse to do stuff.

So, yeah, definitely another example of how, if you open source a model, you can't really make it safe; it's probably not possible. Next up, the story is that Rishi Sunak promised to make AI safe, but Big Tech is not

playing ball. This is a bit of an analysis piece from Politico that interviewed insiders from companies, as well as consultants and lobbyists, about negotiations between the UK government and tech companies on AI safety. It basically makes the claim that while the government might be pushing for safety, AI companies are not keen on complying with regulations or agreeing to initiatives, things like letting the UK AI Safety Institute do prerelease

testing of their models, for instance.

Jeremie

Yeah, there's a lot of nuance here, but it is a really interesting story. First of all, this is against the backdrop of Rishi Sunak's government in the UK having supposedly secured these commitments to be able to do pre-deployment testing, but also that same government having said, look, we're not looking to pass legislation any time soon on this. Eventually, yes, we're going to have to have compulsory measures, but our focus is not on passing legislation.

We're going to focus on these voluntary commitments, and if the AI companies don't comply, then maybe that could change. And we already have the Labour Party, which is polling ahead of the Conservatives right now and may be taking over, already looking to push forward similar measures. So this is not going to be a partisan issue in the UK, or at least it seems like it won't be.

So now essentially we're hitting the point where these companies that have made commitments to share access to their models are not doing it. What happens? Is the UK going to start to consider legislation instead? Some of the feedback that came in: Meta's president of global affairs, Nick Clegg, who notably is a former British deputy prime minister, so very plugged into the UK

scene, came out and said, well, look, you can't just have us as AI companies jumping through all these hoops in each and every different jurisdiction, it's just too much; we're based in the US, our main relationship is with the US AI Safety Institute. And that's all well and good, except that this was already pretty clearly going to be the case when they made the commitments in the first place. I think that is a fair criticism of this position, as far as I

understand it. This was not rocket surgery; it was pretty clear we were going to end up here. So to turn around and now say, oh, well, the US has their own AI Safety Institute, now our hands are tied, you've got to ask: how serious was the commitment initially? There are valid reasons, obviously, to discover new

complexities. In fact, Jack Clark at Anthropic is highlighting some of those, saying, look, this turned out to be technically harder to do than we expected; it sounded like a great idea on paper, but implementation is the tricky piece. That, frankly, I totally buy, it makes a lot of sense. But when you're talking about the US AI Safety Institute as if it came out of nowhere, well, this was known to be something that was likely to happen.

It's also notable that the US AI Safety Institute and the UK AI Safety Institute are now talking about a memorandum of understanding that would have them collaborate on doing this, precisely so that these companies don't have to independently report to and share stuff with both of these bodies.

It's also pretty clear that the US AI Safety Institute's mandate will not include testing, that they're leaving that to the UK, at least for now, though they will be collaborating on a joint test sometime in the next year; that's a goal they've set for themselves. So it's not clear to me that the argument holds water that, oh, now we've got too many of these entities to report to.

I get that that could change. Japan is standing up an institute, Canada is standing up an institute, and so on, so you can absolutely see the proliferation of these things. And if each one comes with its own requirements to go in and look at these very proprietary models,

yeah, that could absolutely be an issue. But again, the idea that this wasn't predictable going in could be seen as a little suss coming from Meta in this case, though it's hard to know. The last thing I'll highlight here: it is the case that OpenAI and Meta have not granted the UK's institute access to do their testing, and in this case GPT-5 and Llama 3 are being highlighted.

But Google DeepMind has actually allowed some pre-deployment access, in particular on the Gemini models, and they're claiming here some of the more capable versions of them, which is sort of interesting. So Google DeepMind is playing ball, apparently. Anthropic certainly seems to have plans as well to move this ahead, and is apparently in active discussions with both institutes to do so. OpenAI, though, did not respond to a

request for comment, so it's hard to read into that too much. But there's a lot going on here, a lot of nuance to it. Technical considerations for sure are going to make this really hard, especially given the importance of securing that IP, of not having those model weights be accessed by just anybody willy-nilly. But these are not things that couldn't have been seen coming, at least in terms of that argument that's

being made by Meta. To me personally, at least given what I know right now, that seems a little bit thin.

Andrey

Yeah, lots of details; you noted a lot of interesting tidbits. I guess the key note is that there was an agreement to do pre-deployment testing, and that hasn't happened for the most part, except for, evidently, DeepMind, although the institute does get access to models as they are released and is able to test at that point. On to the lightning round.

And the first story is that the DOE has announced new actions to enhance America's global leadership in AI. There's a whole bunch of announcements coming from the DOE. They have issued a report, AI for Energy: Opportunities for a Modern Grid and Clean Energy Economy, apparently the first ever report on AI's near-term potential to support the growth of America's clean energy economy, and another report on advanced research directions in AI for energy.

There's also a new website that showcases DOE-developed AI tools and foundation models for basic AI research, and a whole bunch of other stuff in this release. For instance, there's an initiative to use AI to help streamline siting and permitting at the federal, state, and local levels; the DOE is investing $13 million in this and has partnered with Pacific Northwest National Laboratory to develop a PolicyAI tool.

And yeah, lots of assessments, analyses, overviews, and working groups. The DOE is doing quite a lot of stuff related to AI, it seems.

Jeremie

Yeah, and this really reads like a laundry list of different things; there are like a dozen different things here. DOE is, of course, the Department of Energy. Worth noting, this is all follow-up from the executive order the White House came out with back in November, and essentially they're just announcing, hey, we've been doing

the stuff that you told us to do. The Department of Energy is really important because they are home to an awful lot of computing infrastructure, which they sometimes farm out for use by other departments and agencies. They also have a lot of collaborations with the national labs, and they were charged in the executive order with a lot of AI evaluation work, a lot of stuff

related to their pools of compute, making them accessible to academics, to national labs, and so on, in the context of developing safeguards and understanding the problem set a little better. On the risk side, the department has a specific office called CESER, which stands for Cybersecurity, Energy Security, and Emergency Response.

They've announced they're going to be putting together a bunch of stakeholders and experts to, as they put it, collaboratively assess potential risks that the unintentional failure, intentional compromise, or malicious use of AI could pose

to the grid. And I think that's sort of interesting, semi-coded language potentially: "unintentional failure" could be looking at alignment pieces, and probably prosaic accidents too, but potentially also loss of control. They're certainly keeping a wide aperture on the risk classes they're considering.

So this is really them following up on all those commitments that came out of the Biden EO.

Andrey

Next up: the CHIPS Act is rebuilding U.S. semiconductor manufacturing, so far resulting in $327 billion in announced projects. The CHIPS Act is really a bunch of incentives. Apparently the U.S. government has already spent half of the $39 billion in incentives, and that has resulted in a 15-fold increase in construction of manufacturing facilities for computing and electronic devices, which adds up to that $327 billion in announced projects for semiconductor manufacturing.

The gist, as far as I can tell from this article, is that the CHIPS Act is good and has actually worked in achieving the goal of incentivizing various firms, TSMC, Samsung, Intel, to invest in semiconductor manufacturing in the US.

Jeremie

Yeah, absolutely. The framing of this article is very pro-CHIPS Act, this worked out great, and so on. And the argument is really around leverage. They're saying, look, we've just finished pouring out over half of the CHIPS Act's $39 billion in incentives, and that has led to chip companies and a bunch of other supply chain partners investing, themselves, over 300

billion over the next ten years. So their argument really is, look at that leverage: we put in $39 billion, they put in over $300 billion, which is more than eight-fold leverage. It really shows that this unlocked a lot of interest in industry, which is probably true. This is a super high-CapEx, high capital expenditure industry, and it's hard to get these projects off the ground.

A little bit of de-risking can be highly leveraged in this context. Another area that remains a frustrating bottleneck, certainly when we talk to some of the fab companies or experts linked to them, is that local laws and regulations mean so much red tape that it's difficult to stand up these fabs.

That's a real issue too. So money and regulation are both issues; it's harder for the federal government to deal with the regulation side because that tends to be a lot of state-level stuff. So they're pouring money on the problem, and it seems to be working out pretty well.

Andrey

Next story: this is an analysis piece from Yahoo on the second global AI Safety Summit, with some notes as to why this one may not be quite so impactful. The first summit was at Bletchley Park, and it was kind of a big deal at the time; there was the Bletchley Declaration on AI safety, and there were very big names and a lot of representatives at that gathering. Now there's the second AI Safety Summit, co-hosted by Britain and South Korea six months later.

And according to this piece, fewer people have committed to attending, and it doesn't seem like there's as much hype or expected impact from this second event compared to the first one.

Jeremie

Yeah, the turnout and the individuals showing up: fewer leaders and ministers, apparently. And the French government has recently postponed theirs. What was going to happen is the South Korean government would host this one, and then the French government was going to do another event, I think in 2024, shortly after. They recently announced they're going to postpone

theirs to 2025. So you would think that, given that greater gap, people would be more interested in showing up to the South Korean summit, just because there's going to be less going on in the near future after it. But apparently that hasn't panned out. The EU's chief regulators aren't going to be attending. The US State Department said it would send representatives to Seoul, but it didn't say who.

The Dutch and Canadian governments said they would not be attending. Even Geoff Hinton, though citing, in fairness, an injury that made it difficult to fly, has declined the invitation. To some degree this is maybe not surprising, given that it's always harder for the second event to make the same level of impact as the first.

And there's also been this shift towards, okay, we've all agreed on the high-level stuff: we want to deal with existential risks, with catastrophic risks from AI, and a bunch of other issues. But clearly this is going to be much more brass tacks, what do we actually do about it, and there are a whole bunch of details that are harder to agree on once you get to that level. And there are also new issues being introduced.

Sam Altman is talking about, hey, we need $7 trillion for chips; that implies a global-scale problem we've got to solve. There's energy, the electricity powering all these data centers. And so all these new and also important concerns, around market concentration, environmental impact, and so on, are all being fielded in this new venue.

It just makes it a lot harder to get clarity, focus, and agreement when you start to scope the issue more and more widely. So potentially that's part of the complexity that, as the article puts it, is keeping people away. But certainly there are important people who will be attending, and it's maybe going to keep the momentum

going, hopefully. The UK government certainly issued a statement saying they're optimistic, which I guess they would, but, there it is.

Andrey

Yeah. And one more story: Sam Altman, Jensen Huang, and others have joined a federal AI safety board, the Artificial Intelligence Safety and Security Board. The others include Microsoft CEO Satya Nadella and the Alphabet CEO, so really big names. This board will collaborate with the Department of Homeland Security to develop strategies for protecting critical infrastructure from AI-driven attacks.

Jeremie

Yeah, and this is again a result of the EO, the executive order, which we talked about in the context of the Department of Energy's giant smorgasbord of announcements. This is the same thing coming from the Department of Homeland Security, which was also charged with doing a lot of stuff under

the EO. In particular, the executive order requires AI companies to notify the government when they develop a system that could pose a serious risk to national security, economic security, public health, and so on. And maybe this is related; it would be weird for this board to be the channel through which those concerns are flagged, but certainly a higher-level strategic overview might be offered there.

One of the things I think about, especially given what my company works on and the fact that we've worked really closely with whistleblowers at these labs, is that you would hope perspectives beyond just the executive level of these labs are represented as well, because you do hear things talking to folks in these companies that sometimes deviate from the public messaging of the CEOs and

other executives. So to the extent that we're basing our homeland security policy, our approach to addressing these ultimately catastrophic national security risks to critical infrastructure, on the assessments of people who are not the only voices in their labs, I think that's maybe a missed opportunity. But this is certainly a really impressive group; it speaks to the immense convening power of the State Department,

or rather, of the Department of Homeland Security. Alejandro Mayorkas, the Secretary of Homeland Security, will be the chair of this board, so it's got a lot of high-powered people. The Secretary is directly involved, and that's great; it's good to see that level of attention on this. But again, you would want to see, at least at some level, maybe not at the level of this board, an accounting for other views that exist within these companies,

views that in some cases raise levels of alarm significantly greater than, or different from, what you hear from executives.

Andrey

And on to the last section, with a couple more stories on synthetic media and art. The first one is titled "Air Head creators say OpenAI's Sora was finicky to work with, needed hundreds of prompts and serious VFX work for under two minutes of cohesive story." Air Head was one of the example creations that OpenAI released to demonstrate that creatives could use Sora to create not just clips, but actual creative outputs.

It was a little minute-and-a-half-ish story built around the idea of a person with a balloon head. The article goes into how it was not so easy to do, in part because Sora wasn't consistent from shot to shot, so in the end they had to use hundreds of generations to produce this minute-and-a-half-long video. Apparently there's an estimate of 300 to 1 of generated content to footage that was usable.

And on top of that, they also had to manually perform color grading, timing adjustments, and some VFX in post-production. All of which goes to say that Sora is not necessarily going to just remove the need for human talent and human involvement in video production.

Jeremie

Yeah, at least not today, I guess. Maybe we need more scale, or more fine-tuning, or whatever. But that definitely seems to be the case, and it is at odds with some of the messaging we've seen, that this is going to make things effortless, as you put it. It was actually quite interesting to read the specific issues

that they had with it. Reading it, you go, oh man, this is a pain in the butt. The one thing I didn't walk away with was a sense of how much of an efficiency boost this was over trying to make an equivalent video without Sora; I wish there was a little bit more there. The one thing that orbited that idea was that they spoke to Patrick Cederberg, the post-production lead on

Air Head. He was saying, I would guess it was probably 300 to 1 in terms of the amount of raw Sora output versus what ended up in the final cut. That sounds like a lot, and it probably is, and obviously it was a ton of work to find that content and then tweak it. But it makes me wonder whether that may well still be better than the

alternative. So it's hard to know how much of a leap this is unless we can get that apples-to-apples comparison. But, yeah, an interesting roadblock, and definitely not quite what was publicly messaged in the launch and all the hype around it. That's going to happen, of course.

Andrey

And our last story: eight newspaper publishers have sued OpenAI over copyright infringement. These publishers include the New York Daily News, the Chicago Tribune, the Orlando Sentinel, and various other organizations like that. Similar to the New York Times lawsuit, they are alleging copyright infringement through the unauthorized use of their articles in training ChatGPT. Not too much more to say.

Another lawsuit dealing with the usage of copyrighted data in training language models and AI models, and another one that we'll have to keep an eye on, maybe.

Jeremie

Yeah, Microsoft predictably declined to comment. But it kind of seems like the world of media is bifurcating into folks who want to do the "hey, don't touch my stuff and I'll sue you" thing, and people who want to go in the more Grimes-like direction of "hey, use my stuff, I want to partner, let's find a way to make it work."

And it's really unclear, because one of the big questions is obviously: are you shooting yourself in the foot long term if you partner, or are you shooting yourself in the foot long term if you don't partner? It's really difficult to

predict where this goes. But there's more pressure on OpenAI for sure to make sure that the precedents are set right, because otherwise this really undermines a lot of their attempts to, I suspect, even compete with Google. I've heard speculation that OpenAI is thinking about launching a search feature, and if that's the case, one of the things you might want to do is surface results from

these websites, and be able to back up a lot of what you're talking about. If you can't do that, if you can't surface breaking news from whatever institution, maybe that undermines your product vision a little bit.

But a lot of this is going to be shaped in the coming months, I want to say, rather than years, and our lawyer friends who listen to the podcast, who have shared a lot of analysis with us behind the scenes and on the air, will obviously chime in, and we'll share that.

Andrey

And with that, we are done with this week's episode of Last Week in AI. Once again, you can find the text newsletter at lastweekin.ai, and as always, we would appreciate it if you give us a review and share the podcast. If nothing else, we do like to know that us recording for an hour and a half, or even almost two hours sometimes, is useful.

Jeremie

I will say, recording like three hours before catching a flight to do it, do I get bonus points for that?

Andrey

Okay, yeah, that would give you some bonus points. So again, thank you for listening to this recording of ours, and please do keep tuning in.
