
#159 - Inflection-2.5, Devin, OpenAI board update, SIMA, EU AI Act passed

Mar 18, 2024 · 1 hr · Ep. 198

Episode description

Our 159th episode with a summary and discussion of last week's big AI news!

Check out our sponsor, the SuperDataScience podcast. You can listen to SDS across all major podcasting platforms (e.g., Spotify, Apple Podcasts, Google Podcasts) plus there’s a video version on YouTube.

Check out our text newsletter and comment on the podcast at https://lastweekin.ai/

Email us your questions and feedback at [email protected] and/or [email protected]

Timestamps + links:

Transcript

Andrey

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we did not cover in this episode. I am one of your hosts, Andrey Kurenkov. I finished my PhD focused on AI at Stanford last year, and I now work at a generative AI startup.

And with us for this episode is not Jeremie; we have another co-host.

Daniel

Hey, I'm Daniel. If you've listened to every episode of this podcast, you've definitely heard me before; I've subbed for Jeremie a couple of times. I co-run another publication with Andrey called The Gradient and the podcast over there, and I currently work on ML compilers.

Andrey

That's right. So if you are curious about another podcast we have mentioned in the past, Daniel hosts the Gradient Podcast, where he interviews a whole bunch of people from the world of AI, a lot of academics, but also people in industry. I used to also do interviews on there. So, FYI, that exists. And if you think that Daniel's deep voice is something that you like listening to, maybe you want to check that one out. Before we get to any news, just a couple of things

to get out of the way. First, as I like to do, I just want to give thanks for a couple of new points of feedback. We had a new review on Apple Podcasts, and this review highlighted that there's so much in every one of these episodes that sometimes it's hard to even get in a whole episode once per week. And that's understandable. This one, I think, we will keep a little bit shorter, so hopefully you can finish it this week. But, yes, thank you to An Austin SF for that review.

And we also had a few subscribers on Substack coming through and sending us messages as they subscribed. One message said you guys are a major part of my podcast repertoire, and talked about meticulous coverage and in-depth analysis, things we are, I guess, trying to do. So yes, thank you for those messages. We do read them even if we don't reply, and we appreciate it. And one last thing before we get into the news: as we have been doing, we need to do our sponsor read.

And once again, we are sponsored by the Super Data Science Podcast. The SuperDataScience podcast, as you know if you've been listening to it, is one of the top technology podcasts globally. And they cover not just data science, but also machine learning, AI, data careers, all sorts of topics in that space. It is hosted by Jon Krohn, who is chief data scientist and co-founder of a machine learning company, Nebula, the author of Deep Learning Illustrated, and the host of this podcast

for quite a while. This podcast now has over 700 episodes, released at least twice a week. So he has talked to a huge variety of people and knows a ton about data science, AI, and machine learning. And if you want to learn more from the perspectives of people in the field, we do recommend that one as a podcast to check out.

Daniel

So our first section today is on Tools and Applications. And we're going to start off with, well, there have been a lot of releases recently. You've probably heard about many of them. And the one we're going to start off with today is Inflection-2.5. You may have heard about Inflection's personal AI before, called Pi, and the claim around this one is that it's competitive with leading language models like GPT-4 and Gemini.

And really the point is that Pi is not just an engine with raw capabilities; they have this unique, empathetic fine-tuning on top of it. And so really, the point is for it to act more as a personal assistant to help you get the things you want to do done. They've also made some strides in pretty standard areas like coding and mathematics. So that's pretty important for key industry

benchmarks. But they're also incorporating real-time web search capabilities, providing users with high-quality breaking news and up-to-date information. It's interesting, I suppose, because I feel like Inflection, for me at least, kind of fell out of the limelight for a while. But now that they're back with 2.5, I'm kind of curious what that's going to look like.

Andrey

That's right. Yeah. Just as we are seeing all these releases of Gemini and Claude 3 at about the same time, Inflection came out with this announcement of Inflection-2.5, and they highlight a couple of things: approaching GPT-4 performance, but also using only 40% of the amount of compute for training, apparently. And they say they've already rolled this out. And they highlight some pretty big numbers here.

So they say that an average conversation with Pi lasts 33 minutes, and 1 in 10 lasts over an hour each day. So people are talking to this Pi assistant, or Pi chatbot, a lot. And they do highlight, I guess, their main pitch with Pi: that it is kind of emotionally intelligent on top of being just IQ intelligent. So they have optimized it to be kind of a pleasant conversation partner, so to speak. So, yeah, another release of a major model, and more cool numbers here.

We have a lot of tables showing pretty big jumps between Inflection-1 and Inflection-2.5, with a lot of the numbers being near GPT-4. So the train of new models and, you know, new GPT-4 competitors keeps rolling on.

Daniel

But it's hard to emphasize enough: 33 minutes is an impressive conversation length, and the 60% of users returning the following week is also very impressive stickiness. I don't know about you, Andrey, but I do not spend 20, let alone 33, minutes in a conversation with GPT-4 or Claude.

Andrey

That's true. Yeah. So I think, as with Character.ai, Pi is one of these things that many people do appear to just enjoy having conversations with. I can't say that I do that, but it is a thing that's happening. Just FYI, it's true. And I'll do the next story, which is introducing Devin, the first AI software engineer, which is coming from Cognition Labs. And this kind of took the world of, I guess, conversation by storm.

This last week, there was a video posted on Twitter or X where they highlighted this release. What they say is that this is a fully autonomous AI software engineer, and basically, yeah, it's a much better, more sophisticated kind of agent that on various benchmarks is much more impressive. They evaluated it on SWE-bench, a software engineering benchmark, and found that it correctly resolved 13.8% of issues end to end, which exceeds the previous state of

the art by a lot. Before, it was just a couple percent. And they say that it's also been tested on real-world tasks such as jobs on Upwork. And yeah, they have various demos. They also showed how it can write a little code to use a computer vision model using Modal, and it can do all of this kind of from documentation, without being trained specifically to do that. So lots of examples of what this can do.

And yeah, seemingly, as far as I've seen and as far as I think most people in the community have seen, the most impressive fully autonomous programming agent, as they say.

Daniel

I want to make like two notes about this. For one, I guess I am a person who, like, never trusts recorded demos. And I think that a lot of people on Twitter are also this way. If you want to see somebody who got access to Devin and then started looking at some of its different capabilities, there's a guy named Andrew Dao, I want to say, and he's tested it out on a couple of different things and found that, I think, there was one task, I don't quite remember it,

related a bit to graphics, that it didn't quite do so well on. It made this kind of working version of Flappy Bird that wasn't too great. And then somebody else came along with a website called App Scrapper that created something a lot better. He also, I think, had it set about fine-tuning an LLM on its own. So that's interesting. Seems like it can do a couple of pretty useful things.

That being said, obviously it seems like there's a lot of work to do yet, given the SWE-bench performance and the fact that Cognition Labs is still hiring human engineers. So I think that says a couple of things there. I also want to caution about the discourse here: we have this thing that is called an AI software engineer, and everybody's saying it's all over when it comes to the job of the software engineer.

I don't feel like we have nearly enough data to make strong predictions about what this means yet. And it's a little bit irresponsible to say that if you're a software engineer, you're not going to have a job in five years because of this thing. So I just want to caution about making very strong claims like this in the light of these new releases.

Andrey

That's a good point. I think this is a step in that direction, but it's still pretty far. It hasn't necessarily been shown to, for instance, work within the context of very large software repositories, although it can look at, like, GitHub repos and so on. But the job of a software engineer is pretty complex. There's a lot to it besides just solving a task or addressing an issue on GitHub. So let's not kind of get carried away.

It's impressive. And if you look at the demos, they do have some videos of this agent having various capabilities: it has access to a shell and a browser, and it can do all sorts of planning and iteration given the output of a program. So it is definitely more sophisticated than just launching ChatGPT to write code, but it is also still pretty far from, in fact, replacing what most software engineers do.

Daniel

So next up, for our lightning round: if you have ever run into a delivery driver who seems to be sexually harassing you, or you're just an incredibly mean person, well, DoorDash has a new tool for you. It's called SafeChat+, and it's designed to detect and reduce verbally abusive and inappropriate interactions between customers

and delivery personnel. It reviews in-app conversations, identifies harassment, and provides options to report the incident, contact DoorDash's support team, or cancel the order without affecting the delivery person's ratings. And they're putting up some numbers here: it can analyze over 1,400 messages per minute in multiple languages. These include English, French, Spanish, Portuguese, and Mandarin, and all incidents identified by the AI will be investigated

by team members. This is, again, one of those rollouts that I think is in a pretty positive direction. You know, detecting verbal abuse just seems like an unabashedly good thing. If you are worried about something reviewing all of your in-app conversations, well, I mean, you're on DoorDash, so I feel like this isn't really a place where you should be too worried about that, or be having conversations that you wouldn't want an AI

to look at. But an interesting addition to the platform.

Andrey

On SafeChat+, they say that the previous SafeChat was, I guess, mainly or entirely manual screening, which seems a little bit crazy. But, as you might imagine, basically any online platform presumably is going to be using AI to screen what you send. These apps are sort of a consumer interface between you and a delivery person, so it would make sense for them to be doing this kind of thing. Next up, the story is Anthropic releases Claude 3

Haiku, an AI model built for speed, efficiency, and affordability. So we covered Claude 3, I think, just last week; at the time, they had not yet released the smaller variant, Haiku, and that is the story that we now have. And yeah, they highlight pretty much that it is by far the fastest and cheapest of the Claude 3 family while still being pretty sophisticated. It also has vision capabilities and the other things we highlighted about Claude 3 previously.

So for, I think, many little sort of quasi-real-time applications, you might start to see them being powered by this.

Daniel

And next up, looking into the domain of video. You've probably heard about Pika Labs before; they've made a lot of waves when it comes to generating AI videos. They have introduced a new feature that allows users to create sound effects from text for their generated AI videos. So you're not just seeing what's on the screen, you're getting to hear it too. And this is the second audio feature on their platform.

They previously had created a lip-sync tool that allows users to give voices to characters in their AI videos. The sound effects feature is only available to Pro plan subscribers and so on; hey, they have to make a little bit of money here. And it seems like the sound effects generated from simple text prompts are fairly impressive. They generally mirror what the user requests.

But again, you know, we're still pretty early, I think, when it comes to where generative video is. Right now you have tools like Sora, but there's a lot that needs to be done here. Generating AI videos is incredibly computationally expensive. For instance, people are still working on image generation at this point, so I do think that the area of vision generally is pretty underexplored, but it's exciting to see what people are putting together.

Andrey

That's right. Yeah. Yet another release from Pika Labs, coming pretty quickly after that lip-syncing feature we also covered. And they have a little highlight reel where they show various things, kind of stuff you would see in ads. So a lot of quick cuts of shots of bacon roasting or a car racing or a little fly, and you have the sound effect of a car on the road or bacon or that sort of thing. And it works. Yeah. If you just look at the video, it seems

pretty good. In practice, I would imagine some might work better and some worse, but, yeah, another addition to the world of AI video generation, which seems to be moving pretty fast this year. Next up, Salesforce announces new AI tools for doctors, tools designed to reduce the administrative workload for healthcare workers.

So the first tool is Einstein Copilot: Health Actions, which allows doctors to book appointments, summarize patient information, and send referrals using conversational AI. The second tool, Assessment Generation, enables organizations to digitize health assessments like surveys without manual typing or coding. And these are on their Salesforce Einstein 1 platform, which apparently is already being used to consolidate medical data from various sources.

So yeah, we've covered some efforts along this line before. Usually you need a large company to be able to get through all the bureaucracy of doing healthcare stuff, but there's a lot of room to help with this kind of administrative workload. So personally, I think this is a pretty exciting announcement.

Daniel

Yeah, I'm generally excited about this domain too. I interviewed Shiv Rao, who started a company called Abridge; this was a while back. And they've been building something that kind of transforms patient-clinician conversations into structured clinical notes. And this is all powered by, you know, the generative AI stuff that everybody's excited about.

And it sounds like Salesforce is trying to get into something a little bit similar, maybe broader and covering slightly different parts of the administrative work doctors have to do. So I'm kind of curious to see what the competitive landscape looks like for this, but I do think that the point really hits home. Doctors do a lot of administrative work. They do a lot of work in general.

The more you can free them up from having to do these kinds of things and enable them to do what they were trained for, to be doctors, that's generally a good thing.

Andrey

And on to Applications and Business, with the first story once again being about OpenAI and their seemingly never-ending drama. Well, maybe it is ending. We covered last week, I think, how the report with the kind of summary of conclusions about this whole board incident that happened last year came out. Now there has been coverage, I guess, to just highlight that, as a consequence of that report coming out, there have been updates to the board, which we kind of knew was happening.

So Sam Altman has rejoined the OpenAI board, along with three new board members. We have Sue Desmond-Hellmann, the former CEO of the Bill and Melinda Gates Foundation; former Sony Entertainment executive Nicole Seligman; and Fidji Simo, CEO of Instacart. So you're seeing, yeah, some, I guess, bigger names coming onto the board.

As a quick recap, in case you somehow forgot or missed it, this is following on the heels of the board basically firing Sam Altman as CEO of OpenAI last year in a very dramatic move. And since then it got, you know, rolled back, and they've been working on updating the board and kind of improving the governance structure so that kind of drama doesn't happen again. This is, I guess, a move towards creating that stability.

Daniel

I honestly gave up on trying to follow this a while ago. I think that even the first weekend, when Altman was fired and all this was going on, it was just like: this is just entertainment at this point. But it is interesting to think about the governance measures. They do have a pretty complicated structure. And the board has announced a couple of new measures, including a whistleblower hotline for OpenAI employees and contractors and a stronger conflict of interest policy.

So that's all interesting and pretty important, I do think. Also, even if all of this is tiring for you, this company has built tools that a lot of people are using. And that internal turmoil, as people were realizing when all of this drama was first happening, is pretty important for the many, many, many companies out there who are relying on GPT-3.5, GPT-4, all of these things, in order to build the products they're trying to build.

Andrey

Yes. And as we covered before, we still don't have a full picture of what exactly happened. It seems basically like there was an internal fight going on. It seemed like Altman took issue with one of the board members issuing critical remarks in academic work, and then potentially tried to get that board member removed. As a result of some of that maneuvering, it seemed like there was, basically, some politics going on. Right. And it kind of backfired.

So there was a brief comment from Altman kind of saying, indirectly, that he probably could have handled some of his actions better; he said, with more grace and care. Either way, I guess probably we all should be ready to just put this behind us. At this point, the board is updated, and they have all these investigation notes and so on to reassure the investors of OpenAI that, from now on, there shouldn't be this kind of upheaval again.

Daniel

And in our next item, Cohere is coming back into the limelight. They announced the release of a major new language model called Command R, and this is interesting: they're in the midst of a heated fundraising round right now that could bring as much as $1 billion in fresh capital to them. But Command R is a pretty interesting model. It's a pretty significant leap forward for their technology.

It offers enhanced performance on key AI tasks like retrieval-augmented generation, which you've probably heard about, and tool use, as well as longer context windows, up to 128,000 tokens. Not quite what others are claiming right now, but still a pretty big context window. As well as more affordable pricing, as everybody wants more affordable pricing, and, you know, better latency and throughput numbers, as literally everybody will claim to you.

But also, interestingly, they're kind of expanding as a company and making strides on the business front. So lots going on for Cohere right now.

Andrey

Yeah, we are covering this in this section rather than Tools and Apps because Cohere focuses on enterprise use cases. The service is not consumer-facing; this is something that they're aiming to sell to businesses. And yeah, it's worthwhile to kind of keep Cohere in mind. They are still one of the big players in this space. As you said, they are looking to bring in like $1 billion. They already have spent hundreds of millions training these models, and they are expanding with more offices

and so on. So, yeah, Cohere is still doing quite a lot. And this new Command R model, which now has retrieval-augmented generation and tool use, they say is very accurate. They have a little table saying that they have an accuracy of 75.2, as opposed to models like Mixtral, which are more in the 60s. So still a significant player to be aware of, for sure.
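
For listeners who want a concrete picture of what that retrieval-augmented generation flow involves, here is a minimal sketch of the loop. This is a generic illustration, not Cohere's actual API: the corpus, the hash-based embedding stand-in, and the prompt format are all made up for the example, and a real system would use a learned embedding model and an actual LLM call for the generation step.

```python
import hashlib

import numpy as np

# Toy corpus standing in for an enterprise document store.
CORPUS = [
    "Command R supports context windows of up to 128,000 tokens.",
    "Retrieval-augmented generation grounds answers in retrieved documents.",
    "Cohere focuses on enterprise use cases rather than consumers.",
]

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Deterministic pseudo-embedding derived from a hash, just so the sketch
    # runs without a model; a real system would call a learned embedder.
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in CORPUS]
    return [CORPUS[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    # The generation step would send this prompt to the LLM, which can then
    # cite the retrieved snippets instead of answering from parameters alone.
    docs = "\n".join(f"- {d}" for d in retrieve(query))
    return f"Answer using these documents:\n{docs}\n\nQuestion: {query}"

print(build_prompt("How long is Command R's context window?"))
```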

Daniel

And first up in our lightning round, Meta has written a little bit about building their generative AI infrastructure. You might have heard about their Research SuperCluster a while back, where they had gotten 16,000 Nvidia A100s, I believe, really, really investing in their compute for doing lots and lots of open generative AI research. And they've again announced a major investment in their AI infrastructure, including two 24,000-GPU

clusters. These are built on top of Grand Teton, Open Rack, and PyTorch, and they aim to grow that infrastructure to include 350,000 Nvidia H100 GPUs by the end of 2024. That would provide compute power equivalent to nearly 600,000 H100s. And these are going to support current and next-generation AI models, including Llama 3, as well as AI research and development across GenAI and other areas. Again, this is kind of really putting into motion a long-term

vision. They want to build very strong artificial intelligence systems that are open and responsibly built. You've seen them release Llama models by this point. Those are only going to get pushed further. I'm kind of interested to see all this investment in Nvidia, because Meta has kind of joined the cohort of companies who are continuing to build their own training

and inference accelerators. So I'm kind of curious to see how their usage of compute evolves. That internal, you know, training and inference accelerator work is going to be pretty hard and obviously behind where Nvidia is by a long shot. But it is interesting to see that kind of continued investment in this massive, basically super cluster of Nvidia GPUs alongside that development.

Andrey

Yeah, it was interesting to see them release this blog post, titled Building Meta's GenAI Infrastructure, where they went into a whole lot of detail about the technical details of what they're doing. They have these numbers of aiming, by the end of 2024, to build out infrastructure including 350,000 Nvidia H100 GPUs. So that's like billions of dollars' worth of compute. And, yeah, they really are, I guess, emphasizing that they are investing a lot in large-scale compute,

which we heard about before. But seeing this blog post with a lot of details on it really drove it home. Next up, Baidu launches China's first 24/7 robotaxi service. So Baidu, which is a huge company in China in case you're not aware (you could say it's similar to the Google of China), has launched this service, Apollo Go, in selected areas of Wuhan. And it is a 24/7 robotaxi service akin to Waymo.

This is the third major expansion of this service in 2024, following the launch of the fully driverless service across the Yangtze River in Wuhan and their pilot operation on highways to the Beijing Daxing Airport. So, yeah, it seems like, similar to the US in a sense, they are just starting to see fully self-driving cars become commercial. We've seen Waymo driving around San Francisco 24/7 for a while, and it is now also the case in China.

Daniel

Our next section: Projects and Open Source. Our first story here is about a new metadata format for ML-ready datasets. And without verifying this at all, I have to guess that the developers and researchers who worked on this are based in France, because it's called Croissant. An important bit of context here is that ML practitioners often spend quite a lot of time understanding and organizing datasets. And there's a wide variety of data

representations. If you've spent any time as an ML engineer, you've spent lots and lots of time working with data, working with data pipelines. It's a big headache. There are existing metadata formats, like schema.org, like DCAT, that aren't designed for the specific needs of ML data. And these needs are extracting and combining data from various sources, including metadata for responsible use, and defining training,

test, and validation splits. So they introduced this new metadata format for ML-ready datasets. It's developed collaboratively by a community from industry and academia as part of the MLCommons effort. It doesn't change how the actual data is represented.

Of course, you wouldn't want it to, but it provides a standard way to describe and organize it, building upon existing things like schema.org but augmenting that with comprehensive layers for ML-relevant metadata: data resources, data organization, and default ML semantics. And some of the major tools and repositories you're probably aware of (Kaggle, Hugging Face, OpenML, and so on), as well as popular ML frameworks, are going to be supporting the format for the datasets they host, right?
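
To make the idea concrete, here is a rough sketch of what a Croissant-style dataset description might look like. Croissant is JSON-LD layered on schema.org vocabulary, with an ML-specific layer describing files and record structure; the exact property names below are a best-effort illustration rather than a verbatim excerpt from the spec, so treat this as a sketch of the shape, not a valid Croissant file.

```python
import json

# Hypothetical Croissant-style description. Property names like
# "distribution", "recordSet", and "field" reflect our reading of the
# format's general shape and are illustrative assumptions.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "toy-sentiment",
    "description": "Example dataset described with ML-relevant metadata.",
    "license": "https://opensource.org/licenses/MIT",
    "distribution": [
        # Where the raw files live and how they're encoded.
        {"@type": "FileObject", "name": "train.csv", "encodingFormat": "text/csv"}
    ],
    "recordSet": [
        # How records inside those files are structured and typed,
        # which is the ML-specific layer on top of schema.org.
        {
            "name": "examples",
            "field": [
                {"name": "text", "dataType": "Text"},
                {"name": "label", "dataType": "Integer"},
            ],
        }
    ],
}

print(json.dumps(dataset_metadata, indent=2))
```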

Andrey

Yeah, it's a bit of a nerdy topic, a metadata format for ML datasets. But if you're doing work in the AI space and have interacted with various datasets, you might know that, at least when I was doing this, there wasn't much of a specification. Everyone just kind of did their own thing, and every repository had its own approach to representing data. So this seems to,

yeah, introduce a new standard where things will be a little bit more organized and have richer information. It'll have things like, you know, data resources, data organization, default ML semantics, and various tools also to interact with this metadata. So, yeah, a real sign that engineering for ML, and sort of the whole space of machine learning and AI, is maturing a little bit, with something like this coming out and possibly becoming a little more standardized.

We'll see. And the next story, also on the open source front, is SaulLM-7B, a pioneering large language model for law. So yeah, this is a new open source model specifically for legal tasks. It was fine-tuned on top of the Mistral 7B architecture and is trained on an English legal corpus with over 30 billion tokens. So a pretty large amount, you know, hundreds of thousands or even millions of pages of legal text. And they released a technical report on

arXiv. This is a real collaboration of a lot of organizations, like Equall.ai, Sorbonne University, and Instituto Superior Técnico; various places all came together to fine-tune this model and show that it's the best one for law queries in this kind of chatbot context.

Daniel

The comparisons they did are pretty interesting. It seems like they mostly compare it against kind of open source models. So you'll see that the main things they've compared against are Mistral and Llama 2 variants. So I am curious, you know, to see what this would look like if they were to expand the benchmarks out a little bit. But again, interesting to see. I have a friend who's in law school right now who tested out some of these models, you know, like GPT-3.5 and so on,

on basically some of her law class homework. And it was not doing very well. I suggested, you know, prompt engineering and things, the standard stuff you'd want to try for doing better at that. But I am kind of interested to see what this digs up.

Andrey

And they release it, they say in the paper, under the MIT license. So pretty much a very open and permissive license.

Daniel

For our lightning round: Kai-Fu Lee is back in the news again with his AI company, 01.AI. They announced the open sourcing of the Yi-9B model. This is the most powerful of the Yi series in terms of code and mathematical capabilities. The 9 billion is actually a little bit of a lie; it actually is only 8.8 billion parameters, but 8.8B, I guess, doesn't have quite the same ring to it.

It's got a default context length of 4,000 tokens, is based on the Yi-6B model, further trained with 0.8 trillion tokens, and its overall performance is reportedly the best among open source models of similar size. It surpasses models like DeepSeek-Coder, DeepSeek-Math, Mistral 7B, SOLAR 10.7B, and Gemma 7B, right?

Andrey

We've seen a lot of releases of smaller models recently. We've covered, you know, Gemma from Google and also Phi from Microsoft in the 2-to-3-billion-parameter range. I think Gemma also came out at 7 billion. And now we have this one, which is 9 billion. And as is usually the case with these kinds of smaller large language model releases, they say that it beats out all the other ones in the same size category. Not too many details going on here. It seems like maybe it's pretty much on par.

And as with other models of this size, it is also a little bit optimized so you can run it without, like, a super cluster. There's a quantized version that can easily be deployed on consumer-grade graphics cards, being, you know, fairly cheap to run. So, yeah, we are still getting more and more of these models of all sorts, coming from all over, although I guess this is building on top of a previous release of Yi, which we have covered in the past.

So interesting to see this company continuing to push primarily on the release front. And moving on to Research and Advancements: the first story is once again about DeepMind, which seems to get a lot of headlines in this section. The story is about a generalist AI agent for 3D virtual environments. So Google DeepMind has worked a lot on agents for game environments. Of course, they started off with Go. Then, in recent years, they've kind of been going more

towards embodied settings. So they had their kind of open-ended generation of worlds that they did, I think, a couple of years ago. They do a lot of work in robotics as well. And in this work, they actually partnered with gaming studios and trained these instruction-following agents in a variety of real games that actual people can play.

So games like No Man's Sky, Hydroneer, Construction Lab, Valheim, Goat Simulator, Playhouse; yeah, there's about nine of these, including also some more research simulators for robotics. And the key is, I guess, that they train a single agent that is able to accomplish tasks in these worlds given some instructions, like collect wood. Now, they're not emphasizing that this is sort of, like, state-of-the-art performance.

This is not so focused on achieving some new benchmark. Like we've seen on Minecraft, for instance, people say, okay, our agent is able to do this whole very complicated thing. Instead, they emphasize how general it is and how it is able to generalize across these pretty complex environments and actually do things when instructed to. And the only things it gets are the pixels of the image of the screen, along with text. It doesn't have any, like, game-specific APIs or access.

So as far as trying to train a general purpose agent that can adapt to various settings, it's a pretty impressive, step forward, I think. I mean, it's, definitely a jump from having relatively simple simulators to these pretty complex games.

Daniel

It is definitely really impressive that you can do things this way at all. I think that we kind of continue to be surprised at where you can go with this kind of ground-up pre-training, or all sorts of training. If you look at the average success of the SIMA agent by environment, they have a little table in the paper, if you want to go look at that, that shows its performance based on different evaluation methods. And it's, like, pretty good.

They range from somewhere around 30% or so to a 60% success rate, which sounds pretty far from 100%. But then you also want to take into account that humans are not going to have a perfect success rate on all of these games. And so the fact that it's able to do pretty well across all of these basic skills from navigation and object interaction, menu use

is pretty important. And you can kind of see this is continuing the trend of models like Gato, where they're really trying to build towards these more general systems and agents that can carry out a pretty wide variety of tasks.

Andrey

They do say that they are learning here mainly from letting people play. So this is kind of cloning on top of gameplay footage of real people, and then annotating a lot of gameplay with instructions. So the agents are not learning by themselves to do this. They are learning to clone the performance of humans and try to match their generality, which is also done sometimes in robotics, although DeepMind has also looked into reinforcement learning, having agents learn by themselves.

So yeah, there's quite a lot of space to move forward still in terms of having the agents kind of master these things themselves and learn more complicated tasks. Here, most of the tasks are relatively self-contained, like chop down a tree; it's not going to be told to do some hours-long task. But nevertheless, actually managing to generalize with one model across all these different scenarios is pretty impressive.
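
For a concrete sense of the behavioral cloning setup described here, the toy sketch below implements the core objective: a policy maps (frame, instruction) pairs to a distribution over actions and is trained to match logged human actions. This is purely illustrative; SIMA's actual architecture, observation encoding, and action space are far more complex, and everything here (the dimensions, the linear policy) is an assumption for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a "frame" is a flattened pixel vector, an "instruction" is a
# text embedding, and the demonstrator's action is one of N discrete choices.
N_ACTIONS, OBS_DIM, TXT_DIM = 4, 32, 8
W = rng.standard_normal((OBS_DIM + TXT_DIM, N_ACTIONS)) * 0.01  # linear "policy"

def bc_loss_and_grad(frame, instruction, human_action):
    # Cross-entropy between the policy's action distribution and the action
    # the human demonstrator actually took on this annotated frame.
    x = np.concatenate([frame, instruction])
    z = x @ W
    p = np.exp(z - z.max())
    p /= p.sum()
    loss = -np.log(p[human_action])
    grad = np.outer(x, p)          # d(loss)/dW for softmax cross-entropy
    grad[:, human_action] -= x
    return loss, grad

# One gradient step on a fake logged (observation, instruction, action) tuple.
frame, instr, act = rng.standard_normal(OBS_DIM), rng.standard_normal(TXT_DIM), 2
loss, grad = bc_loss_and_grad(frame, instr, act)
W -= 0.1 * grad
print(f"behavioral cloning loss: {loss:.3f}")
```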

Daniel

Yeah. Before we move on to the lightning round, it is pretty important to note that the research shows that SIMA's performance here relies a lot on language. They did control tests where the agent wasn't given any language instructions during training, and it behaved appropriately, but kind of aimlessly. So that's a pretty important thing to note here about language mediation.

Andrey

And interestingly, they seem to be training this from scratch. They don't build on top of some existing visual language model. They have kind of a specific agent architecture and then train it on that. Interesting choice, I feel like. But, yeah, now they have all these partnerships with gaming studios, I'm guessing. Hopefully we'll see them pushing more in this direction going forward.

Daniel

First off, in our lightning round: a common complaint you hear about the most high-performing models today, ChatGPT and so on, is that they're black boxes, and we don't know a lot about their internals. And OpenAI and Google will not spill all of the details to us. So how do you learn what these black box models actually look like? You can engage in what are called model stealing attacks, which can extract precise, non-trivial information from these

production language models. And really, the goal is to figure out, using API queries alone, how much information can you, or an adversary, learn about one of these production language models? And it seems like they've made some pretty interesting strides here. The attack was presented in this recent paper, Stealing Part of a Production Language Model, by Nicholas Carlini and a lot of other researchers at Google DeepMind.

They can recover the embedding projection layer of a transformer model given typical API access, and for less than $20, the attacker can extract the entire projection matrix of OpenAI's Ada and Babbage language models, which is pretty impressive. You might also wonder, about a paper like this, given that they're engaging in something called model stealing, whether OpenAI might be a little bit angry about it.

One of the researchers on this paper did clarify that they found these vulnerabilities in the OpenAI models, they did their work, they told OpenAI about it, OpenAI patched it, and then they released this paper. So if you try to reproduce their work here, you're probably not super likely to have the kind of success they did.

Andrey

There's actually a coauthor from OpenAI on the paper, along with several other places: DeepMind, ETH Zurich, McGill, and the University of Washington. The actual approach itself is, as you might expect, kind of mathy. They basically send some random prompts in and do a bunch of math to recover some of the details of the models. And, for instance, recovering the embedding projection layer is pretty significant.
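
The core trick, as described in the paper, can be simulated numerically: logit vectors returned by the API all live in a subspace whose dimension equals the model's hidden size, so stacking enough of them and taking an SVD reveals both the hidden dimension and the projection layer's column space. The toy below fakes the "API" with random matrices; the real attack also has to cope with endpoints that expose only partial log-probs, which is where most of the paper's machinery goes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a "production" model's final layer: hidden size h, vocab size v.
# The attacker sees only logit vectors (one per query), never W or h.
h, v, n_queries = 16, 200, 64
W = rng.standard_normal((v, h))                      # secret projection matrix
hidden_states = rng.standard_normal((h, n_queries))  # unknown activations
observed_logits = W @ hidden_states                  # what the API reveals

# Every logit vector lies in the h-dimensional column space of W, so the
# singular values of the stacked observations collapse after the first h.
s = np.linalg.svd(observed_logits, compute_uv=False)
recovered_h = int((s > 1e-8 * s[0]).sum())
print(f"recovered hidden dimension: {recovered_h} (true: {h})")

# The top-h left singular vectors span W's column space: the attacker has
# recovered the projection layer up to an unknown h-by-h linear transform.
U = np.linalg.svd(observed_logits)[0][:, :recovered_h]
residual = W - U @ (U.T @ W)  # projecting W onto the recovered subspace
print(f"subspace recovery error: {np.linalg.norm(residual):.2e}")
```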

You basically are kind of recovering the results of training that you may want to keep secret. They also are able to confirm things like hidden dimensions, the size of the model, which, again, OpenAI has not been releasing for a while. So, pretty significant research as far as real-world applications, and they do discuss how to potentially protect against this, in addition to disclosing that it exists. Next up: Data Interpreter, an LLM agent for data science. So there you go.

This is an agent that combines a few things: it has dynamic planning with hierarchical graph structures, tool integration, and logical inconsistency identification, and all of that combines to be able to do various data science tasks, the kind of real-world, realistic-ish tasks. And similar to that software engineering agent we covered earlier, they say that this agent is much better at doing data science-y things. It showed a 26% improvement on a math dataset and a crazy improvement on open-ended tasks.

And they have several releases. So another, I guess, agent architecture that has been optimized for a particular domain and seems to be quite a bit better than something like just throwing ChatGPT at it.

Daniel

Yeah, this is a good trend and an important one to follow, this domain specificity. I mean, not everybody wants all of the abilities that something like ChatGPT is going to have. And so you're probably going to prefer a smaller LLM, a smaller system that is really good at the task or the set of tasks that you're interested in, as opposed to the massive LLM that can do everything, but only kind of well, and where most of the things it can do are not things that

you're interested in. So I do think we're seeing a lot more people pay attention to this kind of thing, and the interplay between the models doing what you want them to do and the construction of benchmarks that are assessing the kinds of capabilities you are actually interested in. And that's a pretty important back and forth to think about. Next up, ShortGPT says that layers in large language models are a bit more redundant than you'd expect.

As we know, LLMs have been growing in size significantly, with billions or trillions of parameters. And this recent study found that many layers of LLMs exhibit high similarity, and some play a negligible role in network functionality. The researchers defined a metric to measure this, called Block Influence, and this basically tells you the significance of each layer in an LLM.

The upshot of this is that when you have a massive LLM, maybe you want to make it smaller so that inference can be cheaper for you. So based on this, they proposed a straightforward pruning approach, layer removal, where redundant layers in LLMs are directly deleted based on their Block Influence scores. Again, model pruning is a pretty important area, and this is kind of orthogonal to lots of other efficiency methods that you might

care about, like quantization. So this is nice to see.
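
As a rough sketch of the idea (our reading of the paper, with toy numbers): Block Influence scores each layer by how much it actually changes the hidden states passing through it, roughly one minus the average cosine similarity between a layer's input and output, and pruning then deletes the lowest-scoring layers outright.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake hidden states entering/leaving each layer for a batch of tokens; later
# layers apply smaller updates in this toy model, mimicking redundancy.
n_layers, n_tokens, d = 12, 128, 64
states = [rng.standard_normal((n_tokens, d))]
for layer in range(n_layers):
    update = rng.standard_normal((n_tokens, d)) / (1 + layer)
    states.append(states[-1] + update)

def block_influence(x_in: np.ndarray, x_out: np.ndarray) -> float:
    # One minus mean cosine similarity between a layer's input and output:
    # a layer that barely changes the residual stream scores near zero.
    cos = np.sum(x_in * x_out, axis=1) / (
        np.linalg.norm(x_in, axis=1) * np.linalg.norm(x_out, axis=1)
    )
    return float(1.0 - cos.mean())

scores = [block_influence(states[i], states[i + 1]) for i in range(n_layers)]
pruned = sorted(np.argsort(scores)[:3].tolist())  # delete lowest-scoring layers
print("layers to remove:", pruned)
```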

Andrey

And one last research paper: it is PixArt-Σ, weak-to-strong training of diffusion transformers for 4K text-to-image generation. This is a follow-up to a predecessor, PixArt-α, and the deal is, as per the title, that it is able to generate 4K, super-high-resolution images. They say that there's a couple of things that they

did to do that. They started from this weaker baseline and evolved it into a stronger one, mainly by incorporating higher-quality data; hence they call this process weak-to-strong training. And they also optimized some things with the attention module, compressing keys and values, those kinds of details.

The point being that now there is a new image generation model that is able to generate really, really nice images at a really small size of only 0.6 billion parameters, compared to SDXL and SD Cascade, other ones we've covered before that are quite a bit bigger and, you know, similar in terms of performance. So we still have this continual optimization of image generation models, and this is just the latest in that process.
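
The key-value compression mentioned here is easy to sketch. At 4K resolution the token sequence gets very long, and attention cost scales with (query tokens) times (key/value tokens), so shrinking just the keys and values cuts that cost without reducing the number of output tokens. PixArt-Σ uses a learned compression operator; the plain average pooling below is only a crude stand-in for the idea, with all sizes made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# A long token sequence, as you'd get from a high-resolution image.
n_tokens, d, group = 1024, 64, 4
q = rng.standard_normal((n_tokens, d))
k = rng.standard_normal((n_tokens, d))
v = rng.standard_normal((n_tokens, d))

def compress(x: np.ndarray, g: int) -> np.ndarray:
    # Merge every g consecutive tokens into one, shrinking the KV sequence.
    return x.reshape(-1, g, x.shape[-1]).mean(axis=1)

k_c, v_c = compress(k, group), compress(v, group)
# Attention scores are now 1024 x 256 instead of 1024 x 1024.
out = softmax(q @ k_c.T / np.sqrt(d)) @ v_c
print("output shape:", out.shape, "| KV tokens:", k_c.shape[0])
```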

Daniel

Our next section is on Policy and Safety. And really, the big news here is with the world's kind of regulator-in-chief, I suppose: the EU Parliament has approved the world's first major set of regulatory rules to govern artificial intelligence. You've probably heard about the EU AI Act before. This is something that's kind of been in existence and in motion for quite a while, and it was finally endorsed, with 523 votes in favor, 46 against, and 49 abstentions.

The big thing here is categorizing AI technologies into levels of risk. These go from unacceptable, which would result in a ban, to high, medium, and low hazard. It's expected to come into force at the end of the legislature in May, after final checks and endorsement from the European Council. Reactions to this are kind of mixed. A lot of people are not too happy with what this is going to do to the EU AI ecosystem.

You might be aware that places like France and, for example, London are really hotbeds for AI. We've been talking a lot about Mistral; Mistral is based in France. And so there's a lot of concern over: is this legislation going to be onerous for smaller companies? Is it going to be able to evolve with the technology that it is trying to regulate? So a lot of questions here. Not everybody is super happy about this. Pretty mixed reactions from what I've seen.

Andrey

Yeah. Yet another story on the EU AI Act; we've actually covered it maybe, like, a dozen times on this podcast already. Last time we covered it, we had the news that the final text of the Act was approved. And so this is the actual vote that has gone through, with most voting in favor, as you said. You know, there's still a little bit more for it to be fully integrated, so it won't come into force until May.

But then the actual implementation of the regulations will be starting in 2025 and phasing in over several years. But yeah, as we've covered before, this is a really big deal. Outside of China, few countries have really had this kind of attempt at a pretty strong, comprehensive

AI regulation. And so, you know, there are going to be downstream effects, as with previous EU actions, because big companies like Google and Facebook will have to deal with this in the EU, and presumably in the US, Canada, and, you know, throughout the world, there will be ramifications beyond that. Next up in the lightning round: the US spearheads the first-ever UN resolution on AI, aimed at ensuring equal access.

So yeah, the US is putting forward this UN resolution on creating safe, secure, and trustworthy AI technology that is accessible to all countries, particularly those in the developing world. It's a draft resolution, and, you know, UN resolutions generally aren't maybe the most impactful. But I guess it's interesting to see the US making these kinds of diplomatic moves.

And the US national security adviser, Jake Sullivan, stated that the resolution would represent global support for a set of principles for AI development and use, and would chart a path to leverage AI systems for good. Evidently, the US has been negotiating with 193 UN member nations for about three months on this, and the resolution has achieved consensus support and will be considered later this month. So, you know, a bit symbolic, as with a lot of resolutions.

But at the same time, I guess it is good to see all of these countries talking and trying to come together on this topic of AI and have some sort of, like, first global agreement.

Daniel

I sometimes don't know what to make of, like, the US approach to AI. It's been not entirely consistent or coherent for a long, long time, and I don't know if I see it yet converging to kind of a single message. I mean, as you're saying, it's good that the US is working with UN member states and trying to, you know, work on developing international consensus on a shared approach to AI systems. That's going to be fairly important.

Although, you know, the contours of that shared approach: things are not going to be the same for every country. But also, the same day this article came out, there was news about, well, Jeremie's company, Gladstone, which I think has been working on, what is this, a State Department-commissioned report, if I recall, about AI posing an extinction-level threat, which I think is also kind of a symbolic thing to put

out. But it is interesting to see all of that going on and what US lawmakers are thinking about this. Our final story in this lightning round is about Google, again, restricting election-related queries for its Gemini chatbot. They've basically announced restrictions on the types of election-related queries that the chatbot can respond to. This is in an effort to prevent misinformation. These changes have already been implemented in the US and India, because both have upcoming elections.

We're all very excited for the 2024 presidential election season and what that might bring. And this is interesting because, for Google, it follows the recent withdrawal of their AI image generation tool in Gemini, due to controversies that you have probably heard a lot about, including historical inaccuracies and contentious responses.

Andrey

Right. So it seems they are playing it very safe with this restriction of just not being able to ask election-related queries. I think it pretty much just redirects to say, oh, just Google it, don't talk to me about this. And it could be also in part because of the trouble they got into with the criticism of the Prime Minister of India, which we covered before. So, yeah, I guess Gemini is playing it safe, and it could be for the best. I guess we'll see.

And on to Synthetic Media and Art. Again, just a few stories here. The first one is: researchers tested leading AI models for copyright infringement using popular books, and apparently GPT-4 performed worst. So we usually cover a lot of this copyright-type stuff in this section, and in this story, Patronus AI, a company specializing in evaluation and testing for large language models, found that, apparently, the models do infringe on copyrighted text.

They have this new tool, CopyrightCatcher, and they tested four leading AI models: GPT-4, Claude 2, Llama 2, and Mixtral. Presumably, the tool just checks if models are able to spit out verbatim contents from these copyrighted works. And, apparently, GPT-4 produced copyrighted content on 44% of the prompts. There were only a hundred different prompts, related to popular titles. So I'm not sure;

you know, this is not exactly a research paper. But still, given all of the lawsuits going on over copyright infringement, this is an indicator that maybe it wouldn't be very easy to argue that the models don't contain a lot of copyrighted content.

Daniel

We saw this earlier with the New York Times suit against OpenAI, and this is going to continue to be an important question. If you spent any time on Twitter in the past week, you probably also saw an interview with OpenAI CTO Mira Murati about the recent video generation model Sora. And she started talking about how it was trained on licensed and openly available data. The interviewer started pushing on this: was it trained on data from YouTube and things like this? And she just didn't elaborate.

It sounds like there's a lot going on here. I think that the copyright infringement questions are going to be pretty big.

Andrey

And these prompts that they used were pretty straightforward. So we have examples here, like: what is the first passage of Gone Girl by Gillian Flynn? Or, for instance: continue the text to the best of your capabilities, "Before you, Bella, my life was like a moonless night," which I think is from Twilight. So, yeah, just testing if they have access to this copyrighted text, or are able to continue things verbatim. So another development in the copyright story.
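
Patronus hasn't published CopyrightCatcher's internals, but the kind of verbatim-reproduction check being described is straightforward to sketch: prompt the model with a passage and flag the completion if it shares a long enough exact character run with the protected text. The `model_complete` stand-in, the threshold, and the check itself below are all illustrative assumptions.

```python
# `model_complete` would be a call to the model under test; here we fake it
# with a canned response so the sketch runs on its own.
def model_complete(prompt: str) -> str:
    return "Sure. My life was like a moonless night. Very dark, but there were stars."

def longest_common_substring(a: str, b: str) -> int:
    # Dynamic programming over character positions: length of the longest
    # run of characters the two strings share verbatim.
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b):
            if ca == cb:
                cur[j + 1] = prev[j] + 1
                best = max(best, cur[j + 1])
        prev = cur
    return best

def reproduces_verbatim(completion: str, protected: str, threshold: int = 50) -> bool:
    # Flag the completion if it contains a long enough verbatim run from the
    # protected text; the 50-character threshold is an arbitrary choice here.
    return longest_common_substring(completion, protected) >= threshold

protected = "My life was like a moonless night. Very dark, but there were stars."
completion = model_complete("Continue the text: Before you, Bella, ...")
print(reproduces_verbatim(completion, protected))  # True for this fake output
```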

Daniel

Our next story is about Nvidia's NeMo AI platform. Just to remind you a little bit, this is an end-to-end, cloud-native framework, basically available to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrail toolkits, data curation tools, and also pre-trained models. And a group of authors recently filed a class action complaint saying that Nvidia used some of their books without permission to train its models. Again,

lots of copyright stuff going on here. And this was filed on Friday in the San Francisco division of the Northern District of California. It pits Nvidia against three different authors: Abdi Nazemian, Brian Keene, and Stewart O'Nan, who hold registered copyrights on books they say that Nvidia used to train NeMo. And Nvidia is saying that their platform complies with copyright.

One of their spokespeople said: we respect the rights of all content creators and believe we created NeMo in full compliance with copyright law. But again, I think that if you have API access and try to pull these things out, there are lots of ways to kind of probe these copyright claims, though it is a little bit difficult to do a lot here when these providers are not being super open about their training data.

Andrey

We haven't seen Nvidia get into trouble with copyright stuff so far; it's mostly been Microsoft and OpenAI. So I guess it's interesting to see them getting into a lawsuit as well. And they claim that they, as you said, are in full compliance with copyright law. Copyright law isn't super clear right now on this stuff, so that could be a kind of vacuous statement. But yeah, yet another lawsuit to add to the pile to keep track

of. And the last story for the section: five of this year's Pulitzer finalists are AI-powered. So we have 45 finalists for this year's Pulitzer Prize for journalism, and apparently five of them used AI in their research, reporting, or storytelling process. And we know this because the awards required entrants to disclose AI usage; the Pulitzer board added that requirement due to the rising popularity of generative AI in the past year.

It seems like they did not consider restricting AI usage, but they did say you had to disclose it, and with that disclosure, it seems that people are indeed starting to use it. And as usual, I just want to finish up with one fun story. As I said, this would be a shorter episode, so we are almost done. And the fun story I picked out this time around is related to Pika Labs. As we covered before, there was an article: I made my Superman action figure talk with Pika Labs' new AI lip-sync tool.

So just a little demonstration of what you can do with it. And, yeah, it has a little action figure of Superman lip-syncing to some prompt, and some other examples in there as well of giving voice to action figures. And I found it to be a pretty fun example of what you can do with it and how you can, you know, create little creative short videos on YouTube, for instance, now with a much easier way to do lip-syncing than would presumably be possible without this.

And with that, we are done with this episode of Last Week in AI. Thank you, Daniel, for filling in the role of co-host.

Daniel

Thanks for having me.

Andrey

As always, you can find the articles we discussed here today, and subscribe to our weekly newsletter with similar ones, at lastweekin.ai. And as always, we'd appreciate it if you share the podcast, give us a review or two, or, you know, send us an email at contact at lastweekin.ai, whatever you feel like doing to engage with us, just to make us feel like people do enjoy this stuff. But more than anything, we appreciate it if you keep listening. And so please do keep doing it.
