¶ Intro / Banter
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to the episode description for the timestamps and links to skip to any of the many stories we'll be talking about today. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup. Nice. This is Jon Krohn. I am irregular.
You might even say one of your odd hosts. Mm-hmm. Would be a good adjective. One of our regular guest co-hosts is how I like to think about it. Yeah, right. Exactly. I really appreciate that. Yeah. I've been on the show probably half a dozen times at least, and I love being on the show. Last Week in AI is the only podcast that I listen to. If people have heard me on the show before, I'm sure they've heard me say that before. Delighted to be here.
I'm perhaps best known for hosting a show called Super Data Science, which you've been on, Andrey. It's an interview-format show as opposed to a news-focused show like Last Week in AI, so they're a nice complement to each other, we might say. Mm-hmm. And something big since I've last been on the show is that in March I co-founded a new consulting firm, which I'm CEO of, and we're called Y Carrot, like Y hat, but, mm-hmm,
carrot like the computer character, the thing above the six on a US English keyboard. It's a bit of a machine learning joke for people who are in the know. But we're focused on agent stuff. We're focused on generative stuff, RAG, and bringing that into enterprises, letting people get ROI on all the latest and greatest in AI. So there are some stories that I'll be able to relate to from firsthand experience because of that. Makes sense.
And now it's probably a good time to be consulting, 'cause there's certainly a lot happening very quickly, and it's honestly hard to keep up even if you're hosting a podcast, much less if you're not doing that. Andrey, it's unreal. I've never had an experience in business like this before. With every other commercial thing that I've ever tried,
it's been hard to get product-market fit. But for any of our listeners out there, and I'm probably cannibalizing my own business now, there's so much work out there. Like, a rising tide lifts all boats. There's so much opportunity right now to be transforming organizations with LLM-enabled technology, basically, that it's crazy. Every conversation leads to next steps. Nobody's ever like, I'm not sure this is what I need. Mm-hmm.
It's just a matter of prioritizing and getting things done. And it's actually quite a good fit to give a quick episode preview. This episode is gonna be pretty heavy on the tools section. Lots of new things, and most excitingly, ChatGPT agent just came out, so that'll probably be one of the big focus areas. But in business, there are lots of interesting developments on the hiring front we've been talking about for the last few weeks.
Even more kind of weird news of acquisitions, hires, movements, et cetera. And beyond that, we'll only have a couple of stories in research and in policy and safety. This is gonna be a bit of a quick episode, so it's just gonna race by. Try to keep up.
¶ Tools & Apps
So let's go ahead and dive in. Tools and apps, starting with OpenAI's new ChatGPT agent, which can control an entire computer and do tasks for you. So the way this looks is, in ChatGPT they have this kind of selector menu where you can choose various modes, including deep research, web search, et cetera, and ChatGPT agent is now a new option there. The gist of it is that it's combining two previously existing things.
They already had Operator, which could browse the web for you and do various tasks that way, and they had deep research, which could analyze and summarize information. So the way that OpenAI pitches this is as a sort of best of both worlds: a much more powerful agent that can do general computer use. It can, you know, click, it can run commands, it can browse the web, and so on. And so, yeah, this is the latest frontier, you could say, of agentic task execution beyond code. This is
able to do, conceptually, I suppose, anything you could do with a computer. And along with the announcement, besides the utility of this, they also show really, really strong performance on various benchmarks like Humanity's Last Exam and FrontierMath, things we've covered. ChatGPT agent with browser, computer, and terminal is able to outdo OpenAI o4-mini with tools, deep research, all of these, by quite a big margin. So this seems to be sort of the most trained agent that
OpenAI has ever released. It's cool. I used it already, and it's really effective. You can watch it working, so you can see it going on the internet, doing tasks for you. You can actually interrupt it and take over. So you kind of have this view, if you've ever remoted into, you know, a remote service: it's like doing that and watching a colleague of yours program or search the web, and you can actually go in there and interrupt it if you want to.
I haven't tried the interrupting yet. I'm not sure what value that would really provide, or if it can continue after you stop interrupting it. I don't know exactly how that works. It can create assets for you, like spreadsheets and slideshows, and we've been using it for that already and it's been really good. I've been a deep research user for months now. I pay for the Pro tier of ChatGPT in order to be able to use its amazing report building.
It seems like it would be comparable to having a McKinsey analyst working for you, except that they can get their work done in minutes instead of days or weeks. But it's that level of quality with deep research, and now adding into it as well the ability to output assets for you, to be able to see what it's doing while it's crawling the web agentically. It's a cool interface. I like it. Powerful. Yeah, it certainly seems like it.
And in fact, it's so powerful that there are some safety concerns. It's gonna ask you for permission for things like sending emails and making bookings, since it can kind of do whatever. It also has restrictions on financial transactions, probably a good idea. And as you said, this is now rolling out to Pro, Plus, and Team users, with Enterprise and Education coming later. So lots of people are gonna start using this.
I think we're gonna start seeing some pretty cool examples of what you can do with this. Onto the next story. We covered Kimi K2 briefly in the last episode as a new exciting open source release, but we didn't dive into it, so I think we will cover it a little bit more. The headline is "Alibaba-backed Moonshot releases new Kimi AI model that beats ChatGPT, Claude in coding, and it costs less." So the gist is that Kimi K2 is a 1 trillion parameter model that has a lot of experts.
So only 32 billion parameters are active at a time. And it had really impressive benchmark numbers. What I've seen since then is that it passes the vibe check. Everyone seems to agree this is a really good, really impressive open source model, competitive even, as this article says, potentially with Claude or ChatGPT or other proprietary models. So way beyond Llama, way beyond probably anything we have in open source, including DeepSeek V3.
And this is not even a reasoning model, so they presumably have an R1 variant of this in the works. Yeah, this is kind of a story that is unsurprising, I suppose. This is the trajectory that we're on. You're kind of expecting somebody to come up with open source approaches that rival, you know... Jeremy talks about this a lot on the show. I'm sure you do as well, but for some reason I remember Jeremy saying this frequently, that about six months
after a proprietary model comes out, you can expect similar capability in open source. And that's what we're seeing here. Yeah, I haven't used it myself, but the benchmarks look good. Yeah, and there are interesting notes about it. For instance, people say that it is really good at creative writing. It has, like, a different writing style, potentially because of being trained on different data distributions coming out of China. So, yeah, interesting developments. And as with DeepSeek,
it's interesting to see this coming out of China, where they are more hardware constrained due to export restrictions, as you talk about quite a bit. And so in the technical report, similar to DeepSeek, they go into some of the interesting technical insights.
They in particular highlight Muon, this new optimizer that hasn't been proven so much yet, but in this case scaled to a gigantic model. So a combination of really exciting developments for open source, but also some new technical insights that are quite interesting. And next, "Amazon targets vibe-coding chaos with new Kiro AI software development tool." So, kind of a surprise story for me. We've seen Cursor, of course, be a very important agent-powered IDE for code development.
Claude Code has been killing it in the past couple of months. Now Amazon has released this new Kiro development environment that basically positions it as another agentic coding tool, one that is particularly focused on making it a little more principled. So they highlight specs and planning and all these kinds of things in their blog post. It also has all the various features that you'd expect, MCP and so on. So, boy, this is a really, really busy space with all this agentic coding stuff.
I was just exploring, like, Cline and Roo, these extensions by open source teams. There's, like, forks and combinations, and now Amazon is in the fray with this new tool. Clearly people are putting a lot of work in and trying to optimize and make this work well. I'm a big Cursor fan personally. How about you, Andrey? I used to use Cursor as my main tool, but Claude Code has kind of overtaken it, and I actually moved back to VS
Code from Cursor, just because it is now pretty feature comparable, and Cursor updates a lot and sometimes not in ways that work too well. Nice. That's good to hear. I'll have to, yeah, try that out and maybe also go back to VS Code myself. This one here, this Kiro announcement from Amazon, feels kind of random to me. I know Amazon is often throwing stuff at the wall to see what will stick, and this seems to fit into that category.
You know, big company trying out lots of different projects. I can't, off the top of my head, think of any big LLM releases from Amazon, proprietary or open source, that have been anywhere near the cutting edge. Can you think of anything? No, they have developed some models, but they really haven't tried to compete in terms of performance. They have internal models, presumably, for their chatbots and so on. So yeah, Amazon's strategy is, I think, interesting.
They don't try to be a frontier lab so much, but they work with Anthropic, for example. And they do develop some things like this to be in the ecosystem in some ways. Yeah, we'll see what happens. My crystal ball predicts that we're not all gonna be using Kiro browsers in a year or two. Yeah. Or IDEs, rather. Sorry. Yeah, it's a bit strange. They don't target enterprise that much. But regardless, it looks pretty slick, so who knows? Maybe it will actually take off.
And speaking of agentic coding tools, next story: "Anthropic tightens usage limits for Claude Code without telling users." So this is a development that happened this week. I saw this happening in real time on Reddit, where people on the Claude subreddit were complaining that their usage seemed to be more restricted. They were hitting the limits on using Opus, the biggest model, quicker. So apparently that's true.
At least this article seems to support it, especially on the $200 per month Max plan, where you have, like, a crazy amount of budget to use up tokens. And this has coincided with some instability. Like, Wednesday and Thursday, Claude Code and Anthropic were both down briefly and were just not usable. So, in a way, not surprising. They are definitely losing a lot of money by being so generous with this Max plan.
But I think it's an indication of where things are heading, where I guess at some point they'll have to be profitable, and the cost of these subscriptions is gonna go even beyond 200. Yeah. With functionality like agents now being available in Claude as well, you can imagine that their compute is getting slammed. So I mentioned earlier in the episode that I have a ChatGPT Pro subscription. I also have a paid Claude plan, because there are different kinds of things that I like to do with different providers.
I have Gemini Ultra as well. And Claude is my favorite for most tasks, actually. It's kind of my default go-to. And I have been hit. It's just funny that the story came up. I had never been hit with one of these overload errors before, but I hit one this week. So it seems like we're all kind of in the same boat, and as you said, it's unsurprising given how much money all of the big frontier labs are hemorrhaging on providing those services.
You know, they're losing money by giving us access to such powerful models at such low cost, and you wonder when things are gonna have to change. And so I understand, like you said, that they have to make some changes. It's surprising, though, because Anthropic is usually good organizationally about communication and getting things right. Maybe they just didn't anticipate that so many people would feel this change. But it's a rare own goal, I'd say, from Anthropic. I agree.
Yeah. They rarely seem to take these sorts of missteps, and I think it's probably an indication that Claude Code has taken off pretty rapidly and they have probably just been trying to keep up. It's a fun detail for me. So all these models allow you to use them with a subscription plan. You're not paying per token, generally, especially with this Max plan. So if you use some tools, you can see the hypothetical amount of money you spent.
And as a user myself, I'm spending like $2,000 in tokens on this $200 per month plan. It's insane. So, I don't know, I think this is a sign of things to come. That's a great stat. Again, that's, whatever the inverse of a margin is, a loss that they're taking there. Mm-hmm. Yeah. Nice. Next up we've got Mistral, and they are also keeping up with all the agentic hype. They have rolled out deep research in Le Chat, their
offering for talking to their models, you know, the equivalent of ChatGPT and Claude and so on. This is actually one of several things. They now also have projects. They have image editing, multilingual reasoning. So very much in line with Mistral kind of just racing to be feature equivalent to ChatGPT and Claude and provide an offering that's comparable. As we say with Jeremy here all the time, Mistral is in a tough position.
They don't have as much money, they don't have as much compute, but it's always cool to see them rolling out things pretty rapidly. Yeah, I mean, everyone is rolling out deep research. Some of the early movers have been doing it for a year now, and it's expected. It's what we call table stakes in software product design these days if you are an LLM provider, I think. And actually, I mean, there are all kinds of safeguards you need to get in place.
There's all kinds of engineering complexity when you roll this out at the kind of scale that Le Chat would be. But I'm actually gonna plug a free thing that I published a month ago on YouTube. I published this Agentic AI Engineering course. It's four hours long, and in the first hands-on project we use the OpenAI Agents SDK to create a deep research kind of functionality. And so you can kind of see how that works.
And yeah, so that's free on YouTube, and I'll provide a link for you to put in the show notes. It's pretty cool. 30,000 people have already watched it on YouTube and there are no ads. I've turned off ads. It's just there as an educational resource for people who wanna be doing cool stuff with AI agents. Yeah, it sounds like a pretty fun project for sure. Next, moving on to Grok. We spent quite a while talking last week about Grok 4 and some of the controversies around it.
Soon after, there was a strange development with Grok and X. They have released a feature called Companions in the Grok app, which you can access if you're on the SuperGrok subscription costing $30 per month. With these companions, there are a couple of personas you can chat with as sort of characters. They have 3D models. They talk to you with audio and you can talk to them with audio. One of them is an anime girl wearing sort of dark Lolita fashion.
And the article here is called "I spent 24 hours flirting with Elon Musk's AI girlfriend," which is, surprisingly, entirely accurate. This character, this companion, is literally designed to be flirty. It's in the system prompt that it should be a 22-year-old, girly, cute character who is into whoever is chatting with her. And you can, like, build up a meter for how much this companion is attached to you. At some point you can get into inappropriate territory.
You can actually reach a level where you're able to put the character in lingerie. I mean, interesting feature here from Grok, I suppose. I did not know this story. I've clicked on the link and I'm looking at the photos and videos, and it is intense. It feels like I shouldn't be looking at this while working. Yeah, it's not safe for work, entirely.
And I mean, there's something to be commented on, as it actually is potentially a significant concern and problem that people are already kind of falling in love with these AI companions. This has been happening for a while. So, you know, this might have some interesting effects on people if they really do start to bond with it. But yeah, just go and look at the screenshots and the videos of this, because it's something else.
Whoa. In this article it says, yeah, things can include descriptions of... I'm not gonna read them out loud. I feel uncomfortable saying these words, but sex acts. Mm-hmm. There's a quote here: "At no point did it ask me to stop or say I'm not built to do that." And then, yeah, I guess...
I'm just quickly skimming this as we're speaking here, but it's kind of gamified, in that depending on, I guess, how long you talk or the kinds of things you say, I don't know, you get hearts on the screen, and that allows you to level up to different levels in, I guess, this game. And yeah, when you get to level five, she's wearing lingerie. Yeah, it's interesting. It's interesting.
I mean, in some ways, you know, this kind of thing is inevitable, right? But it's kind of surprising that it's such a big mainstream company, one that's raised so much money and, yeah, just last week was making headlines for being at the frontier in some capabilities. Yeah. To be clear, this is not a new thing.
There are plenty of apps that provide this exact kind of feature, and it is just surprising that, you know, in Grok, the equivalent of ChatGPT or Claude and so on, this is now a built-in feature: literally, like, a sexy companion to chat with. Certainly a differentiator, I guess. That it certainly is. Next we've got a story of Uber being close to completing its quest to become the ultimate robotaxi app.
So this is because they have announced a partnership with Baidu to deploy robotaxis outside the US and China, focusing on Asia and the Middle East. Baidu already operates around 1,000 robotaxis globally and is in a pretty good spot, from what I can tell, like, competitive with Waymo. And Uber already has a partnership with Waymo where you can hail a robotaxi through their app. So I think the headline here is not too sensational.
It does seem like Uber is trying to partner and use robotaxis as part of the product, which I suppose they kind of need to, right? Yeah, the Uber share price has long priced in being able to go autonomous, to not have to be paying human drivers.
And it's a pretty wild thing as we start to have cars driving themselves, trucks driving themselves. In something like 30 of the 50 states in the US, truck driving is the number one occupation, and then lots of the other top jobs are supporting that in some way. And so we're marching inevitably toward more and more autonomous driving.
I think ultimately it can be a good thing for society, because that kind of job... You know, I live in New York, and I feel so bad for taxi drivers and Uber drivers. You can tell it pains them in a lot of cases to be using that right foot, just all day using that right ankle. And so in some ways it'll be a good thing, but it's also gonna be very disruptive to all these people who have this kind of job today.
So retraining programs will need to come into place, or some other kind of solution. Right. Yeah, it's been an interesting thing with Waymo kind of slowly but surely expanding their robotaxi capabilities over the last couple of years. Tesla just rolled out robotaxis, and there are companies other than Waymo working on autonomous trucks as well. Tesla itself is presumably working on it. As you said, there are like 3.5 million truck drivers in the US and around 1 million Uber drivers.
So it's gonna be here in a year, two years, three years, and it's gonna be disruptive, hopefully in a good way.
¶ Applications & Business
And onto applications and business. As promised, some interesting acquisition and hiring developments this week. First up, OpenAI's Windsurf deal is off, and Windsurf's CEO is going to Google. So we reported previously that OpenAI was in talks with Windsurf. Windsurf, which created one of these coding tools with agentic capabilities, seemed to be in talks to be bought out for three billion dollars. That was canceled, and the CEO and some of the top talent went over to Google
for a deal of, I think, reportedly around $2.4 billion, with some licensing details as well. So another case of a non-acquihire acquihire, where the big company hires away the top talent, the leaders really, of the project, throws in some licensing deal or something of that sort, and the company, Windsurf, you know, stays. It's still there. It hasn't been bought out in any sense. In fact, I don't think any shares in Windsurf went to Google.
We've seen many examples of this in the last couple of years at this point. Scale AI with Meta had this happen. Also, I think, Lamini with AMD. Different examples of a very different kind of new, seemingly normal thing for Silicon Valley. Like, you used to buy the company to acquire its people; acquihire is the term. But now you can kind of hire away the key people and the original company sticks around.
This used to be a kind of antitrust move in the Biden era, but now antitrust is not really a worry. So it just seems like a new profitable, or easy, way for large companies to do these kinds of deals. Yeah, and I think they were doing these kinds of deals originally to avoid antitrust inquiries. Mm-hmm. But then it started to become such common practice that antitrust regulators were like, wait a second, you're just,
you've slightly changed the approach here, but ultimately this is anti-competitive. Mm-hmm. And so this had a lot of discussion in Silicon Valley circles around, like, were the other Windsurf employees kind of screwed over in this deal? Because the top talent clearly, you know, got handsomely paid. But the way this works in startups is you get some share of ownership in the startup. You hope that either it becomes a big, profitable company and goes public, or it gets acquired.
Your shares get converted to cash that you can actually use, right? This is the kind of bet you make with startups. When you have this structure of deal where the company isn't acquired but the leadership goes away, that in some ways breaks the typical contract, or expectation, of being a startup employee, being someone who joins a startup. So, yeah, lots of questions from people around the nature of this kind of deal for Silicon Valley.
And in fact, just a couple of days after this happened, Cognition, the maker of the AI coding agent Devin, announced that they are acquiring Windsurf. So they kind of swooped in. We got the announcement that the top brass is leaving for Google, and now this other AI startup, Cognition, is buying out the remaining company, Windsurf, which is quite the story. This whole business development, even in the startup world and business, is pretty interesting stuff.
And even more news on this front. Anthropic hired back two of its employees who had just left for Cursor. We covered this: Boris Cherny and Cat Wu, two leaders in developing Claude Code, were announced to have gone to Cursor. Apparently they just reverted that. Again, a really weird kind of story in Silicon Valley. Two weeks after the announcement, they apparently are going back to Anthropic. So, wow. Yeah, it's bizarre.
It is bizarre. And continuing on that theme, you know, the way this was all kicked off is, of course, Meta going on a hiring binge, just a complete spree of throwing around money to get top talent from OpenAI and others. And there are new developments on that front as well. Reports of other high-profile OpenAI researchers going to Meta. We've got OpenAI researchers Jason Wei and also Hyung Won Chung, both pretty significant talents as far as I can tell.
So, yeah, there are now trading cards that you can see on Twitter for when people swap companies, going from OpenAI to Meta or, I don't know, OpenAI to Anthropic. It's quite a meme, I suppose, at this point. That's funny. Yeah, definitely. As you say, exactly, kicked off by Meta putting all this budget into it. And I think it's also, from speaking to friends who work at the frontier in these big labs, very stressful.
It is super intense work, because you're trying to stay at the frontier against other companies that are also spending billions of dollars on the same problem. And so, very stressful work. And so I'm sure the money, these hundred-million-dollar contracts that Mark Zuckerberg is supposedly personally negotiating, you know, that's part of it. But I think also part of the story here, which I don't see talked about publicly, but is just kind of my hunch, is that you also probably,
you know, if you've been at a frontier lab for years, you've been helping roll out cutting-edge LLMs, you're hoping that by switching to a competitor maybe there's gonna be a bit of a culture shift, that somehow the new role is gonna be a bit less stressful than what you've been going through for years at your current firm. Yeah. And at OpenAI in particular, they have grown like crazy, right?
They went from something like 1,000 people to 3,000 people in, I think, less than a year. And when you have that sort of startup scaling, it just compounds the craziness. Like, it must be really messy, really fast moving, and chaotic now at OpenAI. And that could be one of the many reasons besides money that these people are leaving OpenAI. One more story on this front. Meta has also hired two key Apple AI experts,
Mark Lee and Tom Gunter, who were researchers at Apple and are now going to Meta. So, not just going after OpenAI. Every kind of top talent is being sought out by Mark. On a related story, Meta, of course, is doing this for its superintelligence efforts, and they're one of many in the field, with OpenAI, of course, being one of the key ones. Mira Murati's Thinking Machines Lab has now closed their $2 billion seed round at a valuation of $12 billion.
This, of course, is composed of a lot of people from OpenAI, including the former CTO, Mira Murati, and we haven't seen too much from them. They're saying that in a few months they'll start rolling out some products and open source things of some nature. We've known that they have been looking at this kind of number, billions of dollars in a seed round with no product to speak of, and they got it. So the competition for AGI is certainly not slowing down.
Yeah. If you're not going to take a hundred-million-dollar contract from Mark Zuckerberg as an engineer who is one of the trading card players right at the top of their game, then the thing to do is exactly what Mira Murati has done here. And yeah, we've seen other folks from OpenAI, like Ilya Sutskever, do a similar kind of thing with Safe Superintelligence.
And The Economist did an interesting article a week or two ago that made the case that these AI valuations are completely insane unless AGI really is just a few years away. And I think that's quite reasonable given the kind of revenues and profits you might expect. You know, there's word that some of these are being valued at a hundred billion, 200 billion, just absolutely fantastical numbers. And speaking of billions, next up we have an actually very profitable business reaching that status.
No. Yes. Well, at least, you know, revenue generating. Yes, revenue generating. We don't know about profitable. This is Lovable. They just raised a $200 million Series A just eight months after launching. They're now valued at $1.8 billion, and in case you don't know it, it's one of the big winners in the agentic, kinda vibe coding world. Users can create websites and apps just vibe coded.
Apparently they have over 2.3 million active users and 180,000 paying subscribers, which yields 75 million in annual revenue. I mean, a crazy, crazy rise, a super successful play in the vibe coding space at the exact right time with the exact right kind of approach. Yeah. And I haven't used Lovable myself, but it's not like you see the code so much as a Lovable user, right? It's more like gen AI of a whole application. Exactly.
Yeah. This is for sort of non-technical people, broadly speaking, where you don't need to touch the code, generally. And so it's focused on apps and websites, things that are not super complicated, not the sort of things that, let's say, AI engineers tackle. And it's got a lot of users, and a lot of people are building apps and websites with this at this point. And just one more story dealing with billions of dollars, related to xAI. SpaceX has committed $2 billion to xAI.
So that's one of Elon Musk's companies investing in another of Elon Musk's private companies. There's also apparently gonna be a Tesla shareholder vote on Tesla putting some billions into xAI. So, you know, we could have an hour-long discussion about the weird business empire that is Elon Musk's and the various moves of different business entities, like xAI buying X, which recently happened. But suffice it to say xAI is looking for lots of money to keep, you know, doing what they've been doing.
Nice. I think all of this $2 billion went to an anime-themed sex chatbot. Is that right? I mean, that's definitely one of the big investments that Musk is betting on, it seems. Imagine if there was no gravity, baby.
¶ Research & Advancements
And we are done with all this stuff with billions and hires. But the next story, in research and advancements, actually is related in some ways. So this is a blog post covered in this article with the headline "A former OpenAI engineer describes what it's really like to work there." So Calvin French-Owen, who was an engineer at OpenAI for over a year, has published this since moving on. It's not a drama-type post. He just wanted to move on and start something new.
And so there is quite a detailed description of what it's like to work at OpenAI. He worked, for instance, on Codex, which is a very agentic coding tool, and there are lots of interesting tidbits here. For instance, he talks about OpenAI's rapid growth, where it went from 1,000 people to 3,000 people in the time that he spent there.
There's the crazy scale of this being a product where, you know, as soon as you launch something like Codex, you get a huge number of users using it. There are a lot of details on the culture being sort of bottom-up, with people taking initiative and doing different kinds of things. Lots of nitty-gritty stuff that isn't critical, isn't sort of dramatic, but is interesting if you work in the space as an engineer or just follow OpenAI. This backs up the case that I was trying to make earlier.
That people are, you know, looking for some kind of culture, maybe just hoping that by switching to another frontier lab they're not gonna be in such a hectic environment. Yes. There are so many little bits that could be worth mentioning. Like, he highlights that an unusual part of OpenAI is that everything runs on Slack. There are no emails. If you're a software engineer, that's a very interesting detail. If you, I guess, work in an office, that might be an interesting detail.
Yeah. And I guess this is a slow week for research and advancements, Andre, that one of the key research and advancement stories is a report on what it's like to work at OpenAI. Yeah. Well, we are trying to keep this one a bit shorter, so I decided to not include too many papers and do something a little bit different. We do have one research paper that we'll touch on. The title is "Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination."
So this is related to a whole bunch of research in recent months dealing with reinforcement learning for reasoning. There have been many papers presenting weird ways to train that sort of work unexpectedly. Things like incorrect rewards, things like training on super limited data. We've covered quite a few, maybe five or six of these kinds of papers. We also covered how there was skepticism and criticism of some of these papers.
The results seemed to be, first, a product of incorrect evaluations on these benchmarks. Now we also see that these results are very particular to the Qwen model family. So the claim here is that you get these nice results on Qwen potentially because Qwen was trained on the data of these benchmarks. When you actually do this on other models, you don't see the same sorts of positive results.
And so that basically disproves the conclusions of these other papers. They do show that the correct, kind of intuitive way to do RL works, as we would expect. But yeah, an ongoing development in the research world here. Yeah. Leakage is a big problem with these benchmarks: people training to excel at these benchmarks, but then the models maybe not performing outside of the benchmarks. All kinds of problems with benchmarks in this way.
I actually recently did an episode of my show specifically on this. I'll look that up while you're speaking next and have a link that people can follow if they want. It's kind of an hour-long discussion on the issues with LLM benchmarks.
This is a really interesting one here because it's specific to one model family, and it's researchers following a thread of surprising evidence where incorrect reward strategies or random reward signals were leading to reasoning performance. And that shouldn't be the case. It just shouldn't happen, and it would happen if there's leakage of the benchmark test set into the training data.
Exactly. Figure 1 of this paper shows that if you give Qwen an incomplete question, like "For how many positive integers greater than one is", and you stop there, the model autocompletes to the actual question and answer. So clearly there is data leakage that you can demonstrate, and this is not gonna happen if you use Llama, for instance. Nice. And then thank you, Andre, for talking there a bit.
If people want to hear all about the issues with LLM benchmarks, it's episode 903 of my podcast, Super Data Science. I'm gonna link it as well in the episode. So yeah, just one note on this paper. I think this whole story is an interesting examination of the super rapid pace of developments in AI. Now papers get published in a matter of weeks or months, and there's not much time for good peer review.
And so some things kind of leak through, and the scientific process is struggling. At the same time, this showcases the self-corrective nature of research, where pretty quickly after these initial papers we've had these follow-up papers explaining or rebutting their results. So overall, an interesting little micro example of the way that science works in the current world of AI.
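For readers who want to try the partial-prompt check described above, here is a minimal sketch in Python. It is not the paper's code or its exact metric. The names `generate`, `prefix_fraction`, `completion_match_rate`, and the toy questions are all assumptions made up for illustration, and `generate` stands in for whatever function calls your model and returns its continuation of a prompt.

```python
from typing import Callable, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.lower().split())


def completion_match_rate(
    generate: Callable[[str], str],  # hypothetical: prompt in, model continuation out
    questions: Sequence[str],
    prefix_fraction: float = 0.6,    # assumed cut point, not a value from the paper
) -> float:
    """Share of questions whose withheld tail the model reproduces verbatim.

    A high rate on a public benchmark hints that the questions were seen in training.
    """
    scored = 0
    hits = 0
    for question in questions:
        cut = int(len(question) * prefix_fraction)
        prefix, tail = question[:cut], question[cut:]
        target = normalize(tail)
        if not target:
            continue  # nothing was withheld, so there is nothing to score
        scored += 1
        if normalize(generate(prefix)).startswith(target):
            hits += 1
    return hits / scored if scored else 0.0


# Toy stand-ins (made up for this sketch, not real benchmark items or real models).
TOY_QUESTIONS = ["What is two plus two?", "Name the capital of France."]


def memorizing_model(prompt: str) -> str:
    """Pretends to have memorized TOY_QUESTIONS and completes them exactly."""
    for question in TOY_QUESTIONS:
        if question.startswith(prompt):
            return question[len(prompt):]
    return ""


def clean_model(prompt: str) -> str:
    """Pretends to have never seen the questions."""
    return " I am not sure."


if __name__ == "__main__":
    print("memorizing:", completion_match_rate(memorizing_model, TOY_QUESTIONS))
    print("clean:", completion_match_rate(clean_model, TOY_QUESTIONS))
```

Exact prefix matching is a deliberately strict choice here. A looser overlap score would catch near-verbatim recall, at the cost of more false positives on short or formulaic questions.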
¶ Policy & Safety
Onto policy and safety. First up, we've got some big money coming from the Department of Defense. Anthropic, Google, OpenAI, and xAI have been awarded up to $200 million in contracts for AI development. So there is an initiative to integrate AI agents across various mission-critical areas. This is coming right after the launch of Grok for Government, a suite of AI products for US government customers. OpenAI and Anthropic have already launched their own government things.
In June, actually, OpenAI introduced OpenAI for Government. So, yeah, another trend among all these frontier labs: the money of the federal government is definitely, you know, a nice bounty to go after. On the regulation front, we've got California State Senator Scott Wiener introducing a bill to regulate AI companies. So this is SB 53. We covered this. This was a big deal last year, with an effort to regulate that ultimately failed.
It was vetoed by the governor of California. There was lots of lobbying. There's now a renewed push for this kind of bill with tweaked details. And the key thing is additional reporting requirements and security protocols for AI models above a certain computing performance threshold. So, still an ongoing story. Still a big deal if it does get passed, and I think we'll probably keep reporting on it as developments happen.
And on the more concerning side of the spectrum, we've got an article titled "AI 'Nudify' Websites Are Raking in Millions of Dollars." So one of the big ethical issues with AI that we've known about for some years now is non-consensual explicit images. This has been a problem for years, with even teenagers being the target of false imagery, deepfakes that showcase them inappropriately.
Now there are many such websites. According to this article, there's an average of 18.5 million visitors per month, and these sites may be earning up to $36 million annually. So, just to showcase the scale of the problem: you know, there's a lot of talk about safety with AI, so x-risk and issues like that, but we shouldn't forget that there are already super significant ethical implications and actual negative impacts being brought on by things like this.
Yeah. You know, I talked earlier in the episode about how it's kind of inevitable that you'd have, you know, the sex chatbots come out of LLM technology, and this is a really concerning thing that also kind of seems like an inevitable misuse, in this case, of the technology. And yeah, hopefully, you know, hopefully.
I don't know how you regulate it exactly, but maybe penalties become so large that it just becomes, you know, something that's very hard to find online, which it seems right now it's easy to find, right? There are regulations being proposed, and passed in some cases, to target these kinds of things. So presumably it's up to Google and other cloud providers to go after these kinds of things.
And on another topic related to concerning uses of AI, we've also got facial recognition. So this is another thing that's been ongoing for years. We're concerned that you're gonna have the ability to get someone's name and potentially other details just from a photo of their face. It was developed even before ChatGPT. There's now this article, "Inside ICE's Supercharged Facial Recognition App of 200 Million Images."
So ICE, the department within the US that enforces immigration and has been cracking down quite hard, apparently has an internal app called Mobile Fortify that allows officers to use facial recognition to access a database of 200 million images, and these are images coming from multiple government sources: the State Department, CBP, the FBI, and others. So if you think state surveillance is concerning, or state police power is concerning, there are clearly more reasons to be concerned as a result of AI.
Well, yeah. And so, yeah, ICE stands for Immigration and Customs Enforcement, and ICE will receive, apparently as a part of this "big beautiful bill" that was passed recently by US Congress, a budget that is going to multiply manyfold. Billions and billions of dollars more budget for ICE. And it kind of makes me wonder.
So, you know, in the beginning, or recently, in this current administration, there's been a big focus on, okay, you know, this person is shown to be a gang member. I mean, you still end up in weird situations where, for example, people have been deported for supposedly being gang members and, you know, these people aren't going to a judge. There's not much due process. And so they make some mistakes. So there are issues anyway, even with how they're doing it today.
But if you're multiplying by manyfold the budget that ICE has, presumably the idea is to be taking, you know, a lot more people. There are a lot of illegal immigrants in the US, but at the same time, the US economy, for the most part, has a huge demand for those illegal migrants. So, the construction sector, for example: I recently read that 30% of people who work in the construction sector in the US are illegal migrants. And for things like food delivery apps, farming,
oh my goodness. I mean, that's gonna be way more than 30%. There are economic repercussions to deporting a lot of these people as well. So, I don't know. It's interesting. I don't have all the answers. Yeah, there can be a lot said about ICE and the state of US politics. Certainly. I have a lot of thoughts about many things that have been ongoing, but this is not the place for it, so I think we'll move on. That's true.
¶ Synthetic Media & Art
That's true. Yeah. And just one more story, in the synthetic media and art section that we occasionally have: "Video game actors' strike officially ends after AI deal." So video game actors, voice actors in video games, have ended this year-long strike, now that they have an agreement with major companies like Activision and Electronic Arts.
There were 2,500 members of the US union SAG-AFTRA involved. There was a big vote, and they agreed to things like protections for their rights to their voice, wage increases, things like that. So we've seen this happen with Hollywood actors. We've seen this happen now multiple times, and this is the latest example of the world of entertainment grappling with the reality of deepfakes and AI-generated media and coming, seemingly, to a new understanding of how to do this. Yeah, it's interesting.
This is a whole world that I hadn't really thought of. So there's this woman in the article, Ashly Burch, who I guess is kind of a big proponent of this video game actors' strike, or a big player in it. And she's voiced a huge number of characters in well-known games like Fortnite, The Last of Us,
Minecraft, and many others. And, you know, I hadn't really thought of this whole world, and I could imagine there can continue to be tons of work for video game actors because, unlike a film, which would typically be at most like two hours long, you could have huge amounts of dialogue that needs to get recorded. But now you could, you know, use technology like ElevenLabs to generate it.
And that is it for this episode. As I promised, kind of a quick one. Hope you kept up if you made it to the end. Thank you for listening and, of course, thank you, John, for fulfilling your guest co-host duties. Anytime, Andre, it's so great to be back. Do check out the links mentioned in the description for John's cool YouTube video and related episodes. And as always, we appreciate your reviews and your shares. Even though I sometimes don't get around to replying to comments,
I also appreciate your comments, so please do keep engaging and please keep tuning in.
