GPT 5.5 just did what no other model could - podcast episode cover

GPT 5.5 just did what no other model could

Apr 23, 202624 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Summary

This episode explores OpenAI's GPT 5.5 and 5.5 Pro, detailing its performance in real-world scenarios from teaching advanced subtraction to tackling complex tech debt and security backlogs. Claire Vo highlights the model's higher intelligence, efficiency, and ability to execute genuinely autonomous, long-running coding loops, significantly impacting the scope of solvable problems. The episode culminates with a successful, challenging reverse-engineering of a proprietary Bluetooth device, demonstrating the model's groundbreaking capabilities.

Episode description

In this mini episode, I break down OpenAI’s new GPT 5.5 and GPT 5.5 Pro after weeks of early testing. I walk through three real jobs I threw at the model:  building an app for me to teach my second grader more advanced subtraction concepts, tackling a tech debt problem in the ChatPRD codebase, and hacking into a proprietary Bluetooth pixel display that every other model had failed me on. My verdict: higher intelligence, better efficiency, and genuinely autonomous long-running loops that change what I think is worth tackling.


What you’ll learn:

  1. How I think about GPT 5.5 Pro’s pricing vs engineering time, and when I believe the “intelligence tax” is worth paying
  2. Why I treat GPT 5.5 as a developer model first, and why I couldn’t find a consumer use case that justified its intelligence
  3. The exact prompt pattern I use to unlock a long-running autonomous subagent loop
  4. How I got a near-six-hour autonomous run to one-shot 98% of edge cases in a migration over millions of chat threads and drop my Sentry error rate to the floor
  5. Why I’m now throwing GPT 5.5 at tech debt, flaky tests, and security backlogs first
  6. How I combined a Bluetooth packet sniffer and GPT 5.5 to reverse-engineer a proprietary pixel speaker after Claude Code and GPT 5.4 both gave up
  7. How I use the /personality command inside Codex to swap the default “baked potato” tone for something I actually enjoy working with

In this episode, I cover:

(00:00) Introduction to GPT 5.5 testing

(00:40) What is GPT 5.5 and how much does it cost?

(03:23) Testing GPT 5.5 in ChatGPT: the intelligence overhang problem

(07:12) Moving to Codex: where GPT 5.5 really shines

(16:01) Hacking a Chinese Bluetooth speaker

(21:47) Final thoughts on GPT 5.5’s intelligence and efficiency

Tools referenced:

• GPT 5.5 and GPT 5.5 Pro: https://openai.com/index/introducing-gpt-5-5/

• Codex: https://openai.com/codex/

• ChatGPT: https://chat.openai.com/

• Claude Code: https://claude.ai/code

• Sentry: https://sentry.io/

• Divoom MiniToo: https://divoom.com/products/minitoo

Other references:

• OpenAI Codex Security: https://openai.com/index/codex-security-now-in-research-preview/

Where to find Claire Vo:

ChatPRD: https://www.chatprd.ai/

Website: https://clairevo.com/

LinkedIn: https://www.linkedin.com/in/clairevo/

X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.

Transcript

Introduction to GPT 5.5 testing

Welcome back to How I AI. I'm Claire Vaux, product leader and AI Obsessive, here on a mission to help you build better with. Today I have a very special episode for you where I'm going to tell you everything I think about the new GPT 5.5 model, which I've been able to test for the past couple of weeks. Spoiler alert, it is a powerhouse, and I've been able to do things with this model, especially around advanced coding, that I haven't been able to do before with any other model on the

And I'm gonna show you how it breaks my personal high-tech eval hacking into this little computer. Let's get to it. So before I tell you what I built with GPT 5.5, let me tell you a little bit about the model itself. So today OpenAI is releasing GPT 5.5 and GPT 5.5 Pro into Codex and Chat.

What is GPT 5.5 and how much does it cost?

Not available in the API quite yet. And this model I've been testing for the past couple of weeks, and I will tell you what OpenAI is saying is true. They're saying that it has a higher capacity for complex work and is more efficient, including being more token efficient. getting that work done. And so the whole idea with this model is it's smarter and it's more efficient. So you're gonna get more done. And that has really been my experience.

Now, I'm glad it's more efficient because it is expensive. GPT 5.5 is$5 per million input tokens and$30 for output tokens, and GPT 5.5 Pro is which has powered all this work that I've been doing is thirty for a million input tokens and a hundred and eighty dollars for output tokens. So this is a pricey one, but when I reflect on what I was able to achieve with this model in early testing.

I'm gonna I'm gonna pay I'm gonna pay the intelligence tax because I think what I was able to achieve is really important and This is one of the things that I think about a lot when I'm testing these new models or testing these new tools.

You know, everything has an ROI and there can be an ROI in terms of speed. So can I get the things done that I want to get done faster? And that's certainly been an accelerant from an AI tooling perspective and something we've all experienced for the past couple of years. But where GPT 5.5 really helps me is ambition. It has been able to do things that literally I have not been able to do before for a couple of reasons.

One, just intelligence higher is solved problems that other models and other hardnesses other than codecs have really had a hard time with. The second thing I've experienced is because the efficiency is higher. I'm able to do more faster without losing context of what I'm working on because it's happening really quickly, or it's being more autonomous so I don't have to babysit as much. So again, I'm getting more done.

So I do believe that what OpenAI is telling us is true, but that's coming out of my own experience spending hours and hours and hours. With this model, throwing problems at it that other models have really had a hard time with, including GPT 5.5. So let's talk about what I built. And folks, for the less technical here.

One of the things I'm gonna say about the model, and I tested it a little bit in Chat GPT, but not a lot, is that I don't know what to do with all this intelligence if you don't have complex problems to solve.

Testing GPT 5.5 in ChatGPT: the intelligence overhang problem

So while I've tested it in ChatGPT in my personal account, which is what I got access to, I don't have complex high intelligence problems to solve in my personal account. And so it was really hard for me to think of where I would use five point five or five point five pro.

in chat GPT simply because the problems I'm solving there aren't that hard. But I did try to solve problems there. So let's just talk about quickly how I used five point five in Chat GPT and what it gave me. And it'll just give you an indication. of what I'm gonna show you a little bit later. But again, I think what the consumer or even the everyday enterprise business user is going to struggle with.

using chat to B T with this model is how many problems do you have that require Super intelligent? So again, I think this is gonna be a model that developers and software engineers really love. And I'm really excited to see what OpenAI does in terms of unleashing. and boxing this intelligence in use cases that then the quote unquote everyday person can use. So that's a little bit of of my lecture on how much we have an intelligence overhang basically.

What did I ask uh ChatGPT GPT 5.5 to do in ChatGPT? Really simple thing. I'm teaching my second grader two digit and three-digit subtraction. He's actually in first grade, but you know, San Francisco. I'm trying to push him ahead. And so one of the ways that I've been able to teach him is build these little apps that help him understand subtraction with two digits and three digits and learn some kind of uh tactics to do that well.

And so I asked it to build an app for me to teach my second grader more advanced subtraction concepts. I haven't been super pleased with some of the vibe coding tools or quad code on this. Nothing's really uh built this exactly how I wanted. So I wanted to give five point five.

A shot at it. And first out the gate, it's a thinker. So you can see here it thought for 17 minutes, 27 seconds about this. You were gonna have this experience with this model. This is gonna be a theme of this mini episode. This thing will think. And it planned a app for advanced subtraction, built the code, all this kind of stuff. Now, here's my question. Do we need 17 minutes of hyperintelligence thinking to build this app?

Probably not. If I wasn't testing for the purpose of this podcast, Would I have waited eighteen minutes for the sabbat? Probably not. So again, what are we gonna do with all this intelligence? Is this the right form factor for, you know, a non-technical software engineer to access it? Not a hundred percent sure. And it built me a app here. You can see it includes mini lessons, word problems, read aloud.

Fine. It's fine. It's fine. It has different modules in it. The design leaves something to be desired. But again, I'm not really going to the GPT models for front end. I really want them to solve my hardest technical problems. And so I would just say in Chat GPT, I'm unsure yet, only because I'm not sure what the average Chat GPT user is really trying to achieve and how much intelligence is required, even on the coding side. And so

I just wanted to start there by saying if you're in Chat GPT, you're using 5.5, let me know your hard intelligence problems so I can test them. I think the like basic vibe code me, a little simple app, it's fine. It's not great. It's not any more in particular impressive than other things on the market, but it does a reasonable job. And then just the sniff of five point five is it's gonna think a lot and it's gonna give you this chain of thought reasoning here.

to let you know how it's thinking and managing its own own Okay, so I'm gonna put away Chat GPT. It's fine. Let's talk about using 5.5 Pro in Codex and

Moving to Codex: where GPT 5.5 really shines

You all I love, I love her. I do. My initial reaction when I first started testing GPT-5.5 in codex is I am. And what I mean by that is I was kicking off tons of tasks in parallel. Because the feedback loop for fast, the efficiency you felt right away, I was knocking off very long-standing tasks with. tons of subtasks underneath them and I'll give an example of what those are. And I was able to buy it off a tech debt.

Technical problem in the chat PRD code base that I have wanted to take care of for truly months. It has been plaguing me. And GPT 5.5 blasted. So I want to show you a couple of those examples so you can understand what kind of tasks GPT 5.5 plus codex is really good at. And why I think its intelligence is higher and the way it's configured to work autonomously and efficiently is really beneficial for the software.

So the first thing that I did, which I'm not gonna show you for what will become very obvious reasons, is we used OpenAI's Codex Security product to run a threat assessment and security scan on the chat PureD code base. And It was pretty good. We're we're pretty secure, but it did come up with some low priority or low severity issues that we needed to remediate.

And instead of taking those one by one, what I did is I downloaded the CSV of those issues, uploaded it to Codex, and just said, can you please architecturally review these issues, group them if they're thematic, and then propose a change and then make those. And I will say it just did it. It did it very well. We did human review on that. We did code review on that. And we were just really happy with the quality of execution, but also the fact that I could give it a list.

of generally associated but not single project tasks. and it can execute on those well. And the real validation of the quality of that output came when we had uh very quickly after that our annual penetration test and our pen test came back. Super clean. And so I would just say if you have a list, a triage list of technical debt, if you have a triage list of security issues, even maybe front end debt, flaky tests, engineers, pay attention.

You can throw that list at GPD 5.5 and it will get that list done. So that's use case one that I thought was really efficient and great. Use case two, and I'm so disappointed it cleared how hard it worked on this project, but I have, as I mentioned, this lingering tech debt in the chat PRD codebase, which is we have millions of chats now for chat PRD, and we were storing those chats.

In various legacy formats, as the model providers, both OpenAI and Anthropic, have changed the shape of their model responses over time. And so TLDR for the folks that are less. Every model in the world has changed a little bit about how they return data via API. Over the past three years, we have a bunch of debt and data debt around that where we were storing legacy formats in our database.

And these legacy formats, because they are AI calls, because they may or may not contain attachments, because they may or may not attack contain tools. Very hard to build a clean, cohesive backfill and sanitization of that data into our go forward data. And I have just been slapping like fix after fix after fix and patch after patch after patch.

on this problem because every time we patch it we find another edge case. So this is an example of a data migration problem with millions of rows, which might not sound big to many people, but is pretty significant to to us in terms of the complexity of the data inside of it. functionally unstructured, lightly structured data with tons of And I just finally was like, you know, GPT 5.5, take me away. Gave the model that problem and it executed.

So well. It built functionally one-shot a solution that covered, I'm not kidding, 98%. Of the edge cases that we had identified. So, first of all, one shot building a complex migration by pointing things to docs and libraries. Very, very good. Something that really been hard for us to do because it was so complex and so unstructured before. The second thing, which I want to show you on the screen now, is I needed GPT 5.5 and codex to validate that work. And so I pulled a production light.

Set of examples. into a test environment. And I asked Codex, look, I need you to figure out a way to programmatically test every thread that's in local. I pulled a local version of this um production like data. post it to anthropic and openai and any other provider that we're we're using. I need you to make a scalable system for our team to do this programmatically, ideally through a CLI, so that any agent can test any thread for these data issues.

And then I've been saying this a lot to uh GPT 5.5. I trust you. This is my my prompt to GPT 5.5. I trust you to make a call. Figure out how to spawn a subagent to do this, test it and identify any issues, repair them, and get this ready for production. Thank you, because I'm very polite. This thing worked for six hours. It was actually five hours and like 57 minutes. Truly, it just banged its head against the wall for six.

And I did not have to. I zero prompts, zero follow-ups, zero steering. I think I had to approve one um script call or something for it to have access to run in its sandbox. But otherwise, it just went for Six hours. I have not seen personally, everybody says, Oh, I'm getting my agent to run overnight. I have not seen it until GPT 5.5 in a very constrained use case. And so this thing will do long-running autonomous.

Tasks that require sort of a loop to understand if it's doing well and moving things forward. It ran for almost six hours and then it implemented the smoke test. It tested all the example data. And after this, we literally, after two million rows, had one edge case that was. And so just like think about that for for a minute. You know, we had two million rows.

one edge case where before we were hitting edge case after edge case after edge case, six hours of GPT 5.5. And then you know what we saw? We saw our error rate just hit the floor in our sentry monitoring. And so People say that AI coding is going to decrease quality'cause people are vibe coding. That is just such an eighteen months or twelve months ago narrative.

I think quality is going to go up. This kind of problem I've truly avoided because the intelligence was not there to do it autonomously. My ability to and our engineering team's ability to like break down the problem and spend the dedicated time to hitting every edge case in our synthetic data really hard. And, you know, every time you like plug one hole, another one pops open. And just being able to hand this to GPT 5.5 and codec.

has changed my life. So again, I am scared about how much this will cost me in, you know, production when those tokens but like, Cheaper than me, cheaper than my engineering team. And it really did run six hours. And so I'm just like, throw this thing at your quality issues. Throw this thing at your bug backlog, throw this thing at a security assessment, and close the quality gaps or performance gaps or security gaps.

in your app, it does really, really, really well. So that's my prime use case. If I didn't share anything else, um, this would be enough. It bit off my largest piece of tech debt in my app. Basically, made my errors go to zero and did it all six hours autonomously in a self sustaining subject. I love UGPT 5.5. But there is a real eval, and I told you this in the intro. My real eval is this thing. This is a DiveVoom Mini 2 retro PC style Bluetooth speaker.

Hacking a Chinese Bluetooth speaker

And tiny screen. And I have been, I am not kidding. I have been hacking a Since January, since late January or February. I think I ordered it around Valentine's Day. And my only goal. is to be able to display funny stuff on the screen. Now it comes with an out-of-the-box iPhone app. And so I can use this proprietary iPhone app to send

images to this thing, but I don't want that. I live in the terminal. I want to be able to do this programmatically. And this is like proprietary code loaded on this device. I was like very deep in Chinese language repositories and documentation from like Bluetooth hardware providers. I was in deep, y'all. And I threw first, I threw Cloud Code at this.

And I said, can you figure this out? Claude Code could not figure it out, even with Opus. I threw GPT 5.4 at it. It could not figure it out. I cannot tell you how. crazy I went with this, but I'm gonna So this is a little device. You think you would be able to plug it in and just say, dear Claude Code, tell me how this device works, make no mistakes. No, that's not how it works.

It connects to your computer or to your phone via Bluetooth. So it is interacting with this app on your phone through Bluetooth. And in the app, I can like draw something and click send and it will display. So I know that over Bluetooth, I can change the display of this app. But we could not figure out how to encode that mouse. What did I do? Well, this is a little peak. This has nothing to do with AI. This has has a peak to how cuckoo bananas, your friend Claire is.

So what I did is I spent truly hours downloading a Bluetooth profiling profile on my phone for developer debugging. I then hooked it up to Sorry, I'm crazy. Hooked it up to a packet sniffer so that when I was using the app here on my phone and it sent an image to this computer, it would log And sniff the packets and tell me what Bluetooth was sending to this this little guy. I threw these logs and kind of all the information that I had at 5.5, and let me show you what.

So I'm gonna get that repo up. Really quickly and show you my desperate prompting. I said this thing is connected by Bluetooth. Take what you know and please just do anything to figure out how to display on this. You have so much information. You should know how to do it. I believe in you. And guess what? This effing thing Did it. It did it. So I My success. Um, my success measure here, which is I was able to build a command line tool where I can run it in terminal, press enter, let's see.

Did the benchmark hit? Hello, it's Hello. This is months, months, months of trying to hack into this stupid thing. It was encoding and decoding bitmap files. It was crawling the web trying to find if there was some secret SDK. Codex, you did the thing. And even better than that. It is now hooked up so that any time I ask Codex to do a thing. It will alert me on this. So let's give it a little try live on the podcast. And then I will get you out of here. But I am telling you.

This pack into a proprietary device. That is my intelligence test now. All right. So let me share my screen really quickly and let's just test if this thing works. So I have my terminal up and I am going to go into Kodak. And I'm gonna say something really simple. I'm gonna say, what can you help me with? And I built into my codex config a notify hook that should do something on here when it's time to be notified. So what can you help me with, dear codex? It's gonna tell me. And

Let's see, it's done. Maybe I'm not paying attention to my computer. Let's see if it runs. It should make a noise. Your move. Well, your move without the E. Your mom. It made a little beepy boop. You all. This is changing. my life. So again, I did three assessments of GPT 5.5. This is the one that impressed me most. I will share more about this. on the blog. I might even do a little mini up on this particular workflow. I'll try to publish the code.

But you all, this was my delight moment. I screamed. My children were blown away. They have seen me slave over this thing. I was sending them messages and saying, hey, and then like responding to their questions by just showing them the screen. I am obsessed. So GPT 5.5 has hit my intelligence benchmark for can you hack into this? Chinese digital screen with proprietary Bluetooth transport mechanisms and bitmap compression. And guess what? Five point five can.

Final thoughts on GPT 5.5's intelligence and efficiency

All right, so that is a wrap for our quick review of GPT 5.5 TLDR. I love this thing. It is super smart. It is super efficient. And it will work on its own against complex problems, basically as hard as you ask it. It has solved problems I have not been able to solve before. The only thing that I will leave you with it is that it has the, as I call it, baked potato personality that we've all come to know and love from Kodak. Um, it is a doll doll dollard. But

I learned over the testing of this, if you do slash personality in codex, you're able to change that to something a little friendlier. And while some of my fellow early testers said it had Too much of a Gen Z personality, I said I like to stay young. Give me that Gen Z GPT 5.5. I'll take it any day over the paperbag baked potato personality that you get. Other than that, it's my favorite senior software engineer, staff software engineer.

I'm gonna go blow through a bunch of technical work and I really love this model. So I can't wait to hear what you think and if you figure out a high intelligence test that works in ChatGPT, let me know. Otherwise, enjoy coding and I can't wait to see what you build. Thanks, y'all. Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube, or even better, leave us a comment with your You can also find this podcast on Apple Podcasts.

Spotify, or your favorite podcast app. Please consider leaving us a rating and review which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.
For the best experience, listen in Metacast app for iOS or Android