Inside image generation’s Renaissance moment - Episode 19

May 14, 2026 · 29 min · Ep. 19

Summary

This episode delves into the significant advancements of OpenAI's ImageGen 2.0, highlighting its rapid adoption with over 1.5 billion images generated weekly. Product lead Adele Li and researcher Kenji Hata discuss breakthroughs in photorealism, accurate text rendering, and multilingual capabilities. They explore diverse use cases from personal creative expression (like imperfect "MS Paint" styles) to professional tools for education, marketing, and design, and hint at future "creative agent" functionalities, emphasizing ImageGen's role in an "AI Renaissance" alongside tools like Codex.

Episode description

People are generating over 1.5 billion images a week in ChatGPT. In this episode, Product lead Adele Li and researcher Kenji Hata share some of the new use cases and trends since the launch of Images 2.0. Together with host Andrew Mayne, they trace the progress from the early DALL-E days and dive into the latest capabilities, including better text rendering, photorealism, multilingual support, world knowledge, aspect ratios, and character consistency. They also explore what comes next as image generation models evolve into more capable creative assistants.


Chapters

00:36 How Adele and Kenji came to work on Images

02:27 Images 2.0 launch reception

05:25 Productivity use cases and 360 images

09:34 Viral trends, authenticity, and imperfection

10:51 Training breakthroughs and photorealism

14:06 Evals, prompting, and creative control

22:16 Creative agents and what comes next

22:27 Images + Codex

28:08 Prompt tips

Hosted on Acast. See acast.com/privacy for more information.

Transcript

Intro / Opening

B

Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. On today's episode, we're talking about Images 2.0 with researcher Kenji Hata and product lead Adele Li. They'll discuss why the new model represents such a major leap forward, the evaluations that mattered most during development, and what people are creating with it now that it's widely available.

A

If DALL-E was the Stone Ages, ImageGen 2.0 is the Renaissance. It's not only great artistically and aesthetically, but it also incorporates, you know, science, art, architecture, all in one image.

C

We looked at it and we were like, all right, this is better than Images 1.

🎵 Music

How Adele and Kenji came to work on Images

B

Adele, tell me a little bit about how you became a product manager here.

A

So I joined OpenAI a little over two years ago. And before OpenAI, I was an investor my entire career.

B

Oh wow.

A

So I was in private equity and spent three years at Redpoint Ventures investing in AI and software companies. And when I first joined OpenAI, it was for a completely different role. I was thinking about how we build out our data and compute infrastructure. Over time I made my way over to the product side, and for the last six months I have been working on ImageGen.

B

It's interesting how you find yourself going from one world into this space here, which is kind of cool, you know, to think about the idea that you have this ability to be useful in different ways.

A

Absolutely. And I think the role of a product manager is just to do the job that needs to be done, no matter what it is. And for ImageGen in particular, it's been really awesome to flex a lot of different muscles when it comes to building products, working with researchers like Kenji, but also thinking about what is the gap in the market today that we want to fill and what is the opportunity that we want to grasp here. It's not the same market that it was a year ago when we first released ImageGen 1.0.

Now it's a very different landscape. There are multiple image generation makers out there, and ChatGPT is a very different company and product itself too. And so really thinking about the evolution of ImageGen and its role within ChatGPT has been really, really exciting to me.

B

Kenji, how did you end up working on images?

C

Actually, when I first started at OpenAI, I also started about two years ago. I was working on some random audio project initially; that was just my first project. And then at the time I found my way into helping them work on Images 1.0 prior to the launch. And so gradually I moved more and more onto the project, and then I became full time on it, basically.

Images 2.0 launch reception

B

What has the reception been like right now for the moment?

A

In the last two weeks since we launched the model, usage is up more than fifty percent. More than 1.5 billion images are generated every week on ChatGPT. And we've seen viral trends emerge across the world, all the way from trends in Asia for color analysis and stickers to crayon and scribble art going viral in the US.

But also a lot of people are exploring emergent use cases. And I think it shows the dynamic range of the model, but also how people are able to visually grasp the advancement of the model almost immediately. I think the visual reaction that we've seen from our users, for them to say, hey, this is the best, highest fidelity, highest quality, and most aesthetic model that we've seen, has been really awesome.

B

This felt like a really big shift, almost worthy of not even being an Images 2, but almost like just a new paradigm, because the capabilities are through the roof. What made that possible?

A

When we started working on this project, I think we sat down and we discussed: what is the step change of capability and use cases that we wanted to build towards? And we believe that image generation has the ability to do so much more than what it does today. You could distill every single output or piece of visual content that you see today into an image. And so that was the mandate that we set out to improve. And with this 2.0 model, we've improved on various different dimensions.

One is text rendering. The text on a page is so much better in fidelity. The language and words actually make sense and they're actual words. The second is multilingual, so we've really focused on making this model work in various different languages. And we're already seeing that people across the world in Asia and Europe are really resonating with these advancements. The third is photorealism. I think we really saw a lot of feedback from our previous models that the output wasn't very realistic or altered their face or their bodies. And so one of our mandates was: how do we actually make the image feel more like yourself?

And so all the things that you think that the model knows, it does, because it has imbued the knowledge of the world into its conscience and is able to visually communicate that back to you as a user. And so putting that all together, I think we really get a state-of-the-art image generation model that is the best aesthetic model out there on the market right now, one that really represents a new paradigm for image generation, which is a huge part of, I think, AI progress at large that we have an opportunity to work on here.

C

We often listen to feedback on social media too. So we kind of just take all these things and basically are just aware of them and try to make sure that they're mitigated, or completely fixed in some cases, in the next iteration.

Productivity use cases and 360 images

B

What kind of use cases are you seeing? What are you seeing people do with this now?

C

I think one that's particularly close to the research team in general is infographics, text. I think text in images is so much better nowadays, so I think it just opens up a lot more productive use cases. And from the research side, we kind of think, you know, image generation used to always be about fun and maybe unproductive things, but now we're really seeing steps forward into productivity and image generation for any type of use case that you can imagine it for.

B

So you mentioned text. I remember with the early models, no disrespect to chimpanzees, but getting one to spell, like, "OpenAI" even looked like a chimp did it. And now I'm looking at pages of text and finely detailed stuff. And I know that as models get smarter, variable binding, the ability to put things next to each other, improves, but this was just a big improvement.

C

Yeah. But I don't think it's completely unexpected. I think you see a lot of growth, well, first you see it between DALL-E 3 and, you know, GPT Images 1. If you ask for a grid of random objects, you go from maybe five to eight in DALL-E 3 to maybe around sixteen in Images 1. And then with 1.5 we went to about twenty-five to thirty-six consistently. And I think now we could probably do over a hundred.

I think this is a test that we might do internally: we just ask ChatGPT, "Give me a list of a hundred random objects," right? And then we just send that to our image generator and see how many are correct. And usually, you know, it'll get almost all hundred.

So you see the constant growth over time. I don't think it's completely unexpected; it's just a steady pace.
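Editor's note: the grid test Kenji describes (ask for N objects in a grid, then count how many come out right) can be sketched roughly as below. This is only an illustration under stated assumptions, not OpenAI's internal eval. The function names `build_grid_prompt` and `score_grid` are hypothetical, the prompt wording is invented, and the image-generation and judging steps are deliberately left out because the episode does not describe them.

```python
import math


def build_grid_prompt(objects):
    """Build a prompt asking for every object in its own cell of a near-square grid.

    Hypothetical helper: the wording of the prompt is an assumption.
    """
    if not objects:
        raise ValueError("need at least one object")
    cols = math.ceil(math.sqrt(len(objects)))   # 100 objects -> 10 columns
    rows = math.ceil(len(objects) / cols)       # 100 objects -> 10 rows
    listing = ", ".join(objects)
    return (
        f"Draw a {rows} by {cols} grid with exactly one object per cell, "
        f"in this order, left to right and top to bottom: {listing}."
    )


def score_grid(expected, observed):
    """Return the fraction of cells whose observed label matches the expected one.

    `observed` is whatever a human or automated judge read off the image,
    in the same cell order. Missing cells count as wrong.
    """
    if not expected:
        raise ValueError("expected list must not be empty")
    hits = sum(
        1
        for i, want in enumerate(expected)
        if i < len(observed) and observed[i].strip().lower() == want.strip().lower()
    )
    return hits / len(expected)
```

For example, with `["apple", "kite", "anvil", "teapot"]` as the expected list and a judge that read `["Apple", "kite", "hammer"]` off the image, `score_grid` returns 0.5: two cells match, one is wrong, and one is missing.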

B

That was a test I used to use for the really old models, back with Ada, Babbage, and Curie: list a hundred science fiction books, and some of them, by the time they got to like twenty-two, would just start repeating stuff. So we've seen other stuff too, like 360-degree panoramas. How did that happen?

A

That really came from an emergent capability of the model, which is the ability to render images in any aspect ratio. We discovered that people were generating really long, amazing panoramas, you know, skinny bookmarks as well. And one of the cool capabilities of the model is that not only were you able to generate images in this panoramic aspect ratio, but you could also render images in the style of 360. And we saw that it was really fun to actually view these images in a 360 world itself. And so that was a really fun feature that we ended up adding into the product, and it's available.

B

One thing I did was I made a version of dogs playing poker. I put that in there so you could sit there like you're one of the dogs looking around in there, which was not something I expected, but it's...

A

It's fun. Yeah. I mean, it's really awesome to see how people are exploring new use cases and the fun things that they're creating with the model, even far beyond what we expected users to be using it for. I think when we were designing the model, we were really deliberate in understanding what people really wanted to see from image generation.

There was a lot of latent demand in image generation. You know, people were mostly using it for personal use cases, but we definitely saw a lot of inklings of uh people wanting to push the model in certain directions that the model wasn't good at. So text rendering was definitely one of those dimensions that we really wanted to improve on. Multilingual was another. And I think world understanding generally is so much better in this model.

And that typically means that, you know, now people online are sharing a bunch of examples of them using ImageGen for all different kinds of use cases that we didn't even think existed out there. So I think the model understands aesthetic beauty across multiple different outputs, whether that's a fun meme, an image for a five-year-old, or a professional consulting deck. The expansion of opportunity and outputs has been amazing to see in this latest model.

B

It's funny too how one of the things that was trending was taking popular images or photos of people and then having the model make kind of janky-looking Microsoft Paint versions.

Viral trends, authenticity, and imperfection

A

Yes.

B

And did you think that was something you would see, that people were going to use this incredibly capable tool to then go make, you know, these silly-looking things?

A

Yeah, it's funny, 'cause it takes a lot of intelligence to actually create something that is imperfect.

B

That's what I tell people all the time.

A

Yeah. And it's definitely very interesting in the viral trends that we're seeing online right now. One thing that I think people are really striving for is authenticity, imperfection, nostalgia. We're seeing that in the MS Paint prompt, crayons, all different kinds of generations that people are creating. And that really feels like the theme for consumers: they want to interact with AI in a very authentic, imperfect way. They want to show their imperfections and use AI to help make them look good, but also show a more fun and goofy side of themselves.

And I think that self-expression via AI is something that we're really excited about. And, you know, I think it's really part of our mission as a company to make it easier for people to learn more and to distribute that intelligence, but also to let them express a version of themselves that maybe wasn't possible before.

Training breakthroughs and photorealism

B

Kenji, was there a moment with this model where you were saying to yourself, wow, I think this is ready to go?

C

You know, as it's training, we take a checkpoint and then we just sample from it, right? And just see, okay, how good is this thing? And I think we just sampled an image from a checkpoint, and we looked at it like, all right, this is better than Images 1. Okay.

B

I remember watching the iteration of one of the early versions of DALL-E and how at first it was sort of wispy, sort of weird, sort of a tendril kind of thing, and talking to one of the researchers, like, "Is that gonna go away?" And it was like, "I think we're probably two runs away from that." And then, just like that. The ability to predict that was amazing to me, and all of a sudden everything got crisp and clear. And then also, you know, years ago I'd played with GANs, and doing those things, you'd have to squint and say, "I think it's a pickup truck" or something like that. So it's interesting what you see when you say, okay, this just all of a sudden got much better.

C

Yeah. I mean, it was just very obvious. You just take the early checkpoint, you sample an image from it, and then you sample an image from, you know, Images 1, and you just look at the two and you're just...

B

There's no why don't I like this garbage? This is a

C

But what the image was, it might have just been a picture of a woman at the seaside, you know, overlooking the sea. We just looked at it and we're like, all right, this is, like, no question.

B

That was the big jump: the photorealism, going from something that looked more like a glossy, idealized magazine cover to something that looked like a really good photograph. Help me understand, besides just more compute, how did this happen? How did you get a model that's much better and that also doesn't take an hour to generate an image? I remember in the DALL-E days, we would literally have to say, you know, tell us what you want, and then an hour later it'd be on Instagram. Now these things are in ChatGPT and it's faster. How is it getting more intelligent while you're maintaining the same speed?

C

I think we learned a lot in each release, like between 1, 1.5, and now 2. And so we take each of the learnings that we've made. Like, for example, speed, right? One of the things is, oh, can we make the model more token efficient, or something like that? And, you know, we did a lot of work to make it produce very good images with what

A

I think the post-training for this model was very interesting in the sense that we really had to think about not only whether the model understands world knowledge and how things look and, you know, science concepts, math, et cetera, in an image, but also what is the taste that will resonate with users. You know, what makes the model or output beautiful? How do you make it look realistic? These are all questions that we had to grapple with when we were post-training this model.

Because I think one of the things that was really important for us was that this model was the strongest aesthetic model out there right now, which means that it has more creativity in various different outputs, no matter what that output is, whether it's a professional output or a personal output. And so that range of training and the range of use cases, I think, made training this model a very interesting problem.

B

Do you have any personal favorite benchmark tests you like to do? Things where you say, I want to see it make an image of this.

Evals, prompting, and creative control

A

I have an eval that I call the Me Me Me Eval. It's essentially a hundred photos of myself.

B

Yeah.

A

And my friends and my family. And I put everyone in goofy positions. I have about a card or birthday for every single person. And I think it's a really great eval in the sense that you know the faces of the people around you the best. You also want to create funny things with the model and do things that are relevant. And so one thing that I'm testing as a product manager is not only whether the raw capability of the model is really great, but also whether ChatGPT understands what I want in that context. You know, ChatGPT remembers that I have a brother, that I have a mom and dad, and what they like to do. And so does the model accurately know how to insert pieces of personalization in the moments that matter in the images? These are things that I'm testing for.

B

How about you?

C

Besides the grid one I mentioned earlier, that's probably the one I've used the most. For a while, I think Divya and I were doing a lot on photorealism. We were trying real hard to push on that. I know Divya's favorite one was, like, a woman holding a jug of orange juice.

A

Yeah.

C

There's like so many images of a woman holding a jug of orange juice.

A

Well, I actually feel like the researchers had a more standard set of images than they like to let on.

B

And you get the standard ones: can it do somebody writing with their left hand, with a watch on their right hand and a clock showing a certain time. I think the big leap for Images, probably 1 or 1.5, was a half-full glass of wine.

C

The wine glass full to the brim.

B

Yeah, full to the brim. Yeah, exactly. And there were ways I was able to prompt it to do it, but I really had to get really descriptive, like, you know, "red liquid inside this." This one is so fun to prompt. There was a thing people said: oh, can it do, you know, pixel-accurate pixel art? And somebody was like, no, it can't. And when I hear that, I'm like, okay, let's try. And I found out that if I gave it a 64 by 64 grid and said, go draw the art in there, it did. It was just able to put art in there. And it was amazing to see those kinds of results. The promptability of this is insane. How do you plan for that? Does it just happen? You're like, oh wow, this is better at understanding this?
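Editor's note: the episode does not say whether the 64 by 64 grid was described in the prompt or uploaded as a reference image. Assuming the latter, a blank grid template could be produced with a few lines of standard-library Python, as in the sketch below. The function name `make_grid_pgm` and the cell size are illustrative choices, and the output is a binary PGM (P5) file only because that format needs no third-party imaging library.

```python
def make_grid_pgm(cells=64, cell_px=8):
    """Return the bytes of a binary PGM (P5) image showing an empty square grid.

    White cells (255) are separated by one-pixel black lines (0).
    `cells` is the number of cells per side; `cell_px` is the cell pitch in pixels.
    """
    size = cells * cell_px + 1  # +1 pixel closes the right and bottom edges
    pixels = bytearray()
    for y in range(size):
        for x in range(size):
            on_line = x % cell_px == 0 or y % cell_px == 0
            pixels.append(0 if on_line else 255)
    header = f"P5\n{size} {size}\n255\n".encode("ascii")
    return header + bytes(pixels)
```

Writing the result with `open("grid.pgm", "wb").write(make_grid_pgm())` gives a 513 by 513 pixel template. Most upload flows expect PNG or JPEG, so a conversion step in an image editor would likely still be needed.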

A

People come to ImageGen with very vague prompts.

B

Yeah.

A

Make it better, make me look better, you know, make me cuter. All these things are really vague. And I think it's really the job of the model and the harness to distill that into what users actually want. And I think that's a personality of the model that we've trained over time and that we've really harnessed the power of. And honestly, I think it also yields a lot of really surprising results that people may not expect. And that surprise is just part of the fun of using ImageGen.

B

I've seen two kinds of prompting sort of emerge. I remember back with DALL-E, I thought, oh, I'm a prompt engineer, I'll be really good at this. And, you know, I'd make a raccoon in space and I'd feel proud. And then I'd see an artist, somebody who wasn't a prompt engineer, somebody who actually came from that world, and I'd watch them use their language, and they were doing amazing things. Is that still holding true?

A

Definitely. I mean, we work with a group of artists very closely when we develop this model. And we're very inspired by artists, designers, marketers, all these different professions that I think have a different way of approaching their profession. And one of the things that was very important for us is we wanted to take the inspiration as well as the best practices for those professions and distill that into the way that people interact with the model. And so that's something we've deliberately tried to focus on. One hack that I've seen work really well is the ability to upload inspiration or context into the model. And the model has an incredible ability to take the spirit of that context and translate it into the output.

B

Well, it's interesting, because I think a lot of people worry that, oh, I just push a button and I get something beautiful. And with each model that gets better, it's easier, as you said, to not have to put a lot of effort into it. But when people do put effort into it, they are getting even more amazing results. And it seems like, actually, if you're artistically inclined, you're getting even greater control, because now, like you said, it understands more about what you're talking about when you talk about depth of field and these other things, or whatever you're trying to do. And as you mentioned, it was exciting to see, with earlier models, artists who said, "Oh, I gave it my originals and it gave me these variations and I know which one works." And just seeing that as this real creative amplifier.

A

Yeah, definitely. I think having creative direction or taste or judgment and bringing that to the model is the best way to push it further. I think one thing about this model that I'm really excited about is how it expands the creative outlet for people.

I think creating multiple different styles or types or variations has never been easier than with this ImageGen model. And I think there's also its understanding of different contexts, like the way that it's able to shift from generating an architectural diagram all the way to the aesthetic of a children's book. The ability for it to move so seamlessly across these factors has been really awesome.

B

The ability to do great infographics and diagrams is very powerful. What kind of feedback have you been getting from people in research and education?

C

We actually have an internal alpha channel where we test our models. And in that there's a sub-channel dedicated specifically to educators of any level, from elementary school all the way up to graduate level. One of the coolest things I saw was a biology professor, and he put up these graduate-level textbook page renderings of things I had no clue about. And he said they were perfectly accurate.

A

I think the ability for this model to distill very complex topics into something that is really easy to understand within an image is one of its strongest capabilities. And we've seen this with students and with teachers who are using ImageGen to learn different concepts, to help them create study guides, and also to help create personalized content. I think personalized learning is a huge trend that we're very passionate about. And I think the ImageGen model helps you, as a teacher, create something that every kid can understand in their own language and with their own preferences.

And that is something that we're really excited about. We're also thinking about this in the context of how we bring more of the elements of ImageGen into ChatGPT at large, so that when people are trying to learn concepts, we're teaching them with ImageGen.

B

I remember when I was in school, kind of prior to a lot of multimedia blowing up, posters were a big thing, classroom posters explaining stuff. This really reminded me of how powerful an infographic can be, because it allows you to bring as much attention as you want to it. You can spend the time looking at it, and you can put a lot more detail into it.

A

I think one really awesome visual shift that I've seen with ImageGen is that now, in internal presentations, over fifty percent of the slides are created with ImageGen. And that permeation of communication via images is so powerful when you're trying to explain your concepts or illustrate what you mean. And I think infographics and the text rendering capability, as well as the composition of the text on the page, are incredibly powerful with this model.

The model's understanding of not only what to say, but how to present it, is a superpower. And I'm really excited about future explorations of this, where we can think about how we make this even better: how do we improve the composition and the different kinds of outputs, and also make it editable in the product? These are directions that we're really excited about.

B

How do you see the progression of this? This is great, but typically anytime I talk to somebody at OpenAI about what they're working on, they're like, yeah, this is good, but...

Creative agents and what comes next

A

I think we're still super early in exploring all the different use cases that people are really trying to push the model with. And so one of the things that we're really excited about is what that next stage is for ImageGen, which is to create the creative agent. Ultimately, the agent that can work alongside you and be your creative assistant, really understand how you work, what your preferences are, and what output you want to get to. And to build the product and model ecosystem that helps users kind of have a personal interior designer, personal architect, personal, you know, wedding planner, et cetera, all in one image.

Images + Codex

B

I'll tell you another thing that was kind of amazing. It was, like, books. Every now and then I have a book come out, and I've got to change my social media headers. And I just went and said, oh, find my book cover and, you know, create an appropriately sized social media header that I can put on X or Facebook or whatever. First shot: right aspect ratio, everything.

C

We basically did that from the start, or trained the models to be good at that from the start. I remember I worked on the initial de-risks of, basically, whether it could do any aspect ratio that you ask for.

A

Yeah. You can now really just easily specify the outcome that you want. Like in your case, you're like, "I want promotional material. I don't have an idea." You didn't specify exactly what you wanted, but the model was able to do the research and then give it to you in the style and aspect ratio that was relevant to you. And that's super powerful. We're already seeing this. You know, you're an author. I've talked to real estate agents who are using ImageGen to help them create listings for their apartments or stage their listings. YouTube creators have talked to me about using ImageGen for their thumbnails and promotional content. I've talked to top artists who want to use ImageGen to connect with their fans.

And I think the ability for all different kinds of professions to start to use ImageGen to help them with visual creation is super powerful, especially if you're working in a visual and creative industry. ImageGen is such a hack in your professional toolkit. I think it has to be a part of everyone's everyday workflow in the future.

B

This does feel like, I think, the first time where, for anything I can reasonably come up with, it does a pretty good job of it.

A

We think it's a new paradigm for image generation altogether. You know, we said this in the launch video: if DALL-E was the Stone Ages, ImageGen 2.0 is the Renaissance. And I think that is so true, because the model is not only great artistically and aesthetically, but it also incorporates, you know, science, art, architecture, all in one image together. And I think that composition and the knowledge that the model has just mean that the outputs are so much more trustworthy, are more powerful, and enable so many more use cases.

I think that ImageGen and Codex is also an amazing intersection of the capabilities that we're setting out to create with both ImageGen and coding agents. So many people are using ImageGen as a first step to designing a new website or creating a new app. And I think that intersection of having a really strong aesthetic model, which is image generation, in combination with strong coding abilities means that now you're able to zero-shot really amazing apps from scratch with both of these tools.

B

Yeah, I asked it in Codex. I took my website, I had the image in, and I said, could you create me, you know, some different concepts for it? And it did these contact sheets. I asked for contact sheets, like, give me four images there. And I said, oh, the one on the upper right, can you go make that? And I watched Codex go make that, which was like, this feels like magic. And then they've implemented it as part of pets.

And so if you're using Codex and you say, hey, I want to have, like, I love ravens, so I have a raven. I say, can you make a raven? And then I watched it pull up the ImageGen tool and then iterate and make the sprites for it.

A

Yeah. Sprite sheets are going viral. Same with game design. People are loving using ImageGen to help them create new worlds.

B

Any hints on how to do better sprite sheets?

C

I mean, I've tried to make, you know, GIFs internally, and I think if I just use the thinking mode or Codex and basically just ask it to generate one initial sprite, it's really good. And then you can just say, can you make the

A

The consistency across multiple images has been amazing. We've seen a lot of people try creating 10-page comic books with consistent storylines, or, you know, multi-page slides. I think that consistency of characters and aesthetics is completely unique to this model.

B

That was an example too, where there were a lot of workflows out there for working with image models that were kind of janky, but that you had to figure out how to do. And it's great now, because I can create characters and say, make a character sheet with the different poses and stuff, and just go feed it back in and say, okay, now doing this, now doing that, now doing that. Sometimes what we need is obviously a smarter model, but context length did so much for ChatGPT, did so much for coding. And an image model that's able to reliably use these references is incredibly capable.

A

Yeah, for sure. And we're still trying to improve that as well. It's not perfect today. We're really trying to develop this visual creation layer for people, because every single person has an aesthetic or personal style or preference. And we're really trying to imbue that into the product that we're building so that people can get to the output that they want more easily and faster with ImageGen.

Prompt tips

B

Any parting prompt tips for people?

A

Well, one of the things I would suggest people try is ImageGen thinking. So if you navigate to the thinking or pro models, we have a more powerful version of ImageGen in that experience. And in that model, you are actually able to search the web, analyze files, and leverage tools under the hood, which then yields a better quality and higher composition photo. And the suggestion that I have for prompting that experience is: be open-ended. I think the model will go and do the exploration itself to understand and try to reason and find information that matters. And I also think giving it a sense of an aesthetic is super helpful. Grounding that in a style has been really fruitful for a great result.

C

I think just being very particular about the style or what you like in general. Like, for me, I like minimalist infographics. Sometimes I think the model can be a little dense. Maybe I'm just a simplistic kind of guy, so I just like a very thin, very clean look. So I like that.

B

Adele, Kenji, thank you very much.

This transcript was generated by Metacast using AI and may contain inaccuracies.