
Teaching AI to See: A Technical Deep-Dive on Vision Language Models with Will Hardman of Veratai

Jan 03, 2025 · 3 hr 56 min

Summary

In this episode, Will Hardman and Nathan Labenz delve into vision language models (VLMs), covering their evolution, key architectures, and practical applications. They explore topics like vision transformers, the CLIP model, Flamingo, and InternVL, discussing innovations such as cross-attention and dynamic high-resolution strategies. The episode also touches on benchmarks like MMMU and Blink, and concludes with an overview of frontier labs and the future of VLMs.

Episode description

In this episode of The Cognitive Revolution, Nathan hosts Will Hardman, founder of AI advisory firm Veritai, for a comprehensive technical survey of vision language models (VLMs). We explore the evolution of VLMs from early vision transformers to state-of-the-art architectures like InternVL and Llama3V, examining key innovations and architectural decisions. Join us for an in-depth discussion covering multimodality in AI systems, evaluation frameworks, and practical applications with one of the field's leading experts.

Here's the link to one of the most comprehensive reference documents for VLMs, prepared by Will Hardman: https://dust-mailbox-c73.notion.site/Vision-Language-Models-11b675d75dd480af994cc474a754bb26

Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse

SPONSORS:

Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive

80,000 Hours: 80,000 Hours is dedicated to helping you find a fulfilling career that makes a difference. With nearly a decade of research, they offer in-depth material on AI risks, AI policy, and AI safety research. Explore their articles, career reviews, and a podcast featuring experts like Anthropic CEO Dario Amodei. Everything is free, including their Career Guide. Visit https://80000hours.org/cognitiverevolution to start making a meaningful impact today.

CHAPTERS:

(00:00:00) Teaser
(00:00:55) About the Episode
(00:05:45) Introduction
(00:09:16) VLM Use Cases
(00:13:47) Vision Transformers (Part 1)
(00:17:48) Sponsors: Oracle Cloud Infrastructure (OCI)
(00:19:00) Vision Transformers (Part 2)
(00:24:58) OpenAI's CLIP Model
(00:33:44) DeepMind's Flamingo (Part 1)
(00:33:44) Sponsors: 80,000 Hours
(00:35:17) DeepMind's Flamingo (Part 2)
(00:48:29) Instruction Tuning with LLaVA
(01:09:25) MMMU Benchmark
(01:14:42) Pre-training with Qwen-VL
(01:32:13) InternVL Model Series
(01:52:33) Cross-Attention vs. Self-Attention
(02:14:33) Hybrid Architectures
(02:31:08) Early vs. Late Fusion
(02:34:50) VQA and DocVQA Benchmarks
(02:40:08) The Blink Benchmark
(03:05:37) Generative Pre-training
(03:15:26) Multimodal Generation
(03:37:00) Frontier Labs & Benchmarks
(03:47:45) Conclusion
(03:53:28) Outro

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/nathanlabenz/
Youtube: https://www.youtube.com/@CognitiveRevolutionPodcast
Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk

Transcript

Is multimodal understanding in an AI important on the path towards AGI? It's not entirely clear that it is, but some people argue that it is. So one reason that one might want to research these things is to see if by integrating information from different modalities, you obtain another kind of transformational leap in the ability of a system to understand the world and to reason about it.

I would say, in inverted commas, similarly to the way we do. For open source researchers, the last few months have really seen the arrival of these huge interleaved data sets. The pre-training dataset size that's available... I'm kind of amazed that the Perceiver Resampler works, because it feels to me just like tipping the image into a blender, pressing on.

And then somehow when it's finished training, the important features are retained and still there for you. Hello, happy new year, and welcome back to The Cognitive Revolution. Today I'm excited to share an in-depth technical survey covering just about everything you need to know about vision language models, and by extension, how multimodality in AI systems currently tends to work in general.

My guest, Will Hardman, is founder of AI advisory firm Veritai, and he's produced an exceptionally detailed overview of how VLMs have evolved, from early vision transformers to CLIP's pioneering alignment work to today's state-of-the-art architectures like InternVL and Llama3V. We'll examine key architectural decisions like the choice and trade-offs between cross-attention and self-attention approaches,

techniques for handling high-resolution images and documents, and how evaluation frameworks like MMMU and Blink are revealing both the remarkable progress and the remaining limitations in these systems. Along the way, we dig deep into the technical innovations that have driven progress from Flamingo's Perceiver Resampler, which reduces the number of visual tokens to a fixed dimensionality for efficient cross-attention.

to InternVL's dynamic high-resolution strategy that segments images into 448x448 tiles while still maintaining global context. We also explore how different teams have approached instruction tuning, from LLaVA's synthetic data generation to the multi-stage pre-training approach pioneered by the Chinese research team behind Qwen-VL.

Our hope is that this episode gives anyone who isn't already deep in the VLM literature a much better understanding of both how these models work and also how to apply them effectively in the context of application development. Will spent an estimated 40 hours preparing for this episode and his detailed outline, which is available in the show notes, is probably the most comprehensive reference we've ever shared on this feed.

While I have not worked personally with Will outside of the creation of this podcast, the technical depth and attention to detail that he demonstrated in what for him is an extracurricular project was truly outstanding. So if you're looking for AI advisory services and you want someone who truly understands the technology in depth on its own terms, I would definitely encourage you to check out Will and the team at Veritai.

Looking ahead, I would love to do more of these in-depth technical surveys, but I really need partners to make them great. There are so many crucial areas that deserve this kind of treatment, and I just don't have time to go as far in-depth as I'd need to to do them on my own.

A few topic areas that are of particular interest to me right now include first, recent advances in distributed training. These could democratize access to frontier model development, but also pose fundamental challenges to compute-based governance schemes. Next, what should we make of the recent progress from the Chinese AI ecosystem? Are they catching up by training on Western model outputs, or are they developing truly novel capabilities of their own?

There's not a strong consensus here, but there's arguably no question more important for U.S. policymakers as we enter 2025. I'm also really interested in biological inspirations for neural network architectures or any comparative analysis of human and artificial neural network characteristics. The episode that we did with AE Studio stands out as one of my favorites of 2024, and I would love to have a more comprehensive understanding of what we collectively know about this space.

I'm similarly interested in the state of the art when it comes to using language models as judges or otherwise evaluating model performance on tasks where there's no single right answer. This is a problem that we face daily at Waymark, and which seems likely to have important implications for how well reinforcement learning approaches will work and scale in hard-to-evaluate domains. Finally, for now, I would love to catch up on the latest advances in vector databases and RAG architecture.

I've honestly been somewhat disillusioned with embedding-based RAG strategies recently, and I've been recommending Flash Everything as the default relevance filtering strategy for a while now. But I do wonder, what might I be missing?

In any case, the success of our previous survey-style episodes, including our AI Revolution in Biology episodes with Amelie Schreiber and our Data, Data Everywhere, Enough for AGI episode with Nick Gannon, suggests that people find these detailed overviews to be a helpful way to catch up on important AI subfields.

So if you have or are keen to develop deep expertise in an area that you think our audience would benefit from understanding better, please do reach out. I'm open-minded about possible topics and very interested to hear what you might propose. You can contact us, as always, via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. And of course, I'm more than happy to give you the chance to plug your product or services as part of your appearance on the show.

Now, I hope you enjoy my conversation with Will Hardman, AI advisor at Veritai, about all aspects of vision language models. Will Hardman, AI advisor at Veritai and AI scout on all things vision language models, welcome to the Cognitive Revolution. Thanks, Nathan. Great to be here. Yeah, I'm excited about this. We've talked about this for a few months now and you have put a real Herculean labor into... a very deep dive into all of the techniques, data sets,

different strategies, variations on making vision language models work. I think this is going to be a really interesting crash course and overview on all that. And I think it's something that I know that I want and need. And I think a lot of people will really benefit from getting the sort of fast forward version of all the research you've done.

Thanks for putting all the legwork in upfront to make this happen. And basically what I want to do today is just kind of give you the floor, let you take us through everything that you have found to be important in vision language models. And I'll certainly have my questions along the way as we go. But yeah, I'm excited for this.

Cool. Yeah. So this would have been a lot easier to compile if the field had stayed still for five minutes and the leaderboards hadn't jiggled around every day and new papers hadn't come out every week, making me think, hmm, we should probably include those. But we're kind of at a checkpoint in time, so it's worth saying we're recording on December the 20th, 2024.

We're still not at the end of the OpenAI 12 Days of Christmas. So something may change tomorrow. A paper may be released tomorrow. So this is a kind of a point-in-time view of vision language models. And I guess it's a deep dive if, you know, you're coming from the perspective of someone who's interested in AI, but not super familiar with vision language models.

But if we're talking about vision and language more specifically, it's definitely not a deep, deep dive into the research, because it's huge and there's so much going on and some of it's so complicated and there's so much we could cover. So all we can do is just stick to a few things. Firstly, let's look at some of the most important architectures and some of the trends in research, and we'll illustrate these through some of the most notable models from the last couple of years.

And these will be models that anyone who's working or building in this space is quite likely to encounter. And then we'll talk a little bit, I mean, touch briefly on the key data sets and some of the benchmarks. And one benchmark in particular we'll explore in a bit more depth because it's really interesting. And then I guess we'll talk a bit about recent attempts at what we call true multimodality. So a vision language model is really

reading kind of audio, sorry, images and text inputs, and then reasoning about them. But true multimodality would be generating images as well. So we'll kind of come onto that at the end. And then we'll finish up, I think, just by taking a kind of as-of-today snapshot: what's best in class across some of the key benchmarks, where do I go and get it, and what can I do with it?

Sounds good. Yeah, I'm already taking away that you're not classifying me as a truly multimodal entity, inasmuch as I can't produce image output. So, talk about the bar rising quickly, I think I'm already outclassed by what you're calling the true multimodal models. I mean, you can't draw? Not very well. Not well enough that you'd see it behind an API anytime soon. That's for sure. Yeah. So in that case, I'm like you. I mean, I can just about doodle. That's about it.

So let's just start off, like all good research reviews, with a motivation section: why do we care? Because obviously there's lots of interesting use cases for VLMs. It was really interesting recently when you had the team from Google who were talking about the new Gemini APIs. One of the things they said was loads of people are building with large language models. Relatively few are building right now with visual language models.

And that's, I think, going to be a growth area next year. So that's kind of cool. There's loads of use cases. The obvious ones are like medical assistance, being able to look at image modalities as well as patient history and then say things about the patient that might be useful to the clinician.

But I mean, other use cases, you know, content filtering, for example, and knowing what is in an image and text, for example, if you were looking at a social media platform and you're trying to screen out images or content of concern; indexing large quantities of archival material or product catalogs or something like that, where you've got both visual components and text components. You want to better understand what is this product, given I can see it and given I've got some information about it.

But I've also seen applications, for example, in insurance, where people have photos of cars. For example, there's a description of what's supposed to have happened to the car; it might be some damage. And the question is, can I actually see the damage in the image? Does it kind of reflect what the person said in the report, yes or no? There's various use cases. But I think, beyond that, there's kind of two other reasons we might be interested in vision language models.

One of them is that in building VLMs, what you're learning to do is integrate two modalities. They start off very separate, and somehow you're going to reason over both of them. And then if you can find the recipes for doing this right, then into the future you can think about, okay, can I integrate audio, touch data, LiDAR data, other modalities? And if you think about, for example, robotics of the future,

Just think about the number of different sensory modalities that you need to have a robot cook a meal. You know, it's got to handle everything, see everything. You can think about VLMs as being the first step to learning how to do this, so that we can integrate lots more in the future. So that's kind of a longer term thing. And secondly, and it's a bit more of a philosophical question, is:

Is multimodal understanding in an AI important on the path towards AGI? It's not entirely clear that it is, but some people argue that it is. So one reason that one might want to research these things is to see if by integrating information from different modalities, you obtain another kind of transformational leap in the ability of a system to understand the world and to reason about it.

I would say in inverted commas, similarly to the way we do. We know they don't do things the same way we do. There you go. And I guess the argument against, you know, 'is it important,' would be to say, well, look, frontier language models show lots of evidence of high-level abstraction, world models, sophisticated reasoning. There's no obvious ceiling in performance as of today. Maybe a grounded multimodal understanding of the world is not that important for achieving AGI.

But we'll kind of explore this. Maybe there are some little bits of evidence we'll come across today, which might point us towards one or the other view here. Yeah, I mean, I would be very surprised if we end up... I mean, it just seems like, imagine yourself unable to see, right? It's like, it would certainly be a major hurdle to have to get over. And my guess is that...

And maybe we'll shine some light, so to speak, on this question as we go. But my guess is we'll never really answer that philosophical question of, like, could we have built an AGI that isn't multimodal? Because, you know, to state the most obvious spoiler in the history of the world, there has been a lot of progress, and it seems like, if nothing else, it will be the path of least resistance. Like, multimodal is clearly going to work. The details are to be unpacked, but...

Like, maybe, you know, the sort of philosophical crowd will continue to say, well, we might have been able to do it without multimodality, or it would have been impossible. But, you know, it seems like, in the end, this is going to be the norm and these things are going to walk among us, probably, if I had to guess, sooner rather than later. It's always sooner rather than later, isn't it, in this world? So I suppose before we dive into the first

vision language model that we'll cover, there's probably two important little prefaces we ought to do. One, we'll just talk about vision transformers for a moment. And then we'll also talk about the CLIP model from OpenAI. And the reason is that both of these are going to crop up again and again. So let's just kind of refresh our memories as to what they are. And then we'll kind of dive into the VLMs themselves.

The vision transformer itself. So I'm going to assume we're all familiar with a language model transformer, especially the decoder architecture. The canonical paper here is called "An Image is Worth 16x16 Words," which is from Google about four years ago now, 2020 I think. And previous to this, most vision models had been based on convolutional neural networks. So they've basically been stacking convolutional filters to extract more and more global features from the images.

And the question that the Google team asked was, well, could we use the transformer recipe to build something that understands images? So the recipe is quite straightforward. You take an image and you divide it into non-overlapping patches like this. You then linearize the patches and you have a linear embedding which basically converts them all into tokens. So now we have just a sequence of visual tokens through a learned embedding.

And then we feed these patches one by one into a transformer encoder, and we use full attention across it. So every little image patch can pay attention to every other image patch in the image. This is very similar in thinking to how a model like BERT is trained, because what they do is stick a classification token in and prepend it to the sequence.

And the training objective is, can you classify the image that you've seen into one of a large number of categories? You take the classification vector at the end, and that's what you use to figure out if you've got the right classification. So very, very simple recipe going on there. And the key finding was that if you make these things big enough, these vision transformers, the transformer architecture does beat the convolutional neural networks of the day.

And so that makes it a very useful building block. There's just a couple of things we ought to take away from the design of the Vision Transformer, which is also called a ViT. So I'll probably use the word ViT throughout. The first is to note that the image resolution is going to be fixed by design. So in the original Vision Transformer, it was 224 pixels square. So everything has to be that size when we feed it in. Then we get a fixed number of patches

coming out. For the original training, they would stick, like I said, this classification token in. But when we come to the vision language models later, normal practice is to take the entire sequence of hidden states from the transformer out and use that as your encoded image. So we don't just take the classification vector, we take everything. And that means you could get quite a lot of vision tokens out of such a model. So if you started with 224 times 224 as your image size,

and your patches were 16 times 16, a bit of back-of-the-envelope math tells you you're going to get 196 visual tokens out at the end, which can be quite a lot. Okay, so that's the other thing to note. And I guess the third thing is, and this is just a convenience, when we talk about vision transformers, you'll hear them described like ViT-H16, for example.

And this just tells you something about the dimensionality of the vision transformer. So the ViT is a vision transformer. The H in that stands for "huge." We just have to know that huge means about 600 million parameters. And the 16 tells us the patch size. So that's what we're patching our images up into. So if I use that later on in here, you'll know what I mean. I'll say ViT-G16, for example. It's a "giant" one, which is even bigger than huge. There we go.
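Going back to the token-count arithmetic a moment ago, here is a minimal, illustrative sketch (in PyTorch, using the example numbers from the discussion, not the original ViT code) of chopping a 224x224 image into 16x16 patches and passing each flattened patch through a learned linear embedding to get 196 visual tokens:

```python
# Illustrative sketch of ViT-style patch embedding (assumed dimensions, not the original code)
import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2            # (224 / 16)^2 = 196 visual tokens

image = torch.randn(1, 3, image_size, image_size)        # one RGB image, already resized

# cut into non-overlapping 16x16 patches and flatten each to a 3*16*16 = 768-value vector
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)

patch_embed = nn.Linear(3 * patch_size * patch_size, embed_dim)  # the learned linear embedding
tokens = patch_embed(patches)                             # shape: (1, 196, 768)
print(tokens.shape)
```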

Hey, we'll continue our interview in a moment after a word from our sponsor. So a couple little follow-up points there just to make sure I understand this correctly. One, just to contrast the image attention pattern versus the language model attention pattern, at least the one that we're most familiar with, which is like a look-back-only pattern, right, in language. The attention in the image context is generally all-to-all, right? So there's no, there's not like a sense of

ordering or the sort of, you know, obviously language unfolds token by token. So that's one kind of fundamental difference. It's just that this is more of a, which obviously reflects the modality itself, right? The image is a snap of a scene in time. And that is all on par with each other in the way that it's being processed all to all. The other thing that I wanted to dig in on just a little bit more is.

how the tokenization happens. In language, we have these tokenizers, which sort of try to figure out like what's the sort of optimal way to, and it's interesting that those are typically not.

part of the end-to-end training, right? You sort of have this just this separate sort of bolted-on system that, you know, people have kind of hated and have tried to get rid of for a long time, and maybe there's some signs that that could be about to happen. Something interesting: I was just reading last night a paper from Meta that's about a more dynamic way to patch text, as opposed to these fixed tokens that are predefined. But sure.

You know, if anybody hasn't seen this, you can go to the OpenAI tokenizer and just paste in your text and it will immediately chunk it into bits of text and color code them. And you can see what all the tokens are. So that's a vocabulary of, I think, now up to like 100,000 different little bits of text that text is broken down into before it is translated into numeric form and then processed. And that translation of this token to this vector representation

is fixed. And people often refer to this as one-hot encoding, where there's as many possible input vectors as there are tokens. Yeah. That's a bit different now in this case, right, for images? My understanding, if I'm understanding correctly, is there's not like a fixed vocabulary of possible tokens, right? That's correct. So we're going to use the term tokens quite loosely throughout most of this, because

As you correctly say, text tokens can be mapped back to text through a codebook. You've literally got 80,000 codes. You look it up and you get your byte pair or whatever it is at the end. The same is not the case with visual tokens. They exist on a continuum, as you pointed out.

So to go from your little patch, which is really just a matrix, you've got a few dimensions and channels in there, you're simply going to pass that through a matrix, which is going to generate the vector that you want to stick in to the transformer. And it's learnable. That transformation is learnable. But the important thing is that your tokens are going to come out on a continuum. And they don't need to be quantized at this point. And there's nothing in the transformer architecture that says,

Tokens have to be quantized to a codebook. You can still run the attention mechanism, even if your tokens exist on a continuum like that. Okay, cool. Note that because we're training, like you said, the vision transformer with a classification objective, we don't actually have to decode anything at the end. So it doesn't matter. Then I'll save my next question for when we get a little deeper into the journey here. I think the last thing that's worth

just reflecting on for a second is just how small the images are that are being processed. So I've done a little bit of this. Not recently because these days we have these foundation models where I can just throw basically anything into it. I think there are some hard limits that I've run into occasionally if my image is like north of 10 megabytes or whatever. But typically just throw it in there. They handle it. You as a developer don't really have to worry about it.

With earlier generations of models, you did have to do this sort of pre-processing. It was your responsibility, as a user of the model that somebody open sourced for your convenience, to take your image and basically morph it, or, you know, I guess resize it is probably the right term, into the required size that

you know, presumably just because everything was smaller back then and, you know, compute resources were more limited and the results weren't so spectacular in general. 224 by 224, even until fairly recently, and even with, like, the OpenAI kind of small mode, whatever their sort of low-res mode is. It is remarkable how much performance can come out of these very small images, even when they're dramatically shrunk and they're often, like,

significantly distorted, because your original image might not have even been a square, but you're just kind of like, whatever, I'm just gonna make it a square, I don't care if things are smushed, I don't care whatever happens, that's what we're going to work with. And it's amazing to me how well that actually

works. And I think these days that's getting liberalized for sure, because it's not all low res on like the OpenAI API, but it is remarkable how far that can go. Yeah. Without wanting to spoil the big reveal. Of course, they're not compressing everything to 224 by 224 and using that as the image inputs. There's some much more sophisticated and smarter things going on, at least with the leading visual language models. So we'll see how they do it in a bit.

Okay, cool. Well, let's carry on. That's a great start. Cool. So that was the vision transformer. And I think the other thing that we should introduce right now is the CLIP model from OpenAI, because that's, again, going to be really fundamental. And the paper here is from back in 2021, and they called it "Learning Transferable Visual Models From Natural Language Supervision." And CLIP itself stands for Contrastive Language-Image Pre-training.

So it's a canonical model in the field, and it's a really nice introductory one to study for how we align image and text encodings. So the idea is you start with a vision encoder, which could be a vision transformer, and a text encoder, which could be an encoder-only transformer. And you've got a large dataset of images with their captions that have been scraped from the web.

The process is to jointly train both encoders so that they're generating embedding vectors both for the text and for the images, such that if you take an image and its caption, the two vectors are going to have very high cosine similarity. But if I take an image and a random caption that don't go together, they will have low similarity.

And the way this is done is you simply pass the image through the vision transformer, the text through the text transformer. You then add just a linear projection to get them to the same dimensionality, and then use this contrastive loss function. And what that does is it says, suppose I've got a batch with n image-caption pairs in it.

I know within that that I've got n true pairs and I've got n squared minus n bad pairs. So you set the loss function up to say, penalize any dissimilarity between my true pairs, and likewise penalize any similarity between the non-pairs within the batch. And that's a contrastive loss function. It's basically bringing things that are supposed to be the same close together and pushing the vector representations apart for things which are not the same.
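As a rough illustration of that objective, here is a minimal sketch of a symmetric contrastive loss of the kind described above; it is not OpenAI's actual code, and the temperature value and names are just placeholders.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (illustrative, not OpenAI's code)
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (n, d) embeddings for n image-caption pairs
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.T / temperature   # (n, n) pairwise cosine similarities
    targets = torch.arange(len(image_emb))          # the n true pairs sit on the diagonal

    # cross-entropy pulls true pairs together and pushes the n^2 - n mismatched pairs apart
    loss_images = F.cross_entropy(logits, targets)  # each image should match its caption
    loss_texts = F.cross_entropy(logits.T, targets) # each caption should match its image
    return (loss_images + loss_texts) / 2
```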

And one nice thing about this is that once you've trained it, I mean, obviously you can use CLIP itself for things like image search, but you can also just take the trained vision transformer out of it and use it downstream. And that's really nice because what you've done is you've already in some senses kind of

trained it to embed things similarly to a language model. So therefore, if we were going to put that into a vision language model, it should only be a small step away from being aligned to whatever we're going to use in that model. So there's the thinking there. And that is why, as we kind of go through the rest of today, we're going to see that very often the researchers start with a CLIP. It gets them kind of two thirds of the way there.

So I wasn't going to do a deep dive into the CLIP model, but just to say that, you know, it is a vision transformer. That's how it's trained, using this contrastive loss objective. And we can now use it downstream.

Yeah, I remember using this one in the early days of Waymark small business video creation. We were at the point where we could get the fine-tuned GPT-3 to write a somewhat decent script. You know, certainly things have improved a lot since then. But then you have the challenge of, okay, now that we have this narrative that we want to put together for this random small business, and we've also found all these images off the web, um, what do we

choose out of this bag of images, right? And at the time, you know, prior to CLIP you had, like, classic... you know, most of this stuff was done on finite, sort of pre-established data sets, right? I mean, one of the big advances of CLIP was that they moved from your, like, ImageNet type thing, where you have a certain canonical set of images with a certain fixed set of classifications. And the game used to be about

How, you know, comparing architectures basically, right? Can I come up with an architecture that on this standard dataset does a better job than everybody who came before me in some way, shape or form there, you know, where I can claim to have state of the art and I'm great. That didn't do much at all for us with an application like Waymark because...

It was, you know, maybe could have made it work, but it would have been very noisy because it would have been like, well, the image that I have of this business, whatever it may be. potentially is not even well represented at all by any of the classes in an image net set of classes i think it's a thousand classes that they have in image net and there were smaller ones before that that had fewer classes so This was the first moment, as far as I know, where they went.

Let's forget competing on these standard data sets. What people really want is to understand anything that they might be looking at. And the web-scale data is out there and enough of these images are captioned. And by the way, when you really got into CLIP, there was tons of noise in the caption data. Typically, if you just ask yourself, how do people caption images?

It's lots of different ways, sometimes with jokes, sometimes with just a straightforward description of what's in it, sometimes with a line from a poem. You know, I mean, there's a tremendous amount of noise in that original data set. We found in using it that it was pretty good.

But we'd see all sorts of artifacts and weirdnesses. Like sometimes, if the text that we were... you know, let's say we were, for example, doing a video for a pizza restaurant. There's a pizza restaurant in Detroit that I've always kind of used as one of my go-tos. If you put the word pizza in the text query and run that through the text side and then

put all the images through the vision side, then, you know, each one now is represented by a vector, and as you said, if they're similar, you should have a high cosine similarity. So basically, literally, multiply these vectors together, take the sum, and sort by that. That was the procedure. But sometimes what you would find is if the word pizza in text was in the image, then that would pop to the top. And so you'd have all these sort of weird things to deal with.
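For what it's worth, the ranking procedure being described here (multiply the query and image vectors elementwise, sum, and sort, i.e. a dot product over normalized embeddings) looks roughly like this illustrative sketch; the function and variable names are made up, not a specific CLIP library API.

```python
# Rough sketch of dot-product / cosine-similarity ranking over CLIP-style embeddings
import numpy as np

def rank_images(query_vec, image_vecs):
    # query_vec: (d,) text embedding; image_vecs: (num_images, d) image embeddings
    q = query_vec / np.linalg.norm(query_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = imgs @ q                  # elementwise multiply and sum = one score per image
    return np.argsort(-scores)         # indices of images, best match first
```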

I think largely as a reflection of the fact that this was just extremely noisy web scale data that at the time, they didn't really even necessarily have a great way to clean, right? Because it's like... All this technology is kind of being created out of nothing. You know, these days you would say, well, why don't you filter those, you know, images that have caption image content mismatch or that seem nonsensical.

And I think the answer at the time was basically, we have no way to do that. So we just have to try to throw as much data into this as we can and hope that some signal comes out. And it did. We did also find, and I promise I won't do this every single step, but I have memories of CLIP here: aesthetic quality was basically totally unrepresented. And I guess this would be because a few of the image captions found online are like "a beautiful picture of X" or, you know, "an unattractive picture of X."

But we wanted that, because small businesses have very wide-ranging quality of pictures. Sometimes you get user-generated content stuff posted to their Facebook. Sometimes it's professional. The difference matters a ton, right? They do not want some of these ugly images that might be associated with them online in their marketing. But how could you tell? There was no real aesthetic signal in CLIP. It was all content, but not quality.

So that's crazy. That wasn't all that long ago. No, we're still three years, four years ago. And now we're going to jump to two years ago. And then everything else we talk about is going to be in the last two years. And you're right that one of the stories that's going to unfold is we kind of cover a few of the VLMs today.

is this increasing obsession with filtering data for quality, for both the pre-training and the subsequent fine-tuning stages. And that is to kind of get rid of this problem of noisy data, which really does seem to hurt VLMs in particular. Cool. Well, I think that's enough of a trip down CLIP memory lane for me. So let's keep going. Hey, we'll continue our interview in a moment after a word from our sponsor. Okay, so we're going to jump forwards then to 2022.

A model that I've heard described as the GPT-3 moment for vision language models, and that is DeepMind's Flamingo model. There are a number of really interesting innovations in this, and so it's worth covering this in a bit more depth. And it's also the first example we'll see of actually how the VLM is constructed. Okay. So the basic pattern, and we're going to see this through all the models we cover, is that you're going to encode the two modalities, that is, text and images, separately.

So you can use a text tokenizer, and you're going to use an image encoder, normally a vision transformer. We're then going to select a language model, which is called the backbone. And the backbone is going to be the thing that does all the reasoning of both the text and the images. And that just leaves us with a question of how we connect the two.

So for example, in the Flamingo model from DeepMind, they actually looked at a vision transformer and a convolutional neural network. Everything I've looked at since has been just using a vision transformer. And then they used a Chinchilla language model. Okay, so to connect the two things together, they decided to freeze the language model and then introduce cross-attention layers sandwiched between, I think, every fourth transformer block in the original language model.
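As a rough, hedged sketch of that pattern (heavily simplified relative to Flamingo's actual gated cross-attention blocks, and not DeepMind's code), a new trainable layer that lets frozen text hidden states attend to visual tokens might look like this:

```python
# Simplified sketch of a trainable cross-attention layer inserted into a frozen LM
# (illustrative only; Flamingo's real gated xattn-dense blocks are more involved)
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate starts at zero, so the frozen LM is unchanged at init

    def forward(self, text_hidden, visual_tokens):
        # queries come from the text stream; keys and values come from the image side
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```

Only layers like this (plus the resampler discussed next) would be trained; the language model blocks around them stay frozen.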

Okay, so the idea of the cross-attention layers is that they're going to look up information that we got from the vision transformer. And immediately, on thinking about it, you might see a couple of challenges that need to be resolved for this to work. So the first is, if we've got an input, say, that's got more than one image in it, then obviously when we've encoded those, we're going to have a variable number of image tokens generated.

And if we're using a cross-attention mechanism, so it's similar to self-attention other than the keys and the values are not coming from the text we're decoding, they've got to come from the image, then that's a problem, because obviously the dimensionality of the cross-attention mechanism is fixed. So we need a fixed number of visual tokens generated. So what if we've got two, three, four images, or we've got different sizes of images?

And the second question is, we can sometimes get a lot of visual tokens. So if we want to train really efficiently, is there a way to reduce the number of visual tokens that we're actually going to attend to? That will just make training the model a lot easier. The cross-attention layers don't need to be quite so big.

And so the way that the DeepMind team solved this is, I think, very smart. And they used something called the Perceiver Resampler, which might actually have been their innovation, I think. So the Perceiver Resampler is like a separate model that's going to be introduced. So not more cross-attention layers; we're introducing a separate, fresh model here. It's going to look at the visual tokens.

And it's designed to select out or re-sample the most important visual information that the vision transformer has encoded. But you want to do this in such a way that if you have a very, very, very long sequence of vision tokens that have come from one or more images, what you don't want to do within your Perceiver Resampler, which has its own attention mechanism, what you don't want to do is compute the kind of all-to-all sequence-times-sequence attention matrix, because that could be very big.

So the innovation here is that in a normal attention mechanism, you've got your queries, keys, and values, right? And they all come from the context that you're decoding, right? So the idea here is that the explosion comes from the fact that the queries need to be multiplied by the keys at some point, and that creates this all-to-all attention matrix, which is the size of the sequence length squared. And it is, of course, the bane of transformers.

So what they did in the Perceiver model is they said, well, what if we jettison this need to come up with queries based on context that we've just read? What if we instead have like a small number of fixed queries? And when I say fixed, I mean learnable. So they're just latent factors that can be learned at training time. They selected this to be 64 query vectors, which can be learned at training time.

you're going to look at all of the visual tokens that have come in. Your query is going to be of size 64, so the actual query-key matrix that you calculate is now sequence length times 64, which is way, way, way smaller. The beauty of that is once you've finished the attention calculation in the Perceiver Resampler, you've got something that's

64 times whatever your hidden dimension is. So it could be 768, for example. So again, you get something very small out of it. And the Perceiver Resampler model is a module. It's essentially this, and it does it in a number of stages. But it's essentially just using these learnable queries. And the nice thing is, at the end, we know the size that the visual tokens are going to be. 64 times 768, for example,

which means we can now define a cross-attention layer, which is always going to read visual tokens of that size that have come out of the Vision Transformer. Is that making sense?
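To make that concrete, here is a minimal, illustrative sketch of the resampling idea (assumed dimensions, a single attention layer; the real Perceiver Resampler stacks several such layers with feed-forward blocks): a fixed set of 64 learned query vectors attends over however many visual tokens arrive and always emits a 64 x hidden-dim output.

```python
# Illustrative sketch of a Perceiver-Resampler-style module (not DeepMind's implementation)
import torch
import torch.nn as nn

class ResamplerSketch(nn.Module):
    def __init__(self, d_model=768, n_latents=64, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))  # 64 learnable query vectors
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, visual_tokens):
        # visual_tokens: (batch, seq_len, d_model), where seq_len varies with image count and size
        b = visual_tokens.size(0)
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)          # (batch, 64, d_model)
        # the attention matrix is seq_len x 64 rather than seq_len x seq_len
        out, _ = self.attn(queries, visual_tokens, visual_tokens)
        return out                                                     # always (batch, 64, d_model)
```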

Yeah, and this happens after the initial vision transformer layers, right? So we've still got the all-to-all attention happening on the vision side of the... I can remember the Chinchilla diagram, and even more so I can remember the bowl of yarn soup that I went around showing to people at parties in spring of 2022 and into summer, when I was like,

look at this. This exists now. So you've got basically, again, very similar to CLIP, and this will be a pretty common theme, although there are some exceptions too, but...

Images being processed through one kind of main model, text being processed through another main model. These things are often totally frozen at the time that we now want to figure out a way to fuse them, and the way to fuse them is two parts here: one is cross-attention, but then figuring out how to essentially make life easier for the cross-attention and the kind of main language model that's going to carry things on, by finding a way to

standardize the sequence length for the image. So toward the end of the image processing, with its full attention, then there's this sort of adapter that says, okay, regardless of how big the image was or how many tokens, whatever,

it's always going to be output from this step at 64 tokens. And therefore, the cross-attention can always be the same size and everything downstream of that can get simpler. I think this is a really interesting thing. I mean, I probably didn't have the sophistication at the time to appreciate it, but it's a good indicator of just how malleable all of these latent spaces are, and how

I sometimes say like everything is isomorphic to everything else, which I think probably doesn't actually mean anything, but it's, you know, the intuition that I'm trying to express there is just like, I've seen so many of these things at this point where one space is bridged over to another space or reformed. You almost see this even just with the libraries that are used.

I'm thinking of all the Python libraries that allow you to reshape matrices. It's almost like those matrix reshapings are... You don't think of those as semantic, but... when you scale them up to this sort of thing

a very similar thing starts to happen where you're like, okay, I've got all these, you know, all to all, but what I really need is just a sequence and it needs to be fixed length. And okay, I'm just going to train a thing that no matter what, it's going to output a fixed length sequence and hope it works.

I mean, not everything works. I'm told by the researchers that there are many failures along the way, but it sure seems like almost everything of this sort kind of works. And I think that's a really striking reality. I remember seeing this one, and there was BLIP-2, which was another one that really brought it home for me, just because of how few

new parameters had to be trained. I don't know if you have those numbers like right off the top here, but it's a strong contrast between the amount of parameters and pre-training on both the text and image side versus how small these connectors can be in many cases. Yeah. So one of the teams from Hugging Face did a comparison, and we'll talk about it in a bit more detail later.

They said, in the 7-billion-parameter language model class, if you add the cross-attention layers in, that's about 25% of the parameters that now need training. Whereas they said, if you're going to do it the other way, which is just a simple projection, that's about 10%. So you do introduce a lot more parameters using this kind of cross-attention mechanism that we're talking about, but still it's small compared to retraining the whole language model.

And that's actually one of its benefits, is that you can literally freeze the vision transformer if you want. You can freeze the language model and just train the cross-attention parameters and the Perceiver Resampler. And like you, I'm kind of amazed that the Perceiver Resampler works, because it feels to me just like tipping the image into a blender,

pressing on, and then somehow when it's finished training, the important features are retained and still there for you. It feels to me like they would have all been mixed up. I guess it goes to show intuition, at least for me, doesn't really work when it comes to these very sophisticated mechanisms.

Cool. So let's just say a little bit more then about... so we've got the Perceiver Resampler, we've got the language model, and then to train them, you simply need to switch on your next-token prediction training objective and start training the newly initialized layers in the language model. And that's not too difficult to do.

What's happening is you're replacing where your images would have been in the text prompts that you're giving to the language model during this training. That's prompting it to look up the outputs from the Perceiver Resampler. So these are not tokens that get scored, for example, when you're decoding.

It's prompting it, saying, when I see this token, it's okay, I need to go and look up something in the Perceiver Resampler, and that's going to give me enough context to know what the next token should be, because it's where the image would have been in my input text. And in terms of the training data sets, to see how this actually works in practice:

They obviously used images paired with their alt texts, which is the same as what's used in CLIP, but that's not enough. And one of the contributions that the Flamingo team made was to realize the real importance of what's called interleaved data. So interleaved data is basically data scraped from the web, like HTML documents,

where you've got images and you've got text in the document. And they know, by looking at the document object model when they scrape the website, roughly what order these things are supposed to appear in. That's an interleaved dataset that you can then pass through your decoder, and then you can say, you know, when you get to an image, look up the image, and now I'm going to keep on producing the text.
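For a sense of what an interleaved training sample might look like, here is a purely illustrative example; the "<image>" placeholder token and the field names are made up for illustration, not Flamingo's actual data format.

```python
# Illustrative interleaved web-document sample (hypothetical format, not Flamingo's schema)
sample = {
    "text": "Our new espresso machine arrived today. <image> It ships with a steel portafilter. <image>",
    "images": ["espresso_machine.jpg", "portafilter.jpg"],  # images in document order
}
# During training, each <image> placeholder marks where the model should attend to the
# resampled visual tokens for the corresponding image before predicting the next text token.
```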

They found that it was really, really important for actually maintaining all of the kind of downstream performance metrics. So interleaved data turns out to be super, super important for vision language models. That was one of the major findings of the Flamingo paper. So in terms of how they evaluated it once they'd done the training, it was visual question answering and OCR and captioning. Those are kind of the key tasks they looked at.

And they discovered, firstly, that it works quite well in a few-shot setting. So the training was very compute efficient, because you don't need to touch anything in the language model apart from the newly initialized parameters. And yet the output was still very competitive with much more focused, task-specific models.

So they had a model that could do several things, and it could be competitive with all the task-specific ones. And that's why it was a really kind of, I think, foundational model in vision language models: because it had these several tasks it could perform and it had this one training recipe, strongly based on the transformer training recipe.

And the basic pattern of using cross-attention layers has been used by other teams since, but it's not the only way of doing it. And we'll come on to the other way in just a second. They never released Flamingo to the public, did they? I don't remember ever having a chance to actually use it. I don't think they did. Unlike what we'll come on to in a second and most of what we'll cover today. Yeah, that was before Google was dancing, I guess.

So that's the basic training recipe we just covered for a cross-attention model. One of the things that makes large language models so good at what they do is the instruction tuning. And that's proved to be one of the big unlocks. And in the Flamingo recipe there, it's not kind of clear, you know, there was no specific instruction tuning step. And the difficulty there was, it's just really difficult to come up with an instruction tuning dataset,

right? There weren't such things around. There are now. And how to do this is the contribution of the next model that we're going to look at, which is the LLaVA model, which stands for Large Language and Vision Assistant. And this is a 2023-vintage model. So we've already jumped on like another year.

The original LLaVA model comes from a mixed team of some people from Microsoft and then from academic institutions. And it's the first in kind of a long series of LLaVA models, all of which are kind of based on the same recipe. The big innovation here is instruction tuning and how they built the instruction tuning dataset. They start with a key observation that the generative vision language models that existed at the time they built this

could only follow a relatively limited range of user instructions. So they could do some captioning for you. They can answer some basic questions about images. But they couldn't do anything like the range of tasks that a language model could do once it had been instruction tuned. And one of the reasons they put this down to was just looking at the big data sets that were used to train these models. You have the interleaved text and images from a crawled web document.

And you also had these big captioning datasets. We'll talk about a couple of them in a second. But relatively few of them were actually containing lots of task-oriented labels for the images. And that was just missing. So the question is, how can you build such a thing that would work for a vision language model?

We should probably talk a little bit about the basic architecture of LLaVA, because it actually looks different to the Flamingo model. We've talked about how Flamingo has been described as a cross-attention model, because they've introduced these new cross-attention layers. The LLaVA team chose actually a simpler approach, and this was pioneered, I think, by the Salesforce team that built BLIP. So we can call this the autoregressive architecture rather than the cross-attention architecture.

And the idea is that you're going to take your vision tokens, which have been processed by a vision transformer, even better if it's from a CLIP vision transformer, because that's already been aligned in the past to a language model. And then you're going to train a simple projection matrix and inject the tokens directly into the decoder stream. Okay, so no cross-attention needed here.

You have a simple projection matrix, and then you're going to mix them in. But then in the original LLaVA architecture, they kind of prepended all of the text tokens that were going into the language model backbone with all of the vision tokens. So your kind of training sequences will be a bunch of vision tokens, then text, and you're then attending to everything to the left.
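As a minimal, hedged sketch of that projection approach (illustrative dimensions, not the actual LLaVA source), the connector can be as simple as a single linear layer whose output is concatenated in front of the text embeddings:

```python
# Illustrative sketch of the projection / autoregressive approach (assumed dimensions)
import torch
import torch.nn as nn

vision_dim, lm_dim = 1024, 4096
projector = nn.Linear(vision_dim, lm_dim)             # the simple learned projection matrix

vision_tokens = torch.randn(1, 256, vision_dim)       # output of a frozen CLIP-style ViT
text_embeds = torch.randn(1, 32, lm_dim)              # embedded text prompt tokens

projected = projector(vision_tokens)                  # (1, 256, 4096): now in the LM's embedding space
inputs_embeds = torch.cat([projected, text_embeds], dim=1)  # vision tokens first, then text
# inputs_embeds is fed to the decoder-only LM; ordinary causal self-attention lets every
# text token attend to all of the vision tokens sitting to its left.
```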

So that's the autoregressive architecture. It seems a lot simpler than the cross-attention one, but it's got maybe one or two downsides to it. The first is that if we generate a long sequence of vision tokens from the vision transformer, we've literally now got to unroll all of them in the decoder. So they're all going to be part of the attention mechanism in the decoder.

So that's kind of one disadvantage. The second is that if we've got this projection matrix, as was used, and I think they used a simple linear layer in the LLaVA model, that gives you many fewer parameters than introducing a big cross-attention mechanism.

So you've now got to learn all the alignment using those parameters. And if you want to do anything further, if that's not enough for you, you've got to unfreeze the language model backbone and you've got to start mucking around with the attention mechanism.

So one of the downsides there, as we know, is if we start fine-tuning the language model and the attention mechanism, it's very easy to kind of suffer catastrophic forgetting on some of the tasks that the language model was fine-tuned for in the first place. That's one of the downsides of using this autoregressive architecture that just uses self-attention. But that's what they chose to do with LLaVA.

Yeah, you mentioned BLIP. One of my first episodes, actually, of the podcast was with the authors of BLIP-2, and I remember this sort of, you know, moment of understanding where I was like, wow. So they are somehow converting these images into text embedding space. And treating the images as if they were text from that point on; the rest of the model doesn't even know that it has been fed anything other than

text, because, you know, it's frozen, right? So, I mean, it's been trained to handle text, it can only handle text. But now somebody's figured out how to represent these images as text in the text embedding space. And one of the really interesting things about that was how you are accessing parts of

text embedding space that text itself can never get to, right? It's similar, a little bit, to what we talked about earlier, where you have the one-hot, token-level text encoding, but these image projections into text embedding space are not bound by that. At least they weren't in the BLIP one; I don't think they would have been in this LLaVA one either. And so you're just accessing... you realize that, like, the space of possible inputs

is actually hugely bigger than the space that the actual 100,000-token vocabulary routinely accesses. And that, again, just kind of magically seems to work. It's like one of these, you know, divine benevolence things, where you're telling me we trained an entire language model on nothing but these hundred thousand tokens, and now it's going to just skip that layer and go directly into embedding space with whatever is learned by this projection process.

And the language model is still going to handle that normally? And the answer is yes. It's kind of amazing. It's quite an amazing result. I mean, it's a simple projection matrix that takes your vision tokens and lets you mix them in, and in other models arbitrarily mix them in, with the text tokens, and

It doesn't destroy the performance of the transformer. It doesn't seem to need complete retraining of the attention mechanism. That simple projection matrix is enough to get it to work with the visual tokens, which I think is quite an amazing finding. Yeah, no doubt. I don't like to do analogies all that much, but I'm just trying to sort of imagine, like, is there any sort of analogous challenge that we could pose to a human? I mean, we're obviously natively multimodal, but

to think about putting something into word space that is not words. It's like kind of going directly into thought space, I guess. And yeah, there were a few of these moments. For me, I think one of the reasons that I got so obsessed as early as I did, not having had an academic background in the subject, was seeing how... The same ish architecture was working across modalities and then starting to see these bridges where it was like, boy.

If you can just take two frozen things and kind of mash them up in a couple different ways and they all kind of seem to work, then that means we are headed for a world where all of this is going to be integrated and it's going to be... quite something to see when that happens so i these were for me like very leading indicator moments of just how much was almost

certainly going to be possible. It was like if the language model could do that and not break, then I think, you know, we've got to expect that there's going to be just a lot more Frankensteining to come, much of which might be weird in all kinds of ways, but.

Yeah, absolutely. So we said we'd talk a little bit about how they did the instruction tuning with the LLaVA model. And this is, I guess, what the main contribution of the paper was. We've talked about what the autoregressive architecture looks like. So the way they did this is actually really, really smart. They started with images from a dataset called COCO, which was produced by Microsoft, I believe, back in 2014.

And the idea behind that: there are about 200,000 images in the COCO dataset. You've got images, and then you've got descriptions with bounding boxes in them, describing what's in different areas of the image. And that actually appears in the text: it says, in this region, this thing; in that region, that thing. The idea is to teach visual grounding for vision models.

So what the LLaVA team did, which is very smart, is they used a strong model, I think GPT-4, not a vision model, just a language model, and a very careful and clever set of few-shot prompting templates. And they asked GPT-4: can you generate a conversation between a questioner and a vision assistant, framed as though the assistant can see the image, even though GPT-4 cannot see the image?

So, for example, given the description of an image, which might say, here's a bunch of people standing around a vehicle, there's luggage on the floor, and then some bounding boxes: here's a piece of luggage, here's a bicycle sitting at the side, here's a person. So GPT-4 reads that and now comes up with a question, as though it could see the image.

So the question might be: what type of vehicle is in the image? And that's easy for GPT-4 to answer, because the caption will say what kind of vehicle it is. But then you can think of more intelligent questions. For example: what is the thing to the left of the car? And we know from the bounding boxes that the thing to the left of the car is a bicycle. So without even needing to see the image, you can come up with that question simply by computing where the bounding boxes are.

Or you could ask a more in-depth reasoning question, like: what challenges do the people around the car face? The people are loading luggage into the car, so if you just saw the image, you would have to infer all of this from the image. The really smart thing here is getting these conversational dialogues built up by a strong model like GPT-4.

And then you've actually got all of this data you can use for instruction tuning. So you've got the questions, and you've also got the model answers, because GPT-4 knows what the model answer should be. Remember, it came up with the question and the answer using all the information it could get from the descriptions and from the bounding boxes.
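Here is a rough sketch of what that data-generation loop can look like. The prompt wording, the `ask_strong_model` helper, and the box format are all hypothetical stand-ins, not the LLaVA paper's actual templates.

```python
# Hypothetical sketch of LLaVA-style instruction-data generation.
# `ask_strong_model` stands in for a call to a text-only model such as GPT-4;
# the prompt wording below is illustrative, not the paper's actual template.

def ask_strong_model(prompt: str) -> str:
    raise NotImplementedError("call your text-only LLM of choice here")

def make_instruction_example(caption: str, boxes: list[dict]) -> str:
    box_lines = "\n".join(
        f"- {b['label']} at (x={b['x']}, y={b['y']}, w={b['w']}, h={b['h']})"
        for b in boxes
    )
    prompt = (
        "You are writing a dialogue between a curious user and a vision assistant.\n"
        "You cannot see the image, but you know its caption and object boxes:\n"
        f"Caption: {caption}\n"
        f"Objects:\n{box_lines}\n"
        "Write questions (and model answers) as if the assistant can see the image, "
        "including at least one spatial question and one reasoning question."
    )
    return ask_strong_model(prompt)

# Example input drawn from the kind of COCO annotation described above:
caption = "A group of people standing around a vehicle with luggage on the ground."
boxes = [
    {"label": "suitcase", "x": 120, "y": 300, "w": 80, "h": 60},
    {"label": "bicycle", "x": 20, "y": 280, "w": 110, "h": 90},
]
```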

And that's kind of the genius here. When they built the LLaVA model, they froze the vision transformer, and then they updated the language model they were using and, obviously, the projection matrix, which is going to align the two. And they only did the pre-training on about 600,000 pairs from a captioned-images dataset.

And they used a very simple form of their instruction question answering for the pre-training. It was something like: here's a question that GPT-4 has come up with about the image based on the caption; then you show the image; and then what you want the language model to do is finish with the answer to that question. And then for the fine-tuning, that's when they use this more sophisticated instruction tuning dataset they came up with, which includes multi-turn conversations, questions about the images, questions about what's in particular regions of the images, and these reasoning questions again. They were able to generate 150,000 of these examples using this process, and they used that as the instruction tuning dataset.

And when they evaluated it at the end, they found that it was outperforming all of the other vision language models at the time on complex reasoning tasks over images. And they found it was also slightly better at conversational tasks than anything else. So what they showed was that putting all this effort into being really smart about generating augmentations for the instruction tuning dataset

yields a much smarter model at the end, one that can complete a much wider variety of downstream tasks. So I think that was a really smart innovation. And in the subsequent LLaVA models, the same kind of recipe for generating instruction tuning data has been followed, although the models have got a lot more sophisticated. The latest one, which I think is from a team at ByteDance, is called LLaVA-OneVision.

And that's sitting really, really well on the MMMU leaderboard, which we'll talk about in a moment. So it shows the recipe is still strong a year and a half later, and the class of models is really competitive. Yeah, it gives you a sense, too, for why the leading developers seem to think they'll be able to get around whatever data walls may be naturally occurring, because these results were achieved with largely synthetic data, at least when it comes to the final margin of

actually getting the thing to behave and be a useful image assistant. So, yeah. This next one isn't so interesting; I don't know, maybe we'll skip it. But at the time, in summer 2022, I was fine-tuning the text-davinci-002 series of models. They never actually released fine-tuning there; I think it wasn't very efficient, and they had other things they wanted to do with their resources. And GPT-4 was cooking, although I didn't know that at the time.

But we had this kind of preview access to that thing, and I was using it to process images in a similar way. One challenge I remember we had was that we wanted to be able to take a video and try to answer the question: first of all, what text is on the screen, but also, how long is that text on the screen? And for an arbitrary video, you could

take a still out of it and use OCR. And that's what we did. But then it was like, okay, now I've got 900 frames in a 30-second video. What is the right frequency to take these stills? And then you OCR all of them and have the language model process them. That data was so gnarly. But that was one of the first things where I had this experience of feeling like, geez, I could go try to get human annotators to do this and create some dataset and train on that. But actually,

what I found myself drawn to do, and I think this is actually still a pretty winning recipe these days for many things, was: I'll do ten myself. Then we'll fine-tune on that. Then we'll have the language model do the next hundred.

And then, hopefully, most of those will be right, we'll correct the ones that are wrong, and we'll fine-tune again. And in that way, you kind of bootstrap into something that could do this very gnarly task that humans were not very good at. Very unfamiliar data, not the kind of thing we evolved to really handle.
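As a rough sketch, the bootstrapping recipe described here boils down to a loop like the following; every helper passed in is a placeholder for whatever labeling tool, fine-tuning API, and model you happen to use.

```python
from typing import Any, Callable

def bootstrap_dataset(
    examples: list,
    label_by_hand: Callable[[list], list],              # you annotate a small seed set
    fine_tune: Callable[[Any, list], Any],               # returns an updated model
    predict: Callable[[Any, Any], Any],                  # model labels one example
    review_and_correct: Callable[[list, list], list],    # you fix only the wrong guesses
    seed_size: int = 10,
    batch_size: int = 100,
    rounds: int = 3,
):
    """Sketch of the annotate-ten / fine-tune / label-a-hundred / correct loop."""
    labeled = label_by_hand(examples[:seed_size])
    model = fine_tune(None, labeled)
    cursor = seed_size
    for _ in range(rounds):
        batch = examples[cursor:cursor + batch_size]
        guesses = [predict(model, x) for x in batch]     # model proposes labels
        labeled.extend(review_and_correct(batch, guesses))
        model = fine_tune(model, labeled)                # retrain on the growing set
        cursor += batch_size
    return model, labeled
```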

You could do it if you really stapled your pants to the chair, but it was tough. And it would have been really tough to hire out with any sort of quality. It was even tough for me to sit there and do the ten, or to inspect the next hundred. That was definitely a moment where I was like, this is going to be

transformative, because now we're hitting the point where, with the tools available, I can actually bootstrap my way into a pretty fundamentally new capability. And I can do it in a way that's way faster and way more affordable than having to go out and actually hire human annotators. And it was just like, if I'm figuring this out, I bet a lot of people are figuring this out, and there's going to be a lot of new capabilities built this way. What I was doing, in my case, was narrow; I wasn't trying to create a general-purpose thing. But even then, it was like: if I can create a new capability in a few days of work, as a sort of hybrid programmer-annotator, bootstrapping my way into this sort of thing, if I can do that in just a couple of days, that's about how long it would take me just to set up the alternative. It probably would have taken longer to set up decent infrastructure to collect the data, figure out what platform we were going to use to have the annotations done, decide who we trust to do this, and work out how to validate that they're actually doing it versus just filling in whatever and not caring.

And it was like, boy, I can bootstrap my way into that with just a day or two's worth of work, in many cases, once I got the hang of it. It was another one of these moments where it's like, yeah, this is going to happen across a very, very wide range of tasks. Yeah, and I think this idea of using AI models to generate your synthetic data for training the next generation.

We see it all over the place with the vision language models. There are examples of generating synthetic OCR data, for instance, in order to train them to be able to read corrupted images and know what the text is; that's kind of a classic one. I've seen LaTeX document generation as well: you generate the LaTeX document, render it, and since you know the text you started with, you've now got a pair.

It's all about trying to grow the datasets. And this is always very difficult, especially in the realm of vision language models, because you can't just deal with a single modality of data; there's got to be a correspondence

between the vision components and the language components, and the correspondence has got to be good. As you said back at the beginning, when we were talking about CLIP, when there's a lot of noise in the data, there's only so far you can go with the quality of the model at the end.

So one of the challenges, and something that lots of the teams spend their time doing, is thinking about how to create much larger, really high-quality pre-training datasets where you know you've got good correspondence between the visual things you want the model to learn and the language. So I think the LLaVA recipe, and by the way, you can get the instructions from Hugging Face, you can see exactly how it's built, it's all there,

is a really smart, really creative way to think about how you build out a sophisticated dataset. And then it has this dramatic effect, when you actually use it for instruction tuning, on the capability of the model to act as an assistant. It can do all of these things that previous vision language models couldn't do.

Well, we're nowhere near the end. I didn't expect this to be such a nostalgia fest. That'll lessen, I think, as we get closer to the present day. Honestly, the amount of time I've spent wrangling these weird idiosyncrasies has dropped

pretty dramatically as the foundation models have gotten better for an application developer like me. I've been very content to leave these sorts of weird things in the past, and in the next couple of models that you're going to get to through this history, we start to hit the point where it just starts to work. So let's keep going, and I'll probably be doing less of this "I remember when" stuff as we get to the present day. Time flies, that's for sure.

So I just mentioned at the end that the latest LLaVA model is called LLaVA-OneVision, and I said it ranks very creditably on this MMMU benchmark. So we should probably say a word or two about what that is. MMMU stands for Massive Multi-discipline Multimodal Understanding, which is something of a mouthful.

And I guess the easiest way to think about it is that it's the multimodal version of MMLU. I think it's probably the most interesting and relevant vision language model benchmark for just understanding how smart your VLM is and how much reasoning it can do. It's designed to measure three skills explicitly: perception, what can it see in the image; knowledge, what does it know about what the image is showing; and reasoning, what can it infer from the image.

And the way it was compiled is by a bunch of students from various disciplines and subjects at university, drawing questions from online sources, from textbooks, and even from lecture material. The idea is that each question they find has to require expert-level understanding of the domain in question; you need that to answer the question.

So they built up about 11,000 of these questions, and they range over about 30 different subjects, including things like history, medicine, electronics, market research, and music. Some of them also require mathematical reasoning to solve; for example, some of the problems involve Fourier transforms.

Each question says something like: here's an image showing harmonic intervals in a musical score, and then you've got four musical scores, and it asks which one of these is constructed incorrectly, giving four options. It always comes in fours in MMMU. So the model has just got to select A, B, C, or D. So it's a multiple-choice question format.
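For a sense of how a benchmark like this gets scored, here is a deliberately minimal sketch of a four-option multiple-choice harness. Real MMMU evaluation code does more careful answer extraction, and `model_answer` stands in for whichever VLM is being tested.

```python
# Minimal sketch of scoring a four-option multiple-choice benchmark.
# `model_answer` is a placeholder callable for the VLM under test.

def score(questions: list[dict], model_answer) -> float:
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{letter}. {opt}" for letter, opt in zip("ABCD", q["options"])
        ) + "\nAnswer with a single letter."
        reply = model_answer(image=q["image"], prompt=prompt)
        # Naive answer extraction: take the first A/B/C/D that appears.
        predicted = next((c for c in reply.upper() if c in "ABCD"), None)
        if predicted == q["answer"]:
            correct += 1
    return correct / len(questions)
```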

There's a lot required of the models to answer them. When they released the benchmark in November 2023, the top-scoring model was GPT-4V, which got about 55% on the benchmark, where random guessing would get you 25%. The top open source model was one of the LLaVA models we just discussed, the second model in the LLaVA series, and that got 34%. One interesting development since then, I guess, is that

o1, as you might not be surprised to hear, is now topping the leaderboard. So in just over a year, the leading score has jumped to 78%, and o1 is sitting a full eight points clear of the runner-up. So you can see that the benchmark is, I wouldn't say crushed yet, but a lot of progress has been made on it in the last 18 months or so. By the time we get done with this recording, we might find a new leader at the top.

That is the risk, because it's actually o1-preview that is right at the top there. Who knows what's been done to o1-preview between the preview and the full o1 release. But yeah.

It could be beaten. Also, we don't have scores for Gemini 2.0, and we don't have scores for the new Claude model. Actually, I might revise this, because I think I might have some further down. Looking at the page right now, I don't know how often they're updating it, but they do have o1, and they do have 3.5 Sonnet, though it's not clear whether that's the original 3.5 Sonnet or the new one. The original 3.5 Sonnet, yeah.

I actually think I've managed to find all the scores for those models, so I can tell you what they are later on. But anyway, yeah, o1 is standing top. What's interesting is that when the benchmark was first released, the team who produced it also looked at asking text-only GPT-4 to answer all of the questions. What they would do is extract any text they needed from the images via OCR, or get a LLaVA model to caption the images.

They'd basically give the text-only model the captioning that was extracted from the image plus the question, and see how it did. And they report a score from GPT-4 of 34% on the benchmark. The fact that that's above 25% highlights the important role that reasoning plays in answering the questions. So reasoning is really important.

And the rest, from that 34% up to whatever the models get, you can put down to their smart interpretation of the visual tokens and their ability to reason over them. So that's just a quick introduction to the MMMU leaderboard. We'll come back to it, because from now on we're going to score every model on it and see how it does. Should we crack on? Let's go. Okay. So I guess the next thing to talk about is pre-training with vision language models.

And what's been learned about that? What do people do these days? The model we'll use for examining the pre-training recipes is the Qwen-VL series, which is from the team at Alibaba. This series has two models, Qwen-VL and Qwen2-VL, from 2023 and 2024 respectively.

They're self-attention models. They follow the self-attention, autoregressive architecture, and they're using Qwen's language model, Qwen-LM I think it's called, as the backbone, with a vision transformer as the encoder.

To connect the two, they're using a single cross-attention layer. So they're not adding many cross-attention layers into the language model; they're using a single, standalone cross-attention layer which, similar to the Perceiver Resampler, has got learnable queries. If you've got an arbitrary number of visual tokens, can we compress them down to a smaller number, and then inject them into the language model decoder?

In this situation, this is really about using that mechanism just to reduce the number of vision tokens that go into the decoder. Remember we said that in the autoregressive architecture you've got to unroll all of those tokens; they become part of the attention that the decoder's got to compute. So if we minimize them, that's better. So that's what they use for connecting the two.
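A minimal sketch of that kind of learnable-query cross-attention resampler, with made-up sizes rather than Qwen-VL's actual configuration, looks something like this:

```python
import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    """Compresses an arbitrary number of visual tokens down to a fixed number
    using a single cross-attention layer with learnable queries.
    Sizes here are illustrative assumptions, not the real configuration."""
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_visual, dim), where n_visual varies per image
        batch = visual_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(query=q, key=visual_tokens, value=visual_tokens)
        return out  # (batch, num_queries, dim): a fixed-length sequence for the decoder

resampler = QueryResampler()
print(resampler(torch.randn(2, 1375, 1024)).shape)  # torch.Size([2, 64, 1024])
```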

If we talk about the training now for the Qwen-VL model, the innovation here is that they break the training into three stages. Rather than just doing general pre-training, what they learned, and what everyone's done subsequently, is that you actually want to break your pre-training into two stages and then do your supervised fine-tuning.

So in the first phase of pre-training for the VLM, they take their image-caption datasets and their interleaved data, so large quantities of data. They train the vision transformer and the connector module, but they freeze the language model. And in this pre-training, all the images are resized to 224 by 224, so they're using the natural resolution of the vision transformer.

Presumably the design is that you don't get that much detail in the images, but you're able to process a lot of them, and the vision transformer is of course small compared to the language model, so it's okay to unfreeze it. The cross-attention module is, again, small compared to the language model. So that's the first pre-training stage they do. The innovation is to add in a multitask pre-training step here.

And the idea is that for the second phase, you're going to unfreeze the whole language model. You're now going to allow the images to be a larger size as well: in the first training stage they were 224 by 224; now they're going to be sized at 448 by 448. And the vision transformer is fixed, of course. So what this means is that each image gets split into four tiles, and we've got many more visual tokens coming in to the second phase of pre-training.

And by multitask, they mean they're also going to be adding in synthetic OCR data, and then they're going to create a visual grounding dataset as well.

The idea there is to have a dataset where they've constructed lots of bounding boxes and references for objects in the image. They try to do this at scale, with a pipeline that builds out this second pre-training dataset with bounding boxes in it and references in the text to what the bounding boxes are pointing at. And then they add in visual question answering datasets as well.

By the time they started training the Qwen-VL model, a number of visual question answering and document question answering benchmarks and fine-tuning datasets had been released, some of them of considerable size, 10, 50, 100 thousand images in each case. So they actually added these to the pre-training dataset. So now they're representing much greater task diversity in the pre-training mix,

while slimming down the size of it somewhat. They also add in text-only data. The idea is that because they're going to unfreeze the language model, they're going to be mucking around with the attention mechanism now, so they need to keep that text-only data in there to preserve text-only performance. So the multitask pre-training is bigger than fine-tuning, but it's a smaller dataset than the original pre-training dataset.

And we'll see this theme again, because the multimodal data is scarcer, right? Pre-training like this, first on your lower-quality data, then on your higher-quality data, and then finally taking a much smaller supervised fine-tuning dataset, which involves a lot more manual augmentation of the images and using a self-instruct process to come up with decent prompts, is the way to do it. And that, for the Qwen-VL team, produced a really, really strong vision language model at the end.
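A sketch of that freeze/unfreeze schedule, with placeholder modules standing in for the real vision encoder, connector, and language model, might look like this:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Placeholder modules, not Qwen-VL's actual code; they just mirror the three parts.
vision_encoder, connector, language_model = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

# Stage 1: caption/interleaved pre-training at low resolution.
set_trainable(vision_encoder, True)
set_trainable(connector, True)
set_trainable(language_model, False)   # LM frozen to protect its text ability

# Stage 2: multitask pre-training on higher-quality, more diverse data
# (grounding, OCR, VQA, plus text-only data mixed back in).
set_trainable(language_model, True)    # everything unfrozen

# Stage 3: supervised fine-tuning on a small, carefully curated instruction set
# (all parameters typically stay trainable; only the data changes).
```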

Qwen2-VL, which is the latest one in the series, is sitting just behind the leading models from OpenAI, Anthropic, and Google on the MMMU, but it's sitting above pretty much all the other open vision language models. The small models are all open sourced. The 72B class, which is their leading model, is available behind an API, but they do say in the paper that they plan to release it at some point. Okay. But I think it's an interesting model because

they really broke this pre-training process down into two parts, mixed much more diverse data into the second, multitask pre-training, and it had a really dramatic effect on the capabilities of the model at the end. So let me just try to summarize the narrative of training a model like this. You start with a language model. Then you throw your high-volume but low-quality, or mixed-quality, data at it. And that is the

vision-language pre-training step, which comes after the original language pre-training step. And the main purpose of that is to get these things on the same page, right? We at least need to bring the latent spaces together. Once that's working, then you say, okay. And I guess, why wouldn't you want to change the language model at first? Because basically you have a language model that you're at least reasonably happy with.

And if you're doing backpropagation through the language model, you're like, well, I don't necessarily know what's going to change there, and I don't want to be changing things and potentially losing capability when I know what I actually need to change. What I need to change is the part that is mapping the images into the language latent space. So do that first, get to a decent place,

then open up training of the full model. Now it's like, okay, you guys are generally working well together; now all the constraints are off and we're going to backpropagate through the entire thing. And I think there were a couple of interesting details there, but one was definitely the need to continue to mix in standard text-only data, because again, you don't want to be over-indexing on this one particular task type.

I mean, what's nice about these things is they're super general, right? You don't have to give them an image; you still want them to be able to work as normal. So you've got to mix that text-only data in as you do this phase. But this is where they really cohere, or anneal, or whatever the right word is, into a single system that's all been trained end to end together.

And then, naturally, for the last step, the dataset isn't given out. That's certainly been a very common theme outside of vision too. It's just that this sort of

preference data or instruct data on which a lot of the frontier models are trained is highly proprietary technology that is expensive to generate. From a recent episode that I did with Nathan Lambert, who studies this post-training stuff deeply, one of the major takeaways from that conversation was him saying: in the absence of a frontier model like GPT-4 that we can use to generate the data that we then turn around and use for instruction tuning, we would have no way to get the volume and

quality of data that we would need to do this. I've probably referred back to that on about half of the episodes since, because I think it's a really interesting data point for where things are going, in terms of labs starting to clamp down on reasoning traces.

We've now also got Google's thinking model that is sharing its reasoning, and Chinese ones, too, that are sharing their reasoning. So the final chapter is by no means written. But it's really interesting to see how the data that powers that final instruct phase of training has to be so high quality and is so expensive to get, and is therefore so valuable, that it's typically not released. Meta, for all their openness, is also not releasing that kind of data.

And so people are currently left to try to generate it with GPT-4, or whatever, in the wild if they want to do something. But what is the future of that? Is that still going to be a viable option? Maybe, with Gemini showing its reasoning traces, but maybe not. We'll see how the dynamics evolve. But yeah, I think that's good. And it is remarkable, I always marvel at it, how Chinese companies are not far behind.

I don't know if you have a perspective on this. In VLMs, in fact. Yeah. Yeah, well... I don't know if now is the time, or maybe we want to do this a little bit later, but I'm interested to hear your thoughts on leaders versus fast followers. I think we in the West, using a couple of very broad terms, the broad "we" and the broad "West", seem to be, in my view, kind of overconfident about just how much of a lead we, quote unquote, have

relative to Chinese researchers. Even on some of these earlier papers we've been discussing, there are an awful lot of Chinese names on papers coming out of Western institutions. So that's a whole other dynamic here. We can either put a pin in it and come back to this, but I'm interested in how you handicap this field and how you think about

how we should even determine it. I mean, the Chinese companies are definitely more open, but it seems like typically, I would have said, your OpenAIs and Googles seem to get there a little bit first, but they're not open. So if you were to say the Chinese companies are the leaders in open models, I would say, yeah, that seems pretty apt. Leader overall, including the proprietary ones, I don't know if you would say the same or see it differently, but I'm interested in how you think about that head to head.

So we can answer that now, and then I'll offer a couple of comments on your observation about high-quality data in pre-training. The next model we're going to look at is the InternVL model. This comes from an organization called the OpenGVLab, which is based at Shanghai University. And it's probably the leading open source model. And the story behind

the InternVL series is about how you scale these things up: how do you make them bigger and bigger and train them efficiently? They're looking at efficient training and at larger models. And one thing I think is quite interesting is that all of the top open source VLMs are hovering around the 70 to 80 billion parameter mark in terms of size. The top proprietary models, we don't know; we don't know how big they are.

Probably bigger than that, maybe an order of magnitude bigger. For the latest model the OpenGVLab released, they looked at how efficiently they could train. They said they ended up training on 120 billion tokens, I think, of mixed data, and that this was very efficient compared to the Qwen-VL model we just looked at: the latest Qwen2-VL had trained on 1.5 trillion tokens.

Shortly after they released the model, they released a dataset which has 1.5 trillion tokens in it, so it's one of the largest out there. So you can see that they've gone to a 72 billion parameter model class, they trained on 120 billion tokens to get there, and the dataset they just released is ten times that size.

So my guess would be that they're planning to scale up very quickly and get to a much larger model, since they already have the top open source model and it's pretty competitive with everything else out there. I'm really interested to see what they come up with. And this, of course, is a team from a university, which is super impressive, without the enormous funding behind it that some of the frontier labs have.

On MMMU, yeah. Sorry, I'm being lazy; I'm talking about MMMU, which is my benchmark of benchmarks. We should probably talk a bit later about a couple of the other benchmarks, which are really interesting, but MMMU is kind of the big one, in the same way that MMLU might be the default benchmark you go to when looking at language models. Yeah. Okay, cool. My kind of rough understanding, and not necessarily rule of thumb, but...

I remember Llama 2 was trained on 10 trillion tokens, and I've always kind of rounded to say that it seems like the multimodality comes at about another 10% cost, like another 1 trillion-ish tokens typically. Do you think that's a good intuition? I mean, this InternVL one seems to be notably,

significantly less than that; it's like two orders of magnitude difference between the original full pre-training and the image portion of training. What would you say is normal? Is this in fact an outlier in terms of how small the dataset is? I think there's a new normal, because the datasets are getting much bigger now. At the start of this year, the biggest multimodal dataset you could train on would be LAION,

which is 5.8 billion images with text captions. It's been filtered for quality, so there should be high-quality captions, and they're all extracted from Common Crawl. There's a German organization that produced this dataset, and it is, I think, the largest image-caption dataset that's publicly available. But just in the last few months, we've seen an interleaved dataset called MINT-1T, which was released by a multi-contributor team, though Salesforce were behind it.

That one is a trillion tokens, and it includes not just HTML but also PDFs, arXiv papers, things like that. And the OpenGVLab, the team that produced the InternVL model, have just released a dataset called OmniCorpus, which is, again, another interleaved dataset. It's 2.2 billion documents sourced from Common Crawl dumps; they say it's got 8 billion images in it and 1.6 trillion text tokens. So those are just much bigger than anything that was previously available.

I would say you're probably about right now in thinking a trillion tokens is about right. Organizations like Meta do have access to more, and when we talk about Llama 3, we'll see roughly what the token counts for the vision model were there. But for open source researchers, the last few months have really seen the arrival of these huge interleaved datasets, which has jumped up the pre-training dataset size that's available.

Exactly. Okay. So we've looked at the Qwen-VL model and we've looked at the multi-stage pre-training, where we can use our lower-quality dataset at the point where we're training the vision transformer and the cross-attention layer. And we don't want to muck around with the language model at that point, because it's been heavily trained in the past and it's doing great things; we don't want to mess that up.

So we don't want to use low-quality images and captions while training that thing. We just train the vision transformer and the cross-attention connector. It's when we come to the multitask pre-training, where a lot more attention and care has gone into the dataset, that we allow ourselves to train the language model. And that, I think, was

just in response to the reflection you made earlier about why you'd do this in multiple stages. Yeah, it's going up the quality scale, down in size, and unfreezing more parameters as we go. And that recipe just works really well. I guess the last one for now that I wanted to cover is the InternVL series from the OpenGVLab, which is a team from Shanghai University in China.

One of the challenges facing open source researchers in particular when building vision language models is the training cost, particularly when you have to take a large model, unfreeze the parameters, and fine-tune it or do continued pre-training because you're now introducing this extra modality.

Across three models, or four models actually, they've looked at ways to improve the capabilities of open source vision language models. One of their papers was in fact called "How Far Are We to GPT-4V? Closing the Gap", and that was for their 1.5 model. But I'll touch on each of them, one after the other, because I think each introduces something really interesting and tells us something about where vision language models are going. Starting with the first, which is called InternVL:

they observed that everything to date had been using a vision transformer that's relatively small, and one that's been pre-trained separately from the language model it will eventually be connected to. If we start, for example, with the vision transformer from the CLIP model, it has of course been aligned to a text encoder, but that's a different language model from the decoder it will eventually be connected to.

So what they did is they started with a fresh vision transformer and scaled it up to 6 billion parameters, which is large for a vision transformer. ViT-Huge has around 600 million parameters, and I think the biggest of the standard vision transformers is ViT-G, which is about 1.8 billion parameters. So this is several times that size. And then they conduct a fresh contrastive pre-training of the model.

They use a frozen Llama 7B, the original Llama, as the decoder. And they train it the same way you would train a CLIP model: you take image-text pairs and feed each into its own tower, the text into the language model, the images into the vision transformer. You then take the hidden states from the Llama and average them; that's how you embed the text.

Then, at the end of the vision transformer, you take the output states and pool them somehow. And then, exactly as you do with the CLIP model, you do a contrastive pre-training step: you try to make the correct image-caption pairs similar and push the wrong ones far apart.
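That contrastive step is essentially the CLIP objective with a decoder supplying the text embeddings. Here is a minimal sketch of the symmetric InfoNCE loss involved; the pooling choices and temperature are illustrative assumptions, not the InternVL paper's exact settings.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched image/text embeddings.
    image_feats: pooled ViT outputs; text_feats: mean-pooled hidden states
    from the frozen decoder (both of shape batch x dim)."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.shape[0])   # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# The loss pulls matched pairs together and pushes mismatched pairs apart,
# which is the CLIP-style objective described above.
loss = contrastive_loss(torch.randn(16, 768), torch.randn(16, 768))
```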

So it's not so different from CLIP from several years back; it's just that we're now using a decoder, a Llama 7B, so it's going to be much more similar to the language models we then want to connect it to in our VLM, and we've also scaled up the size of the vision transformer. That was their first innovation. And what they showed is that this leads to a really high-quality and really well-aligned vision transformer.

And what they can then do is forget about the Llama model they used for the contrastive pre-training and connect the vision transformer to a totally different language model when they build a VLM, and it works really, really well. So that's the first innovation: train the vision transformer from scratch, doing the contrastive pre-training against something much more similar to the language model you're eventually going to connect it to. Cool. That was the first InternVL.

For the second one, the 1.5 model, they looked at image resolution. That was the big question there. We talked way back at the beginning about how images don't just come in 224 by 224 squares, or 448 by 448 squares; they come in many different resolutions. So if we've got a higher-resolution image, how do we get more out of it? What they did is develop a strategy they call dynamic high resolution.

What this does is take an image, which could be of arbitrarily high aspect ratio and resolution, and segment it into tiles of a fixed size, 448 by 448. The number of tiles used is based on the aspect ratio and resolution of the image, so they try to match the tiling configuration, 4x4, 4x2, 1x2 and so on,

to the image itself in its natural resolution. Then they encode each of the tiles from the larger image separately, and concatenate a thumbnail of the entire image to the end of the sequence. So now you've got your thumbnail and all of your high-resolution patches in one great big long sequence of visual tokens. At that point, they've got too many visual tokens.

So they use something called pixel shuffle, which is just a strategy for compressing that down. The way it works: imagine you've got a patch of a visual tensor for an image; it's got a width, a height, and a depth. The number of tokens you're going to get out at the end is determined by the width and the height of the patch.

So what they can do instead is resize that tensor so that more goes into the depth dimension, and then it can be split up into fewer patches. That's the pixel shuffle strategy. And because this dynamic high resolution ends up with so many tokens, that was the way to squeeze it down. Quite a few of the latest VLMs have done the same thing; Qwen2-VL now does dynamic high resolution too.

But this is quite an interesting innovation, because now you can essentially process images at whatever their natural resolution is, simply by generating more tokens. And of course, if you're using a self-attention, decoder-only architecture, you don't worry about whether you're putting in 100 visual tokens or 1,000, except for the expense of unrolling them in the decoder.

There's no hard constraint; you don't need a Perceiver Resampler that somehow squishes them all down to the same dimensionality at the end. So that's the innovation from the 1.5 model of InternVL.
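As best I can tell, the pixel shuffle step amounts to a space-to-depth reshape. Here is a minimal sketch with an assumed 2x2 factor, purely to show how the token count drops without discarding any values:

```python
import torch

def pixel_shuffle_tokens(feats: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Space-to-depth reshape: (B, H, W, C) -> (B, H/f, W/f, C*f*f).
    The token count drops by factor**2 while the total values are preserved."""
    b, h, w, c = feats.shape
    feats = feats.reshape(b, h // factor, factor, w // factor, factor, c)
    feats = feats.permute(0, 1, 3, 2, 4, 5)
    return feats.reshape(b, h // factor, w // factor, c * factor * factor)

x = torch.randn(1, 32, 32, 1024)     # 1,024 visual tokens
y = pixel_shuffle_tokens(x)          # (1, 16, 16, 4096): 256 tokens, 4x fewer
assert x.numel() == y.numel()        # nothing is lost, just moved into depth
```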

Something I just wanted to flag is that while we don't know too much about how GPT-4V works, we can look at pricing, and that gives us a pretty good indicator that it is probably very similar under the hood. You can go to the OpenAI pricing page and click the low resolution checkbox, and what you see is that 75 tokens is your base number of tokens for any image; that's the minimum you can get.

That would seem to correspond to the thumbnail in this scheme. And then, if you uncheck the low resolution box and start to increase the height and width, what you see is that it starts with one tile, and the max a tile can be, I believe, is 512 by 512. Once either the height or the width gets over 512, even if the other dimension stays small, now you're into two tiles.

You still continue to get charged those 75 base tokens, so it seems like they are probably doing something under the hood where you always have the full image as a single thing in low res, and then the question is how many higher-resolution tiles you're going to have, which depends on how big the image is that you feed it. You basically see this exact scheme reflected in the OpenAI pricing structure. So, interesting.
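As a back-of-the-envelope illustration of the scheme being described, here is a tiny estimator using the numbers mentioned above; the per-tile token cost is an assumption for illustration only, and the real pricing rules include resizing steps not modeled here.

```python
import math

# Assumed figures: 75 base tokens and 512px tiles, as discussed above.
# TOKENS_PER_TILE is a made-up placeholder; check the provider's pricing page.
BASE_TOKENS = 75
TILE_SIZE = 512
TOKENS_PER_TILE = 150  # assumption for illustration only

def estimate_image_tokens(width: int, height: int, low_res: bool = False) -> int:
    if low_res:
        return BASE_TOKENS                                # thumbnail only
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return BASE_TOKENS + tiles * TOKENS_PER_TILE          # thumbnail + high-res tiles

print(estimate_image_tokens(512, 512))   # one tile
print(estimate_image_tokens(513, 400))   # crossing 512 in one dimension: two tiles
```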

This is probably on me, because I was off confirming that as you were describing it, but can you tell me the pixel shuffle thing again? I didn't quite grok it. Yeah. So the explanation I gave was the best of my knowledge, because I didn't delve into it too much. But basically you have a height, a width, and a number of channels in any representation of an image, and the height and the width determine the number of tokens you're going to get out at the end.

So the idea is: can I just reshape the tensor such that there's more in the depth channel, and then I end up with fewer tokens at the end? Hmm. So in other words, if I go back to the OpenAI pricing thing: if I create a tall but narrow image, and a single tile can be up to 512 by 512, and I have a 513 by 150 image, the content could fit into that space, but in its natural orientation it doesn't. And basically what you're describing here sounds like a way of sort of

reshaping that so that it fits into the space. Which, I would assume, connects to data augmentation and that kind of thing; there's a ton of that, right? From the first days of just putting everything into the same 224 by 224, there's a long history of these programmatic manipulations of images to put them into some form which might be quite strange from a human visual system perspective,

but ultimately probably makes the AI more robust, and in this case also demonstrates that you can save tokens, save money, and save compute all at the same time. Yeah, I mean, as best I understand, it's about reshaping the tensors themselves once you've encoded them. You're actually shifting stuff from the X and Y dimensions into the depth dimension. And it felt to me, when I was reading about it, like something similar to what goes on in a U-Net,

where you're reshaping the tensors again and again, making them deeper and narrower. You're not actually losing information; you're just adjusting where it is. So you come out with something with very different dimensions, but if you multiply the dimensions out, it's still the same size. I guess the last of the InternVL models to look at is the latest one, which is 2.5.

So we've looked at dynamic high resolution, which was in 1.5, and at scaling up the vision transformer itself in the original 1.0 model, along with doing the contrastive pre-training against a much bigger language model. The third thing of note, in the latest 2.5, is what they call a progressive scaling strategy. The idea is to try to train

efficiently on a large number of tokens. What they figured out, because they built several classes of model, a 7 billion parameter version, a mid-size one, and a 78 billion parameter one at the top, is that they could align the vision transformer to the smaller language model first. They do the training process with the vision transformer and the smaller language model, and then they swap the smaller one out.

They introduce the next size up of language model and continue the training. Then they swap that out, bring in the biggest one, and continue the pre-training. What they find by doing this, progressively increasing the size of the language model backbone during pre-training, is that they reach convergence quite early on in the training process with the smaller language model.

And at that point, the vision transformer has learned a lot of what it needs to learn in being aligned to this class of decoder language model. So they can swap that out, put the next bigger one in, and continue the training. And it turns out it's much more efficient to do this than to start with a large language model and a large vision encoder and try to align them both at scale from scratch.
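In outline, the progressive scaling idea reduces to a loop like the sketch below; the `train_stage` routine, the backbone list, and the token budgets are hypothetical placeholders rather than the team's actual code.

```python
from typing import Any, Callable

def progressive_scaling(
    vision_encoder: Any,
    connector: Any,
    lm_backbones: list,                 # e.g. [small_lm, medium_lm, large_lm]
    train_stage: Callable[..., None],   # hypothetical training routine
    token_budgets: list,                # tokens to spend at each scale
):
    """Sketch of progressive scaling: align the ViT against a small decoder
    first, then carry the same ViT (and connector) over to larger decoders
    and continue pre-training from that good starting point."""
    for lm, budget in zip(lm_backbones, token_budgets):
        train_stage(vision_encoder, connector, lm, num_tokens=budget)
    # Only the text backbone is swapped between stages, so the expensive
    # large-LM phase starts from an already well-aligned vision side.
```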

In the paper, they report that with this progressive scaling strategy they needed about 120 billion tokens during the pre-training phase of the 2.5 model, whereas their nearest peer competitor, the Qwen2-VL model, had to process about 1.4 trillion tokens to reach the same capability. So it's a very much more efficient way to do it.

And there you go: that's three strategies. Dynamic high resolution for how the images are processed, training a much bigger vision transformer, and then this progressive scaling strategy. And it seems to be a winning recipe, because if we look at the MMMU leaderboard at the moment, the InternVL 2.5 78 billion parameter class is sitting just behind o1. And that means it's beating GPT-4o, the version from May this year,

the original 3.5 Sonnet, and Gemini 1.5 Pro on the MMMU. So it's an extremely successful recipe. It also seems to be near the top, by the way, on visual question answering and OCRBench as well, which are two other important benchmarks. That scale-of-data thing is really interesting, because it's not just that it's saving compute by starting with a smaller model; it's also using something like one twelfth of the dataset size. So you're saving on two dimensions.

And yeah, that's really interesting. It reminds me of an episode, one of my favorites, that we did with a couple of guys from Microsoft on tiny language models, which were models of, I think, tens of millions of parameters, really small, trained exclusively on these short, kid-story-type documents. And they looked at:

what do these things learn, and in what order? And it was like you could see that they were learning things like parts of speech, really structural elements of language, first. Then they gradually started to seem to have an understanding of nouns and what was what, and then started to become coherent. And at the far end of their process, they started to see what I remember as micro reasoning skills, which was like,

the farthest they pushed this. And again, these are very, very small models, but you could get to the point where it was like: Sally doesn't like soup, so Jimmy offered her

blank, right? And the earlier models would just put soup in again, because soup appeared once, so soup is probably likely again. And at the end of their training, they would start to see these micro reasoning skills, where it was like, well, it's got to be something else besides soup, given the full context of what has come before. I don't know that they looked at

progressive scaling from those really small models up. But do you have any intuition for why? I guess what I'm struggling for, and maybe there is no good answer right now, is that it seems like we're saying something here like: the small model is actually learning faster; the small model is more sample efficient.

Yeah. So let's imagine what we're trying to teach the models here, the same way you just described with the small language models. There are progressively more complicated kinds of understanding and modes of reasoning that you can learn as training continues.

And we can imagine that larger models have more capacity to learn more of these things. But a larger model, of course, has many more free parameters during training. So if you have a lot of free parameters, it's going to take you longer to find the right basins in the gradient descent that actually represent good capabilities, reasoning capabilities and understanding capabilities.

So the idea here, and I think this is why it's working, is that if you start with a smaller language model and start aligning your vision transformer to that, there is going to be some cap on the complexity of the tasks it can undertake. But because you've got fewer free parameters in the training setup, you're going to find good solutions sooner.

And at that point, you can use the larger of your language models. You've now got more free parameters, but you're already starting from a reasonable place in the search space. So you continue the pre-training from there. Now you've got more potential capability in the larger connected model, but you're starting from a good place, so you don't have to search as widely. That would be my intuition as to why the progressive scaling strategy works.

Yeah, that is quite interesting. There was a recent claim out of one of the Chinese companies, I forget which one exactly, but they basically said that they had trained a roughly frontier-class model at single-digit percent of the compute requirements of what they believe the leading developers in the West, quote unquote, had used. And this sort of thing could be a really interesting part of the explanation.

Yeah, that's a striking data point. I mean, the idea that you can do it with under 10%. Again, it's not just that the parameters are fewer and you save compute that way; it's a compounding savings, because you're also using far fewer data points. This would probably be relative to, you know, if they did Qwen2 full size for 1.4 trillion tokens, and this progressive thing only took

120 billion tokens, it would be like low to mid single-digit percent of the compute. So yeah, that's definitely worth a ponder. So we've come now to InternVL 2.5, which is the top open source model out there on the MMMU benchmark. And you may have noticed that since we talked about Flamingo, everything else we've talked about has been based on this autoregressive, self-attention architecture.

So one thing you may be wondering is: okay, does that mean the cross-attention architecture is dead? And the answer is no, even though most teams have opted for the autoregressive architecture, maybe just to be different. When the Llama 3 vision models were released earlier this year, Llama 3-V was based on the cross-attention model. They used a ViT-H/14, which we know what that is now: a 600 million parameter vision transformer.

But I think they also introduced some new cross-attention layers into the vision transformer itself, so they made some modifications to it. The report for the Llama 3-V model goes into a bit of detail about some of the modifications, but it doesn't give quite the same level of detail that you get in some of the other papers we've covered. So we know, for example, that they modified the vision transformer.

We also know that they used some very large pre-training datasets, and they did the same things a lot of other teams did, adding machine-generated OCR data into the mix. A lot of effort was spent on safety filtering, desensitizing, de-duplicating, and quality filtering of the multimodal datasets they used; that was a real big thing there. They also did some synthetic augmentation of their pre-training datasets, the same way we've seen the Qwen-VL team do.

So this is things like adding synthetic captions, generating tabular data, generating LaTeX documents, adding a lot more of this stuff at scale into the pre-training. And then, when they train the Llama 3 model itself, they add new cross-attention layers into it and freeze the rest of the Llama 3.1 model. So they only train the cross-attention layers and the vision transformer, both during pre-training

and during supervised fine-tuning. They also did DPO at the end as well, which I think is the first time I've come across DPO being used in vision language models, although I've seen it again recently, so it seems to be more of a thing now.
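A sketch of that selective freezing, where only the newly added cross-attention layers and the vision encoder receive gradients, could look like this; the name-based check is a hypothetical convention rather than how the Llama 3 codebase actually tracks those modules.

```python
import torch.nn as nn

def freeze_all_but_cross_attention(model: nn.Module, vision_encoder: nn.Module) -> None:
    """Sketch of the training setup described: freeze the language model's own
    weights and train only the vision encoder plus the newly added
    cross-attention layers. The "cross_attn" name check is an assumed
    convention; real code would track the inserted modules explicitly."""
    for name, param in model.named_parameters():
        param.requires_grad = "cross_attn" in name   # only new cross-attention layers train
    for param in vision_encoder.parameters():
        param.requires_grad = True                    # vision transformer stays trainable
```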

One of the advantages of this is that if you're just training the cross-attention layers, and you're not doing a full fine-tune of the Llama 3 model, you're preserving all of the capabilities of Llama 3, and you're not risking degrading them by introducing the vision component. And this might be why they decided to do this with cross-attention rather than self-attention. In the autoregressive architecture, we've really just got that projection matrix to play with, and once we've trained that as best we can to do the alignment of the text and the images,

if we want to keep improving the model, we've basically got to unfreeze the attention mechanism and train that in the decoder. Here they didn't have to do that, because they were using quite a large 90 billion parameter Llama 3, with roughly 25% of those parameters being the freshly introduced cross-attention layers. Training on just those is still a lot of free parameters.

So that's what they did when they trained the model. And the Llama 3.2, the 90 billion parameter version, is the second-placed open source model on MMMU at the moment, just behind the InternVL 2.5 we talked about. So it kind of goes to show that it really doesn't matter: you can build frontier open source vision language models using either of those two recipes. And that's what they showed.

Is there any performance difference, or any sort of practical one? I mean, I hear you on the rationale. And having seen just a tiny bit of Llama development from the inside, not really the inside, but I participated in a little safety review project for Llama 3, it was a lot of people moving in a lot of different directions. That's kind of how I would summarize what was going on there.

It's amazing on some level that the whole thing comes together. I suppose that's probably always the case, but out of OpenAI and Anthropic you see these small, focused teams, or at least that's the perception from the outside, whereas this felt like ten different projects going on at once. So I could easily understand and interpret this as a reflection of

just how much more sprawling the organization is, with multiple different goals, and maybe also thinking, geez, not everybody wants or needs vision; let's create that modularity for the open source community that's going to use this downstream as well.

All those things I get. Is there anything people should have in mind about the different architectures leading to different results, or does it really seem to be that either way works, and as long as you do a good job, you can't tell the difference after the fact? Yeah, so there are some ways in which the architectures lead to different results. Obviously, on the design side,

introducing the cross-attention blocks adds more free parameters to the vision language model than just introducing a simple projection matrix and going down the autoregressive route. So with a larger model like Llama 3, I guess they figure there are enough free parameters introduced by the cross-attention blocks that you can train your vision-language modality alignment and get a really good result just by training on the newly introduced blocks.

Whereas, as I've said before, the downside of using the self-attention architecture is that once you've got the best alignment you can out of whatever MLP you use to connect the vision tokens to the language model, you've then got to unfreeze the language model itself, and at that point you've got to worry about degrading its capabilities. So it could be that the decision was taken with Llama 3:

it's a big model, we want to preserve all of the capabilities of our language model, so let's not muck around with the language model. Let's introduce these new cross-attention layers and see if we can align it for vision language tasks. And if we do so, is it good? And the answer is, yeah, it's really good, provided you put a lot of attention into cleaning and curating the dataset, which they did.

And they used a lot of tokens as well, a trillion tokens or so in the training budget, which, if you've got a lot of newly introduced parameters to train, kind of makes sense. So the recipe works. There is some indication that the cross-attention model is not so good for things like OCR, and maybe also not for some forms of multimodal reasoning. For OCR, the

For OCR, the reason might be that with these new cross-attention layers, you've got to use something like a perceiver resampler or something else to fix the size of your vision tokens so that you can actually look them up. And a perceiver resampler is doing a bit of shuffling of the visual tokens at quite a fine grain, and this might be why it's affecting OCR performance.
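For reference, a perceiver resampler is essentially a small stack of cross-attention layers in which a fixed set of learned latent queries attends to the variable-length vision tokens. A rough PyTorch sketch, with illustrative sizes rather than anyone's actual configuration:

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
D_MODEL, N_LATENTS, N_HEADS = 1024, 64, 8

class PerceiverResampler(nn.Module):
    """Compress a variable number of vision tokens into a fixed set of
    latent vectors by letting learned queries cross-attend to them."""
    def __init__(self, n_layers: int = 2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(N_LATENTS, D_MODEL) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True),
                "ff": nn.Sequential(nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                                    nn.Linear(4 * D_MODEL, D_MODEL)),
            })
            for _ in range(n_layers)
        ])

    def forward(self, vision_tokens):            # (batch, n_tokens, D_MODEL)
        x = self.latents.unsqueeze(0).expand(vision_tokens.size(0), -1, -1)
        for layer in self.layers:
            # Queries are the latents; keys/values are latents plus vision
            # tokens, following the Flamingo-style resampler idea.
            kv = torch.cat([x, vision_tokens], dim=1)
            attended, _ = layer["attn"](x, kv, kv)
            x = x + attended
            x = x + layer["ff"](x)
        return x                                  # (batch, N_LATENTS, D_MODEL)
```

However many tokens the image produces, the output is always N_LATENTS vectors, which is exactly the "fix the size so you can look them up" property described above, and also the fine-grained shuffling that may hurt OCR.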

So a few teams have noted this. The cross-attention architecture is not quite so good when it comes to these tasks that require a really fine-grained understanding of small areas of the document. So that's kind of one downside. I've also read some authors speculating that

Multimodal understanding, and so reasoning, is actually better in the decoder-only, self-attention architecture as well. It's not really known why, but it seems that mixing the vision tokens into the same sequence as the text tokens in a decoder-only architecture makes it easier for the attention mechanism to reason over them. But again, nobody really knows why. These are just two minor findings.

And I think it's probably enough to steer people, in general, over towards the kind of decoder-only self-attention architecture, which I think is probably going to be the one that wins out. Though, as a caveat we'll come on to in a second, that might not quite be the case. The only other thing I can offer at the moment is I think one of the better AI podcast episodes of the year was from Latent Space, where swyx had Yi Tay,

previously and now once again of Google, but with a sort of middle period where he was involved with Reka, I hope I've said that right. And his take was basically that some of these architectures are a legacy of the fact that they were originally different teams. And that, again, probably plays out much more at a Microsoft or a Meta or a Google, where they had for a long time, you know, an organizational architecture or hierarchy or whatever that

had people focusing on different modalities before the great unification of all architectures showed up. And some of that organization persists, even when there now is a sort of unification of, or at least potential unification of the architecture. And so, yeah, you can maybe see some of these echoes where it's like, the language model team you know is moving on to the next thing and now you're kind of taking the baton on this, and this architecture is friendly to that, but it does seem hard

to imagine, and I know we've got joint pre-training coming up not too much deeper into the agenda, it does seem like at some point the bitter lesson has to come for this, right? I mean, there's no escaping that forever, presumably. Yeah, it does feel like, you know, the Llama 3V model does feel very different to everything else that's come out of

Meta or FAIR in the last couple of years. Like I said, they're focusing a lot on early fusion architectures, which we'll talk about in a minute. This does feel like a bit of an odd one out, going down the cross-attention approach. And not just an odd one out given Meta's research, but a bit of an odd one out given all of the models that have come out in the last year or so, where they've really been tending towards the autoregressive architecture.

But like I say, it could just be a feature of wanting to preserve the performance of Llama 3 on language modeling tasks by just introducing cross-attention layers and not having to muck around with the model's already perfectly well-trained self-attention mechanism. Yeah. Okay, cool. We've covered a handful of what I think are some of the more significant models, and by no means have we covered all the significant vision language models.

But we've looked a bit at the importance of some of the things that have been learned along the way: the importance of interleaved data at scale, the importance of data augmentation in the pre-training mix, and the staging of your pre-training as well. So start simple, gradually unfreeze more and more parameters of the combined model,

adding in all this augmented data of high quality as you go. We've looked at some of these mechanisms for processing high-resolution images, so tiling the image and then adding a little thumbnail at the end. We've looked at progressive scaling, that is, aligning your vision transformer and your language model backbone with a smaller language model,

and then once you reach a kind of plateau in your training run, switching it out for a larger one and carrying on. And we've also looked at the importance of task diversity in instruction tuning, which was shown by the LLaVA team, and just how important that is to getting your vision assistant at the end to be able to complete a wide array of vision-language tasks. And where we're at now is there's kind of these two architectures: the self-attention or autoregressive architecture,

where we're just injecting the vision tokens directly into the decoder stream along with the text tokens. And then we've got this cross-attention architecture where we're actually injecting fresh cross-attention layers into our language model. And then we're using those to look at the encoded visual tokens. And a couple of recent research teams have tried to do a more systematic comparison of these two architectures and said, okay, which is better?

And one of them was a team from Hugging Face, and they've built a series of a couple of models, which they call IDEFICS, I-D-E-F-I-C-S. And although their two papers were about building these models, really what the team was doing is exploring what makes vision language models work well. And that was really the thrust of their research. What they've done, what the Hugging Face team have done in their two-paper series,

is try to explore what happens if you take a particular decoder and you take a particular vision transformer, and then you try out both of the architectures and see what happens. So you've got the same training data. In other words, you're just keeping the experimental conditions the same, just changing how you connect the two, and looking at what happens.

These are some of the key findings from their research. Well, first off is that if you freeze the language model and you only train the newly initialized parameters, so these would be the cross-attention layers, or the projection matrix if you're using the autoregressive architecture, then the cross-attention architecture works a lot better and gives you better results.

That's perhaps not surprising, since if you bring in some new cross-attention layers, you've got more parameters there to play with than if you just have a simple projection in the autoregressive architecture. But then they say, when they try to update the language model backbone, the autoregressive architecture can perform much better. So one of the things they noted, in I think the second of their papers,

is that if you try to do a full update of the attention mechanism in the language model, they discovered they had training instability. So it's really, really hard to get the training to work. So they just switched to using low-rank adapters.
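For readers who haven't met them, a low-rank adapter in this context is just a small trainable bypass around a frozen linear layer. A minimal sketch, with illustrative rank and scaling values rather than anything taken from the IDEFICS papers:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained linear layer with a low-rank trainable update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # the pretrained weight stays frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Usage idea: wrap, say, the query and value projections of each attention
# block, so only the small adapters receive gradient updates:
#   attn.q_proj = LoRALinear(attn.q_proj)
#   attn.v_proj = LoRALinear(attn.v_proj)
```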

With the adapters there were no issues, and they were able to update the language model attention mechanisms, and in that case the autoregressive architecture was performing much better. Perhaps unsurprisingly, if they increase the size of the vision transformer or increase the size of the language model, both of those lead to a better vision language model. But if you had to pick one for a fixed parameter count,

they say you get more bang for your buck by increasing the size of the language model component than by increasing the size of the vision transformer. So there you go. One other finding, and we have kind of referenced this a lot: if you do add a perceiver resampler, which you can add in both architectures,

something like that is necessary in the cross-attention architecture, because you've got to fix the size of the lookup vectors that you're performing cross-attention over. But you can also have something like that in the autoregressive architecture, and the reason you might want to do that is to reduce the number of vision tokens that are actually being unrolled in the decoder. So you could do it for both.

And they find that if you introduce a perceiver resampler it does speed up training, but it doesn't necessarily have a positive impact on performance in the end. So that's interesting as well. And finally, they look at the interleaved image-text documents, which the Flamingo team and other teams have found to be super important. They performed an ablation where they left those interleaved documents out, and they found it had a dramatic effect on performance

of the model at the end, and that in particular, adding interleaved image-text documents like this seems to really benefit few-shot learning at the end. Those are some of the conclusions from the IDEFICS series of papers. And they're not the only ones to have a look at this. There's a recent model called NVLM that came out from NVIDIA a few months ago, and there they did exactly the same thing. So they trained three different variants of vision language models. Well, actually two initially.

Two variants initially. They used a common backbone. So they used the language model from Qwen, the same one that's used in the Qwen-VL series. And for the vision encoder, they used the InternViT encoder. So we talked about that earlier. That's the vision encoder at six billion parameters, so entirely new, kind of trained from the ground up along with a decoder.

So they train this at a larger scale with a decoder transformer to see if they get better performance from the vision transformer. So those are the two components used by NVIDIA.

They denote the two architectures D, for the decoder-only version, and X, for the cross-attention version, and they compare them both. And they discover, after training for a certain number of FLOPs, that the decoder-only version has the best multimodal understanding, the best reasoning over images, and the best OCR performance, but they also note that the cross-attention version was much more efficient to train. The reason for that, and we've kind of mentioned this earlier, is because

you have to unroll the full sequence of image tokens in your decoder and apply the attention mechanism across all of them if you're training autoregressively. So it's perhaps unsurprising that you get lower training throughput. So that's interesting. They report that the perceiver resampler affects OCR performance.

Negatively, that is, it negatively affects it. And the thinking here is that the way the perceiver resampler does its resampling is probably shuffling some of the information in the fine-grained spatial dimension, and that does seem to be hurting tasks like OCR, which require a very high-resolution view of the image. So that's their hypothesis as to what's going on there.

So I mentioned they did three architectures, and this is because the NVIDIA team then looked at, well, what if we did a hybrid of the two? What if we had cross-attention and self-attention? The idea here is trying to make it more efficient to train the model,

and that is by removing all of these high-resolution image tokens from needing to be unrolled in the decoder. So all of the high-resolution image tokens are now going to be presented to the model through the cross-attention mechanism. Then they take the thumbnail. So remember, if we're dealing with high-resolution images, we typically tile the image,

encode all the tiles to make the stream of tokens, but then also include a thumbnail, an overall view of the image as well. So they do inject that into the decoder stream. So this is their hybrid architecture. So it gives the self-attention mechanism of the language model a copy of the image to kind of reason over.

So remember this finding that the decoder-only architectures seem to have better multimodal understanding and reasoning. It seems to be something about co-locating the image tokens and the text tokens in the decoder stream that works really well. So it's still got that, but the decoder is then also able to look up via cross-attention the high resolution tiles from the image when it needs to.
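A rough sketch of what one fused layer of that hybrid idea could look like in PyTorch; the module layout, sizes, and names are my own illustration of the concept, not NVLM's actual implementation.

```python
import torch
import torch.nn as nn

D_MODEL = 4096  # illustrative hidden size

class HybridFusionLayer(nn.Module):
    """Thumbnail tokens ride along in the decoder sequence (self-attention);
    high-resolution tile tokens are only reachable through cross-attention."""
    def __init__(self, n_heads: int = 32):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(D_MODEL, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(D_MODEL, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                                nn.Linear(4 * D_MODEL, D_MODEL))

    def forward(self, seq, tile_tokens, causal_mask):
        # seq = [text tokens ... thumbnail tokens ... text tokens], embedded.
        # causal_mask follows PyTorch's attn_mask convention (True = blocked).
        h, _ = self.self_attn(seq, seq, seq, attn_mask=causal_mask)
        seq = seq + h
        # High-res tiles are consulted only here, so they never inflate the
        # decoder sequence length or the causal attention cost.
        h, _ = self.cross_attn(seq, tile_tokens, tile_tokens)
        seq = seq + h
        return seq + self.ff(seq)
```

The efficiency argument falls out directly: the quadratic self-attention cost only ever sees the thumbnail, while the tiles enter through a single cross-attention lookup.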

So you might think of that as a compromise, because it's clearly more efficient to train. And when I look through the results, you can see that the decoder-only version does still beat the hybrid version on OCR tasks, but the gap is smaller, as you might expect, than between the cross-attention and the decoder versions.

It beats it on chart understanding slightly. But what's really interesting is that the hybrid version actually beats both the decoder-only version and the cross-attention version on the MMMU validation set. Only slightly, but it does beat them both. So I think this is a really interesting kind of approach, this hybrid architecture, and it wouldn't surprise me if we saw it explored more in the future. So is there any way in which it's like strictly better? I guess, to check my understanding:

My super high-level summary of all of that would be: it seems like the decoder approach, where the images go in right alongside the text at the beginning and get the full workup, is the best performing. The cross-attention approach is a lot more efficient, but has some relative weaknesses, some of which are particularly idiosyncratic, but it's easier to do. A hybrid tries to get the best of both worlds. But if I'm, like, OpenAI or DeepMind and I'm trying to make the best thing I can make,

Is there any argument or result here that would suggest that this hybrid approach has a claim on, you know, in any way being the best thing you could make, or is it only appealing because it has the efficiency advantages? I would say it's too early to tell. So... The decoder-only architecture that NVIDIA put together, the D model, does win over the other two variants on, for example, chart understanding and OCR tasks.

But on the validation split of the MMMU benchmark, the hybrid version beats both the cross-attention and the decoder-only version by about a percentage point. Which is interesting, right? Now, there aren't any other hybrid models out there that I'm aware of, so it's unclear yet whether this is generally a better approach. But in this particular instance, on that one benchmark...

And indeed, right now, if you look at the MMMU leaderboard, you'll find that we've got the InternVL model, which is the top open-source model, we've got the Llama 3.2 vision model sitting right underneath it, and you'll find the NVLM hybrid architecture sitting just below those two

in terms of the open-source models. So I think it's a really interesting direction. From my perspective, the jury's still out as to how else you could vary this architecture and what might be best. So there's no convincing evidence that decoder-only is hands down the best way to go; I think this is an interesting data point suggesting maybe there's more to the story.

Interesting. I'm always amazed by how simple the transformer is. That's just a recurring point of amazement where I'm like, the tangled mess that is my own brain has all the feedback loops and everything, but basically you can get as far as we've got with none of that. So I guess I on one hand have sort of a strong prior that that'll just keep being the best, because it's been the best for a while. But then there's another part of me that's like,

surely a more complicated architecture can work better, or we would presumably have evolved in a simpler direction ourselves. So yeah, I don't know which one of those should dominate, but presumably... I mean, the more I think about it, it seems like a more complicated architecture, just given how many possible versions of that there are, not to say this specific thing that has been tried here, but

in general, it seems like more complicated architectures have to be better in some way, shape, or form. You've got to find them, though, and you've got to make them performant on the compute. So those are huge advantages, obviously, that the current, remarkably simple architectures have. But yeah, that's cool. One other thing I was laughing at, I think you noted at the very beginning of this section, was that the paper from

Hugging Face was called "What Matters When Building Vision-Language Models?" And I would submit naming does matter somewhat. They came up with IDEFICS, one and two, and as you had to spell it out: what is that? Eye defects? Whatever that is, it's not popping off the page to me, and I think it's going to have a hard time, in the jungle of models out there, standing out without a little bit of a catchier name. So, as a reviewer who never actually read the full paper, looking at the title, I would say,

you know, there's a question posed that they could have perhaps answered a little bit better with a better-named model. But that's just a funny reality across all of AI right now. Everybody... I feel like the o3 naming is in some sense totally insane and in some sense kind of perfect for the moment that we're in. So they're certainly in good company for naming their models in a strange way. Yeah, I mean, there may also be trademarking issues with o2.

Yeah, I heard that. Who's got O2? I guess maybe they can't trademark it just because O2 is such a common thing. There's a big mobile provider called O2. Oh, interesting. Which could be one of the reasons. I was totally unaware of that. I was thinking that maybe

Because it's oxygen, you can't trademark something so commonplace, but I guess maybe it's running the other way. In any event, it is hilarious that they just introduced o1 and now we're on o3 and there is no o2. But yeah, somehow it does feel... So there's two other things I wanted to pull out from the NVIDIA paper, in ascending order of interest for me. The first is a really interesting result halfway through where they looked at some of the leading open-source multimodal

language models, the VLMs really. One's a LLaVA model, one's an InternVL model. And they noticed that if you run text-only benchmarks, I don't remember exactly which benchmarks they ran against the models, oh yeah, they ran MMLU, MATH, HumanEval, a couple of others. They found that in the VLM models, there had been a drop in text-only benchmark performance as compared to the language model backbone

originally. So for example, with the LLaVA-OneVision model, whichever language model decoder they used originally, they knew what the benchmark scores were, and then they just repeated them with the vision language model and found a drop. And they found this drop consistently amongst all the ones they tested, except for the Llama 3 series.

And if you'll recall, when we mentioned the Llama 3V series, they introduced all these cross-attention layers and freshly trained them, but they froze the rest of the Llama language model, so they didn't suffer that degradation. However, to kind of counter this, the NVIDIA team spent a long time building what they considered to be a really high-quality, text-only supervised fine-tuning dataset.

And as a consequence of this, what they saw was an improvement on all of the language-only benchmarks from the NVLM series of models. That is, as compared to what the decoder they used was scoring before. So that's very interesting, right? An improvement in the text-only performance after vision language training.

Possibly this could just be down to it having seen an additional large dataset which it hadn't seen before. But one does wonder whether there is something about the interleaving of the two different modalities which is somehow causing the transformer, or the overall model, to be able to reason better.

So I think that's a very interesting finding, and it's kind of backed up by the fact that they saw this improvement particularly in math. And they note that in their multimodal fine-tuning dataset, they had an awful lot of maths questions, like geometry questions, for example.

And they think it improved the model's ability to do numerical and mathematical reasoning, even on text-only benchmarks, even though this additional training data came in the form of images. So I think that's a very interesting finding. Yeah, let me make sure I have that clear, because there's two things there. One is that the text-only performance usually degrades if you fine-tune on an image dataset

without maintaining text-only data in the mix, so they did that. And then in this finding where they see an improvement in the text-only math and coding benchmarks, that is the same setup, where they're continuing to have some text-only data in that mix? Yes. If you perform supervised fine-tuning of a vision language model, you'll see a degradation in text-only tasks.

And we know this from large language models themselves, you know, if you do an additional round of fine tuning on a particular task and you want to preserve the capabilities of other tasks, you have to mix in fine tuning data just to preserve that.

They're not the first team, and this is not the first model, where text-only data has been included in the supervised fine-tuning, but they seem to really go to town on this and draw it out in their paper as something they really paid attention to, having a large text-only fine-tuning dataset.
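Mechanically, the mixing idea is as simple as it sounds; a toy sketch, with an arbitrary illustrative ratio rather than anything reported by these teams:

```python
import random

def build_sft_mixture(multimodal_examples, text_only_examples,
                      text_fraction: float = 0.2, seed: int = 0):
    """Keep a slice of text-only SFT examples in the fine-tuning mix so that
    language-only skills don't degrade during vision-language fine-tuning.
    The 20% default is an arbitrary illustration, not a recommended value."""
    rng = random.Random(seed)
    n_text = int(len(multimodal_examples) * text_fraction / (1 - text_fraction))
    mixture = list(multimodal_examples) + rng.sample(
        text_only_examples, min(n_text, len(text_only_examples)))
    rng.shuffle(mixture)
    return mixture
```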

As we mentioned, the Llama 3V models saw no degradation, but then they froze the language model backbone. And the NVLM models actually saw an increase in their text-only performance, and they saw this particularly on the maths datasets. The introduction of mathematical questions in image format has improved the model's mathematical reasoning overall, with the result that, on text-only maths questions, it now gets higher scores

than the language model backbone did before. It kind of recalls the legendary Roon tweet about shape rotators and wordcels. It's like the language model, with the addition of the vision modality, has perhaps gained a shape rotator capability that it didn't have when it was text-only. Yeah, it certainly makes sense. I mean, if you had never seen a drawing of a triangle, and only, you know, I'm kind of imagining your SAT-type math problems here,

if you never saw any of those diagrams and were forced to do it purely through text tokens, that would be weird. You know, I think there's a clear reason that we do that for ourselves. So it is interesting to see a similar pattern popping up here. Yeah, it's perhaps not surprising. It's just really interesting to see it actually kind of verified in the research.

And it brings me back to one of the questions that we posed at the very start, which was: to what degree is multimodal understanding important for achieving the levels of intelligence, let's say AGI, that we want in the future? Well, here to me is a small data point suggesting, look, there are real benefits to doing this. It doesn't say it's necessary. It does say it helps. So I think that's the interesting implication of the result. Yeah. Yeah, that's something

worth a little meditation. I mean, you can imagine that that can go a lot of different directions. That seems like it will be a huge trend. And we're already seeing, of course, more modalities beyond just image being added, video, audio, et cetera, et cetera. I've often wondered

just how far that can go. Like, are we going to see natural language models that also get trained on biological sequence data? Because that's a whole other track that I've been quite obsessed with recently, where there's been all this training on pure sequence data. And in a way, it's really cool that there's not natural language there, because I think it sheds really interesting light on how things are working when it's like,

It's picking this up from all the things that it's learning about proteins and so on. It's learning those from raw data in a way that is not mediated by human understanding in a lot of those scenarios. And I think that for me has been extremely clarifying when it comes to questions of, can these things learn new concepts on their own that humans don't know?

It's hard to determine that through natural language models, because everything's kind of out there somewhere, and you know, what's interpolation versus genuine out-of-distribution, whatever. But when you see these higher-order concepts emerging from pure DNA sequences or pure amino acid sequences, it seems like, okay, there's something really there that is pretty undeniable. And then I wonder, you know, do all these things sort of converge? I mean, it seems like the global maximum probably is

a model that, and this may not be economical, there may be all sorts of reasons why it doesn't happen in the immediate future, but it seems like the global maximum has got to be something that is just literally trained on everything, you know, and has.

the text and the image and all the way out to, like, weather data, and just has this very robust sort of all-to-all understanding. And yeah, it's just one little data point that suggests that that's true, but I feel like, big picture, it's hard to see how that wouldn't ultimately be the case. Yeah, part of the story, we kind of said at the beginning that vision language models are a really interesting route into understanding how you actually

exploit the relationships between two different modalities to improve reasoning. And some of the story we've told here is about what has been done over the last couple of years to get better and better at doing this. And I don't think we're at the end of the road yet. We've seen some interesting ideas so far, particularly this hybrid architecture from NVIDIA, which suggests there may be more droplets you can squeeze from the lemon.

in terms of getting more efficient cross-transfer of information between the modalities. So it's really interesting, and there's been a tremendous amount of progress in two years in this one area. Yeah, this cross-attention thing would also probably be the way that it happens if you were going to try to integrate

a protein language model or whatever with a natural language model. It seems like you could do the end-to-end thing, but for starters, you would probably grab two that work and try to make them talk to each other, you know, somewhere in the middle layers where all the representations are already there and you have a lot of confidence that you're working with

you know, things that have their own understanding respectively and just trying to bridge those in a useful way. Especially, you know, you can imagine too, as you have like lots of these modalities, it'd be an interesting question to... to try to figure out like, even if the

Yeah, and it's not even really that the end-to-end thing is strictly best, because this hybrid one is really competitive. But I can also see lots of reasons that you might do it for convenience, right? And we talked a little bit earlier about like...

how to some degree, these architectures are legacies of team structures. And, you know, if you break that out now over lots of different modalities, and you've got like a whole different universe of people working on biological models, then it might be really hard to

redo everything from scratch or get the data mix right; all those sorts of things could be really hard. But if you have things that are working specialists across these domains, then I could see the cross-attention approach being a really natural way to bridge them without having to go back to the drawing board as much. Yeah, so a term we haven't introduced yet in this conversation, but which we could introduce now, is early fusion versus late fusion.

And so what this is describing is at what point in your information processing architecture do the modalities come together? And in everything we've discussed so far, the answer is we've got two separate encoders. We've got a vision transformer and we've got, I guess, a tokenizer for the language model. So we're encoding the two modalities separately and then we're fusing them in the architecture.

But we'll discuss something in a second where we're looking at early fusion, which is, okay, can we get one thing to encode them both, such that the modalities are aligned right at the start of the journey through the model? And I don't want to speak out of turn, I'm not a neuroscientist, but my bet would be, if you asked a neuroscientist how the brain works,

the answer would be there is early fusion and late fusion, and probably multiple integration points for different modalities of data through the information processing pathway. And it may well be that the situation AI eventually finds itself in is that you similarly have these multiple points of fusion. We've already seen that with the NVLM hybrid architecture:

we've got two points of fusion there. We've got a cross-attention lookup, and we've got an image thumbnail being added to the decoder stream. So we've already seen a first example of that. So one of the trends I would expect to see is that we have early and late fusion going on in future architectures. But I'm definitely not smart enough to say what they'll look like.

Yeah, when people talk about early and late fusion, to what degree is this a statement about the pre-training process versus the architecture? I've always been a little bit confused about that because... you know, in the hybrid, the early fusion and the later fusion are still with like

separately trained, separately pre-trained modules, right, that are kind of learning each modality. And then there's a question of, do I want to inject that image data at the beginning of the transformer, or do the cross-attention thing in the middle? But then, I don't know, it feels like you could sort of make a two-by-two matrix of these, where you could have, you know... I don't know about the fourth box of that matrix, but

joint pre-training is like another thing, and I'm kind of trying to untangle that concept from early and late fusion. Yeah, so first off, I don't think there is a settled and canonical split in architectures where you can say late fusion means this and early fusion means this. Think of it more as a continuum. I would say everything we've looked at so far is late fusion, and that includes CLIP, because we are encoding both modalities separately and then aligning them.

And so for that reason, they are later than what you could consider very, very early fusion, which would be we have one thing that encodes both our text and our vision at the same time. So we have one token space, if you like, right from the get go. Right. And we will see an example of that before we finish today. Okay, cool. Well, let's keep rolling. Yeah.
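To make the two ends of that continuum concrete, here is a toy PyTorch contrast; all of the modules, vocabulary sizes, and dimensions are placeholders for illustration, not any particular model's design.

```python
import torch
import torch.nn as nn

D = 256  # illustrative hidden size

class LateFusionVLM(nn.Module):
    """Separate encoders per modality; fusion happens inside the decoder."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 16 * 16, D)   # stand-in for a ViT patch embed
        self.text_embed = nn.Embedding(32000, D)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, 8, batch_first=True), num_layers=2)

    def forward(self, patches, text_ids):
        v = self.vision_encoder(patches)                  # (B, n_patches, D)
        t = self.text_embed(text_ids)                     # (B, n_text, D)
        return self.decoder(torch.cat([v, t], dim=1))     # fuse late, in-decoder

class EarlyFusionVLM(nn.Module):
    """One shared token space: image patches arrive as discrete codes drawn
    from the same vocabulary as text, and a single transformer sees one
    unified sequence from the very start."""
    def __init__(self, vocab=32000, image_codes=8192):
        super().__init__()
        self.embed = nn.Embedding(vocab + image_codes, D)  # shared codebook
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, 8, batch_first=True), num_layers=2)

    def forward(self, unified_token_ids):                  # image codes + text ids
        return self.decoder(self.embed(unified_token_ids))
```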

So I just wanted to say a couple of words, before we move on to a really interesting benchmark, on a couple of the benchmarks that are very important at the moment and whose training splits are also being used as part of the fine-tuning datasets in a lot of the models we see. And so the first one is the VQA benchmark.

This is a reasonably sized dataset; I think there's 250,000 images in it overall. I don't know what the train-validation split is. But the images were taken from COCO, so that's Common Objects in Context, and possibly from other sources too. But the idea is that each image is associated with an open-ended question about the image, which was generated by Amazon Mechanical Turk workers.

Each question is supposed to require an understanding of both the vision and the language, and some common sense, to answer. And then, if you're executing the benchmark, you're given multiple questions per image and multiple answer options per image.

So they reckon there's about a million questions in the dataset overall. This has become very important, as I said, as part of the fine-tuning mix. So just to put this in context, let's take an example from the VQA dataset: a photo of a woman who has got a pair of bananas drooping down from her upper lip like a moustache. And one of the questions is, what colour are her eyes?

And so I guess the expectation is that the VLM, or the vision model, is going to lock onto the yellow in the middle of the screen and answer yellow, which is not the right answer. But then it also asks, what is the moustache made of? So again, this is requiring it to know where a moustache sits on the face, what kind of shape a moustache has, and then what objects are performing that role in this image. So that's kind of an example from the VQA dataset.

So, common sense and reasoning over images. Very different in flavour to what's in MMMU. So if reasoning over images is the thing you wanted out of your VLM, the MMMU benchmark is great. If what you were looking for was more an understanding of common objects, their relationships, what's going on in an image, the VQA benchmark, visual question answering, might be what you're looking at. There is also a kind of variant of this called DocVQA.

Again, it's got a large training set, so it's often found in the fine-tuning mix. That's about 50,000 questions that have been generated over about 12,000 images, and these have been extracted from a dataset described as industry documents. Okay, so this is things like PDF scans of documents from workplaces: charts, graphs, tables of data, invoices, business infographics, and handwritten notes, that kind of thing.

And the tasks in the benchmark are to isolate and report precise spans of text from the images that answer a question. So for example, it could be, what is the number on this invoice? And then that's followed by a PDF scan of an invoice. So this is an interesting benchmark, because this is the kind of thing that a lot of people want to use vision language models for,

right, processing scans of documents. And if that's the use case that one really cares about, then performance on the DocVQA benchmark is the one to look at. And just as a final word on fine-tuning and instruction-tuning datasets: the Hugging Face team, of the unpronounceable IDEFICS model fame, have bundled 50 of these fine-tuning datasets up together. Much better name this time. Yeah, they've called it the Cauldron.

So the Cauldron is obviously available on that platform, and it's probably the easiest way to acquire a good fine-tuning dataset. And the reason I mention this is because we've talked many times about how augmentation of images in datasets has been really key for actually learning the alignment between the modalities. If I was starting with a task which was going to require a vision language model and I was struggling a little bit to get the performance I wanted,

one of the things I would do is look at the Cauldron. I would find a task that seems to be similar to the one that I'm doing, and I would look at the augmentations and the prompt structure from that particular task dataset. And I would ask, is there any way I can do augmentations on my own images, or can I restructure my prompts such that it looks like this dataset?
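In practice, browsing a subset looks something like the following; the repo id, subset name, and field names are written from memory here, so treat them as assumptions and check the Hub page for the dataset.

```python
from datasets import load_dataset

# Assumed hub id and subset name for the Cauldron; verify before relying on them.
ds = load_dataset("HuggingFaceM4/the_cauldron", "docvqa", split="train")

example = ds[0]
print(example.keys())

# Each record (if memory serves) bundles one or more images with a list of
# user/assistant turns; inspecting the prompt phrasing and augmentations is
# exactly the exercise described above.
for turn in example["texts"][:2]:
    print(turn)
```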

Because whether it's just at inference time or whether you're going to build your own SFT dataset, this would probably be the most informative and useful way to go about it. So that's a cheat code, in my view. Shall we linger on the Blink benchmark for a minute? Because this is fascinating. So we talked about MMMU. I've just mentioned the VQA and DocVQA benchmarks, which I think are super important.

Blink is a really interesting one that was produced by a mixed academic and Allen AI team earlier this year, July, I think. Blink contains just under 4,000 multiple-choice questions, and it ranges over 14 common perceptual tasks. These are tasks that they say humans can solve in a blink, but which should be difficult for VLMs. Across these 14 tasks, human performance is in the mid-90s percent, and because they're all multiple-choice questions, random guessing gets you just over a third.

The thinking behind the Blink benchmark is that lots of the questions on MMMU are actually about reasoning, and the authors describe them as almost reducing to a dense captioning task: can you just extract a dense description of what you see in the image here? Another way of saying that is, if you were to replace the image with a rich description of what's in the image, a language model should still be able to answer a lot of the questions.

And indeed, when we mentioned MMMU, one of the baselines that the MMMU team created was to caption with a LLaVA model, and they saw that you could do much better than random guessing just from the captioning. So the interpretation the Blink team made is that a lot of what MMMU is testing is reasoning, and that less emphasis is being placed on classic visual perception capability.

And one piece of evidence for this could be that if you look at the jump in MMMU performance between GPT-4o and o1, there's been a huge jump. And if that's all attributable to reasoning, it suggests that that is a lot of what the benchmark is measuring. So we should probably talk a little bit about, or introduce, some of these categories: what do the Blink authors think is meant by perceptual reasoning?

So I'll just cover a few of the categories: first a couple that the VLMs seem to do really well on, and then a few where they seem to do really poorly. Okay with me so far? Yeah, I like this, because this definitely calls to mind some of the challenges that we've had with aesthetics, which I mentioned earlier as well. But again, in the Waymark context, it's not enough just to know the content of an image. We want to make our users look good. The early versions of this were basically...

First, they were not dense captions; they were very sort of uninformative captions. You know, in the early version, there was once upon a time when the Microsoft API was the best, and then BLIP became the best for us for captioning. But it was still, you know, this is two years ago, we're back to the beginning of the outline. So, sparse; then we got these denser captioners. Those were much better for at least getting the relevant

content, but still no signal for a while on what actually looks good. And that has definitely notably changed. And I think these subtasks within the Blink dataset that you're going to take us through are a really interesting way to interrogate how exactly they have changed, and to reverse-engineer, to some extent, what the developers have done to add these capabilities when you look at these challenges. So yeah,

a very interesting kind of look behind the curtain. Yeah. Well, if we peel back the curtain, then, on the best-solved Blink task, it's what they call art style. Okay. At the time of recording, the data I can find says the model that solves this best is GPT-4o. We don't have, I think, o1 results on Blink, and neither do we have the latest Gemini or Sonnet, I believe.

So take it with a little pinch of salt when I say what was best, but it gives you an understanding of, you know, as of June this year, what the rankings looked like. The art style task: the idea is you've got three paintings, for example. One is the prompt, which is a sketch, and then you've got maybe two others in two other art styles. And the question is, which of the two, numbers two and three, seems to match the prompt image in terms of visual style?

Okay, and if we're human, this is very easy. One can look at this and say, oh, you've input a sketch; there seems to be some kind of Renaissance painting, the second one; and then the third one, I'm not quite sure what the style is, but it looks a bit like a sketch to me. So I'm going to say it's the second one. And I know nothing about art, and yet I can solve that very quickly.

GPT-4o does really well on this, gets about 83%, and is the top-scoring model. Humans are at 95%, and random guessing would give you 50. Okay, so art style seems to be solved reasonably well, I would say, by the current generation of VLMs. Another similar one is what they call visual similarity, which is also solved rather well. So here you might get a series of two photographs and then a reference photo.

And the question is, which of the two photos is most similar to the reference? In the paper, they show two waterfalls and then another waterfall. The reference image is taken from the same perspective as one of the images, and the other image of a waterfall is taken from a different perspective. So I immediately pair the correct one in my head. Humans indeed get close to 97% on this, and here GPT-4 Turbo was the winning model at 80%. So those are two of the well-solved ones.

What I think is really interesting is the worst-solved Blink tasks. So of these 14 categories, which ones seem to have the biggest gap at the moment? One of those is the IQ test. You've seen the kind of images you get in an IQ test: the example given in the Blink paper is one where you've got a simple diagram with a sequence of shapes in it, and then you're asked about the sequence.

So I've got three in a sequence, and then can you complete the sequence? Pick the image, out of these four, which seems to complete it in the same way the first sequence is completed. If you've seen IQ tests, you've seen lots of variants of this before. Humans get this right roughly four-fifths of the time. Random guessing in this particular area of the benchmark, with four options, is 25%.

And GPT-4 Turbo was the best-performing model with 32.67%, so not a lot greater than random guessing. When I look at this Blink task, I am reminded a lot of the ARC-AGI question. And I don't know if that's a lazy analogy or an ill-informed analogy on my part, but I seem to solve ARC-AGI challenges in much the same way I solve these IQ tests.

Am I doing guided program search based on perceptual priors? Because that kind of makes sense as an explanation as to why I can do it in a blink and the language model is struggling. So I think it's a really interesting one. And my interpretation is that this really backs up a lot of what François Chollet says about how he designed the ARC-AGI benchmark. Yeah, maybe just two interjections from my experience. One on even the well-solved tasks:

I would say, if you're actually going to do something like this in an app, the way you ask definitely still matters, if only for avoiding unnecessary refusals. Like, I found that GPT-4o, and Claude 3.5 probably similarly, really doesn't want to tell you you're ugly. And that's, you know, presumably a reflection of its reinforcement learning from human feedback, and Claude's virtue ethics, right? It's trying to be a good friend to you, so

it sugarcoats. Sometimes if you ask, you know, is this image an attractive image, for example, it will often hedge and say, well, it's sort of in the eye of the beholder. So there's a couple of different ways we found to get around that, kind of similar in some ways to, you know, if you want a model to talk to you about a medical question and you're getting the sort of "I'm not your doctor" response. I always get around that by saying,

I'm preparing for a conversation with my doctor and I want to be as informed as possible, and then it will basically let its guard down and help you. And in these image aesthetics questions, if you ask the tersest version, you know, "rate this image one to five on beauty" or whatever, it will sometimes balk at it. But if we say, is this an image that a small business would be proud to put forward in their marketing, then you are much better able to get a result. And I've also seen kind of similar things with

even just pairwise-type things. So, you know, I took a couple of selfies of myself and my kid, and in one of them I made kind of a, whatever, contorted face that was clearly not a great picture. The other one was much more normal. And

just saying, like, is this a good picture? It really doesn't want to shoot you super straight. But if I put both in and I said, which of these should I send to my wife, then it will say, you know, I think you're better off with the second one or whatever. So play around with that stuff if you are trying to get it to work. Like everything else, prompt engineering is definitely

still a bit of a thing, especially if you're kind of asking it to be a judge. You know, it's comfortable judging in some ways, and it's really not comfortable judging in other ways. Not to anthropomorphize it too much, but that intuition definitely can be helpful. On the ARC-AGI thing, I was also just really struck, back when that was dominating the discourse not too long ago, by how weak even the best models were when it came to simply describing the ARC

images. I just took a few screenshots right off their website. But then, man, I didn't ask it to solve the problem. My first thing was just, can you see what this is? Can you describe it? And it was very not good at that, actually. Not good at even the most basic stuff of, like,

how many squares, you know, what is the dimension of the grid? I was really amazed by how weak that was. Presumably that just reflects a lack of that sort of thing in the training data, but it's still pretty surprising, because it's pretty good at reading tables, and, you know, they can do OCR reasonably well. And you would think that it would be able to count the squares, but it was

really weak. It was interesting to see, too, just in the last couple of days with the o3 results, that as far as I know they weren't doing any image input and were literally just presenting the text, which is, you know, the actual underlying data beneath ARC is literally just arrays of arrays, right? They literally just give you numbers, and all of the sort of color representations that we've seen on top of that are basically a gloss for humans, to make it easier for us, because we're obviously good at colors and stuff

and recognizing these shapes when they're in contrasting colors; that plays to our strengths. You don't have to do that, you know, for the ARC thing, and it turns out that o3 seems to not be doing that and is literally just reasoning over text tokens. So yeah, what's your theory? Do you have any other sort of deeper theory on why these IQ tests are so poorly solved?

I am not smart enough to be a theoretician, but I've just noticed, to me, it just reminds me... Speculation. Let's call it a speculation. You're smart enough to be a speculator. It reminds me so much of ARC-AGI. The whole point behind the Blink benchmark is that humans can solve this, like, in a snap. So there's something about the perceptual priors, or the perceptual features that we extract as a natural part of looking at the image, that means we can very quickly answer these IQ test questions.

And it seems that understanding of images in VLMs just doesn't work the same way. So maybe they haven't got the same kind of perceptual scaffolding that we do. And therefore, I can answer it very, very quickly because I don't need to do an exhaustive program search. I've got some priors which can help me zoom in on the correct rotation, if you like, of the image in the IQ challenge. Therefore, select the right answer.

Otherwise, I have to do some kind of exhaustive reasoning over the possibilities, or as, you know, Chollet would say, a program search. So the job of priors in this case is to constrain your search space, right? So you zoom in on the correct answer very, very quickly. So my interpretation would be, yeah, they're probably not understanding the kind of perceptual elements of the image in the same way that you or I would.

And that doesn't mean that you can't do very well on a benchmark like this just by adding reasoning. My guess would be that o3, given a lot of compute time, would do a lot better on this than GPT-4o, which is what has been measured, simply because it can reason over the different possibilities; that would be the gap. Whereas with the ARC-AGI benchmark, you're presented with essentially a matrix,

you know, an array; the input actually comes in as an array. And so you can kind of reason over that as a series of text tokens, whereas the VLM has got to process the image in the Blink challenge, so it's not quite clear that you get such a clean decomposition of the IQ test image into tokens which you can manipulate through reasoning in the same way. But I'm going to guess that reasoning plays a role in solving this, and that you can

essentially brute-force your way through problems in the Blink dataset, perhaps by adding reasoning, but it just doesn't feel like the most efficient way to do it. And I think that's probably the point that the authors are trying to demonstrate. Yeah, and I mean, in this case, you're given only the image, right? So that's a notable difference from ARC-AGI, where you are given something that is text-token-representable and

you have the option to, you know, use images too, but you don't have to. Whereas here, you have to confront the image and make sense of it. Is there anything in this recent line of work? I mean, again, I don't know a lot about the human visual system, and I also don't know a ton about old

convolutional networks. But my general sense of the human system is that, as the information is processed through layers of the brain, we sort of gradually go from very simple, you know, edge-detection-type things and angle detection and all these sorts of low-level features, up to more and more semantic features as we get to higher levels. And that

definitely has been demonstrated in certain vision systems in the past, and they're labeled that way in the human brain. Do you think that there's something about this sort of vision transformer that's not doing that? It seems like that work was a little older, and obviously the paradigm has shifted. And convolutional networks also can have much more engineered features, where you can put these kind of specific priors into the structure itself of the convolutional,

whatever, processing, for lack of the right precise term. That's absolutely right, and it was pointed out in fact by the IDEFICS team: much more work has been put into looking at language models than has been put into looking at the actual architecture of vision transformers. And you're actually right. One of the things about a vision transformer is that those inductive priors that we used to code into CNNs are missing. And in fact,

that's why at smaller scales, probably, CNNs actually do better. And this was one of the findings from the Google team when they first introduced the Vision Transformer: the transformer takes over as the scale increases, but at smaller model sizes and smaller datasets, the CNNs do rather better. So there is an open question, I think, as to what we should be looking at in terms of the architecture of the vision transformer itself.

Another observation that has been made, I can't recall where from, is that a lot of vision transformers are now trained using these contrastive learning objectives. And could this be weaker than perhaps some other way of training? Because when we train a language transformer, we do it a different way: we have a generative pre-training

recipe, which is extremely effective. So could there be such a thing for a vision transformer? And that's such a great question that we might actually answer it in 10 minutes' time. So I'll just touch on one other example of a task that's very poorly solved by the current generation of vision language models in the Blink benchmark,

and that's called the relative reflectance task. So here the idea is that you give the VLM a picture, you put a couple of markers on the picture for two different areas, and you ask: which point has the darker surface colour, or are the colours about the same? You always get three answer options, so random guessing would get you 33%.

So in the paper they show an image of a hotel bedroom and it's got the kind of cream coloured headboard for the bed and a white wall behind it. But because of the impact of the light shining in through the window, you know, the pixels from the cream-colored headboard are actually a bit lighter than the ones on the wall.

However, just looking at the image, I can tell you that the headboard is cream-coloured and the wall is going to be white, and so I know the headboard is darker, and I can answer that in a blink, as they say, because my brain is doing some adaptation for the illumination, accounting for the effect that the light is going to have, where it's coming from and where it's shining. So these are some of the perceptual priors I'm bringing to the problem.

And it turns out that the VLMs have a really hard time solving this. Humans get around 95% of the answers right. Amongst the VLMs, the top-performing one was a LLaVA model at the time they tested, and it was doing a shade under 40% on this task. So that's just another example that they might not be doing things the same way we are. I also think that we should point out perhaps the most interesting observation from the Blink paper, in my book.

And I don't know if they explicitly mention this, but if you just look at the performance tables of the different models they tested, you can see they tested GPT-4V, GPT-4 Turbo, and GPT-4o. And you can see that 4o has improved on a number of the tasks, but has actually regressed on a number of other tasks, which is really interesting. So for example, there's been a significant regression in counting: 4V was solving the counting tasks at about 60%;

4o, they measure at 49%. So it regressed in its ability to count objects in the images. It's really not clear why this is. Is this an artifact of distillation? Is this an artifact of fine-tuning? Not sure. But it's a result. Yeah. On this, I can only say these things are weird beasts, and we find similar things in other areas too. I mean, not so much on, like, major benchmarks. I don't know if this would necessarily count as a major

benchmark; it's not usually one of the ones that they would report in kind of a headline table, so they don't typically let those degrade. But certainly every time there's a new model, somebody's got a complaint with it, right? With the super wide range of use cases that are out there. Yeah, I wonder if they're even measuring this kind of thing. I mean, they measure a lot, but...

Are they specifically looking at the Jigsaw score on the Blink? And maybe almost by definition not, right? Because this paper came after some of these models, right? Yeah, June, I think. June, I think, this year. Yeah, interesting. But counting is a really interesting one because on the surface, it feels like a really simple task to count the number of objects in an image.

And across the Blink counting task, it's in the middle of the pack in terms of how well it's solved. They measured the LLaVA 1.6 model, the 34-billion-parameter variant, leading the pack of models with a 66% success rate; humans score near 98%. Again, it's a four-choice question, so the random-choice baseline is 25%. And I actually found another data point, a paper called "The Effectiveness Assessment of Recent Large Vision-Language Models", again from June this year,

which was finding that a number of open-source vision language models, including the latest LLaVA, at that time 1.5, were outperforming GPT-4V on counting tasks. So this is kind of a second data point. They weren't using the Blink dataset for this; they had another methodology.

It's very interesting to ask why. And a team from DeepMind did a bit of work earlier this year; they were actually looking at getting diffusion models to produce the correct number of instances of an object in a picture. But as part of doing this, they did a scan across some of the commonly used image-caption pre-training datasets. And they found that there are some captions in there which do denote the numbers of objects in the images, but they said they're very, very scarce,

and actually possibly not enough to learn how to bind the number correctly to the appropriate features they're extracting from the image. And this might be one of the problems: just the scarcity of this kind of task in the pre-training dataset. So it does make me wonder. A lot of times earlier we've seen the story that improving a task-specific capability of a vision language model has been a consequence of augmenting a dataset

in order to be able to train it at modest scale on that particular task. And maybe that's the way forwards with things like counting: we just need an augmented dataset to be added into the mix here, and that might improve things. But not all is lost if you're trying to train a VLM to do counting. If you look at Anthropic's cookbook collection of best practices for Claude, they show you how to use good prompting techniques

in order to make Claude think through the counting task, decompose the image, and then ask what it can see. And that does work a lot better. But again, it shows you that reasoning can compensate for some of these perceptual deficiencies, yet the perceptual deficiencies are there. Yeah. Okay, cool. So a couple of minutes ago, you were asking: is the vision transformer part of the problem here in learning this kind of rich

representation of the images? And you asked: has jettisoning the inductive priors that we brought to CNNs been a problem here? And I said, well, the other thing that I've heard people mention, or read experts mention, is that it's the contrastive learning objective you train a vision transformer with that might be part of the problem. And actually, a team at Apple earlier this year took a look at this very question.

And they said: can we do multimodal pre-training of a vision encoder in a different way, i.e., not using a contrastive learning objective? So what they did is they took a vanilla vision transformer, and they asked if they can change the recipe for pre-training it such that they use a kind of generative training objective. So I'll explain the setup and then we'll see how it works, how they do it. You start with a vision transformer and a transformer decoder, both trained from scratch.

And it's trained on a large mix of image-caption pairs. The captioning is a mixture of alt text scraped from the web and synthetic captions, for example generated by a LLaVA model or something similar. The data is prepared by taking image patch tokens, which are small snippets from the image, and text tokens, and always presenting them in the same order: image tokens first, then text tokens, with a simple tokenizer in place. Training is done using prefix attention.

So prefix attention is where you basically randomly mask a subset of the visual tokens, which all appear first. It could be that all of them are masked, or only a small number of them. And then the decoder has to generate the rest of the sequence, which will include generating the missing visual tokens (remember, these are soft tokens; they're not from a codebook), followed by all the text tokens.

And at this point it's using standard left-to-right attention masking. Okay. The loss is only calculated, though, over the non-prefix tokens, because obviously you're going to feed in the masked ones first, so it computes the loss over the prediction of visual tokens followed by text tokens.
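To make the prefix-attention idea concrete, here is a minimal sketch of one common way to build such a mask; the function name, shapes, and the choice of bidirectional attention within the prefix are illustrative assumptions, not the paper's code.

```python
import torch

def prefix_attention_mask(num_image_tokens: int, num_text_tokens: int,
                          prefix_len: int) -> torch.Tensor:
    """Prefix-LM style mask: the first `prefix_len` (image) positions attend
    to each other freely; every later position attends causally.
    mask[i, j] == True means position i may attend to position j."""
    total = num_image_tokens + num_text_tokens
    mask = torch.tril(torch.ones(total, total)).bool()  # causal baseline
    mask[:prefix_len, :prefix_len] = True                # bidirectional prefix
    return mask

# Example: 16 image patch tokens followed by 8 text tokens, 12-token prefix.
mask = prefix_attention_mask(num_image_tokens=16, num_text_tokens=8, prefix_len=12)
print(mask.shape)  # torch.Size([24, 24])
```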

You're doing generative pre-training, so you're asking the decoder to generate the missing image tokens, but you can't use the same loss function for text and image, because all the text tokens come from a codebook. So you can do a standard cross-entropy loss for the text tokens. For the image tokens, they very simply use a mean squared error loss.

So the decoder generates an image token. It's seen some fraction of the image already. It generates what it thinks the next patch in the sequence should be, and they simply compare it to the real token and compute the MSE loss. Okay, so that's a decoder-only, generative training recipe for the vision transformer. And they're basically training the decoder and the ViT from scratch in one step using a pair of objectives.
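As a rough illustration of the "pair of objectives" just described, here is a sketch of a combined loss: mean squared error on the continuous image patches (excluding the prefix) plus cross-entropy on the discrete text tokens. The shapes, names, and equal weighting of the two terms are assumptions for illustration, not details from the paper.

```python
import torch.nn.functional as F

def generative_pretraining_loss(pred_patches, target_patches,
                                text_logits, target_text_ids, image_loss_mask):
    """pred_patches / target_patches: (B, N_img, D) continuous patch vectors.
    text_logits: (B, N_txt, V) decoder logits; target_text_ids: (B, N_txt).
    image_loss_mask: (B, N_img), 1 for non-prefix image positions."""
    # MSE on predicted vs. true image patches, counting only non-prefix positions.
    mse = ((pred_patches - target_patches) ** 2).mean(dim=-1)
    image_loss = (mse * image_loss_mask).sum() / image_loss_mask.sum().clamp(min=1)

    # Standard next-token cross-entropy on the discrete text tokens.
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        target_text_ids.reshape(-1),
    )
    return image_loss + text_loss
```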

Recall that the idea here is to train a new vision transformer. And what they demonstrate in the paper is that they can jettison the transformer decoder that they've trained as part of this recipe, and then connect their new vision transformer, trained in this way, to, I think they used a Llama 3 model, to create a vision language model. They connect the two in an autoregressive fashion using a simple MLP, and then they train it on the LLaVA

supervised fine-tuning mixture, which we talked about earlier. And this means they can do a nice ablation: they can compare the performance of the vision language model whose vision transformer was created using this generative pre-training recipe against a drop-in vision transformer of the same size but trained on a contrastive learning objective.
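For reference, the "simple MLP" connector mentioned a moment ago is typically just a small projection network that maps vision-encoder outputs into the language model's embedding space. A minimal sketch, with placeholder dimensions (the actual sizes depend on the chosen encoder and LLM):

```python
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """Projects vision-encoder patch embeddings into the LLM embedding space
    so they can be prepended to the text token embeddings."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):      # (B, N_patches, vision_dim)
        return self.proj(patch_embeddings)    # (B, N_patches, llm_dim)
```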

And what they find is that they see improvements on all of the VLM benchmarks they tested against, but particularly in captioning and in visual question answering; those are the two that see very significant improvements. So this is really interesting. One of the questions was: is it the contrastive learning objective that is limiting the power of the vision transformer element of the recipe? The answer seems to be yes.

Switching to the well-proven generative pre-training recipe that we know works in language models works really well for the vision transformer as well. What I would be really interested to see is whether this AIMv2 vision encoder from Apple, injected into a vision language model, has an effect on Blink performance. I don't know if it will or won't, but I think it'd be really interesting to know if it did.

I think what we're going to see is more experimentation in the next year or two in this space. I think we're going to see the vision transformer come under scrutiny, and then we're going to see more really smart ways of trying to adapt and enhance it, maybe even revisit the way it works and reintroduce some of the inductive priors that we lost. Yeah. So can we just linger for a second more on the difference between the contrastive training and this generative training?

I think I got it, but give it to me one more time to make sure, because this seems like a pretty important conceptual distinction. Training on a contrastive objective, which pretty much everything we've looked at before now has been done with, apart from the very original vision transformer, means you encode your image and you encode your caption, or your paired text.

You've got two encoded vectors at this point, and in a contrastive learning setup you look across your batch and you say: I've got this one true pair within my batch, I want the cosine similarity of those two embeddings to be high, and I want the cosine similarity of all the non-true pairs in that batch to be low. And the contrastive learning objective kind of forces

the embedding space, distorts it if you like, such that you get that result. So you'll always get an image paired closely in the embedding space with its relevant caption, or with similar text. Here, instead, we're using a generative pre-training objective. So what's happening is we're prepending each example with a load of visual tokens,

we're masking a bunch of them, in other words we're not computing the loss over, let's say, the first three quarters of the vision tokens, and then we're simply asking a decoder to decode the rest of the visual tokens followed by the text tokens. We're jointly training the vision transformer and the decoder at the same time.

And we're just measuring its success in reconstructing the visual tokens by the MSE between what the true next token was and what it predicted, and then we're evaluating its performance on text in the same way. So it should be learning to attend to the caption, the text that appears afterwards, and it should be learning to attend to the image earlier, because it's using this decoding strategy. But what they found was that if you put all the image tokens first and then the text,

a lot more of the onus is on trying to learn the remaining visual tokens, the tokens from the image. And this appears to make it a much stronger vision encoder than doing it the other way around. Now, that original vision transformer, it was just trained purely on images, and that was just filling in masked tokens as well? Yeah, the very first vision transformer was actually trained as a classifier.

Oh, classifier, that's right. And I believe on ImageNet. And so the output sequence contains a classification token. They simply took that, and then with a simple linear projection, trained it to predict which one of a large number of categories the image belonged to.

That was how the very first vision transformer was trained. But all of the VLMs we've looked at have basically used a contrastively trained vision transformer. And we talked about how that was done when we looked at CLIP.
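For comparison with the generative recipe above, this is roughly what the contrastive (CLIP-style) objective looks like in code: a symmetric cross-entropy over the batch's image-text similarity matrix. It is a sketch of the general technique, not any particular model's implementation, and the temperature value is just a common default.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    """image_emb, text_emb: (B, D) embeddings from the two encoders.
    The diagonal of the similarity matrix holds the true pairs; the loss
    pushes their cosine similarity up and all off-diagonal pairs down."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match images to captions
    loss_t2i = F.cross_entropy(logits.t(), targets)  # and captions to images
    return (loss_i2t + loss_t2i) / 2
```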

Yeah, yeah. So with this Apple approach, would you call this early fusion or late fusion? Because I'm looking at the diagram in the paper, and there's a cross-entropy loss and there's a different encoder for the vision component. So if you think of the fusion question as being about where in the architecture it happens, it seems more late.

But given that they're all being pre-trained together at the same time, in some sense that seems quite early. Yeah, I would agree with you. It seems like a bit of both, doesn't it? I'm going to call it late, because there's still a vision transformer that's separate from the language model it's connected to. But the alignment between the two has happened a lot earlier.

I like it. I don't think I have anything else on the Apple one. And I see we've got, I wouldn't call it the final frontier, but the next frontier up next. I think these papers are really interesting too. So yeah, I was wondering about just

skipping over the Chameleon one and just talking about the Transfusion paper. And the reason I was thinking that is because I think Transfusion is just so much better, not only in terms of capabilities but also its training efficiency. It feels like this is just great, and it's the way that DeepSeek built their model as well. I feel like this is the recipe, but I trust your judgment, whatever you think best,

and then we just finish up with looking at the frontier labs' offerings; we'll just go through that. Sounds good. I'll try not to derail us too much. No, don't worry. We've got plenty of time in hand. Okay. So we've mainly focused today on vision language models, which have been about understanding

images, and to some extent video. We haven't really mentioned video today, but a lot of the models that we've described can handle video, because a sequence of frames from a video is not so different to a sequence of pictures,

and so a lot of that you get for free, basically. Whether you consider video a separate modality or not is kind of up to you. There are some video-specific benchmarks, such as the Video-MME benchmark, that you can look at to see how well the different models handle sequences of video frames. But what we haven't talked about, what's been missing from everything so far, is the ability to generate images. We've really looked at image understanding.

Obviously, the simplest way to have a VLM, or an AI in general, generate an image is to have it generate a prompt and then hand that prompt off to a diffusion model and have the diffusion model generate the image. And indeed, if we look at the frontier labs today and the services they offer, this is what Gemini and GPT-4o are doing with Imagen 3 and DALL-E, respectively. Though we should put a caveat in parentheses behind that.

Actually, it seems from the original GPT-4o announcement earlier this year that it is a true multimodal model and can directly generate image outputs. I mean, the O stands for omni; it can generate other modalities too. But this capability has not yet been released. It's been promised but not released, due to safety and infrastructure concerns, I think is what OpenAI said. But it's interesting nonetheless. And the question is: how might this be working?

And there's actually been quite a lot of work come out of, notably, FAIR at Meta over the last year or two, looking at true multimodal models. And if you're interested, the sequence of papers to look at is CM3, which they called a causal masked multimodal model of the internet, and this was followed by a model called CM3leon, which is pronounced "chameleon", which was then helpfully followed by a model called Chameleon, spelled chameleon, not the same thing.

But this was all part of Meta's exploration of true multimodal models that could take both images and text as input and generate them both. The one I wanted to focus on today is the latest in this sequence from Meta, and it's called Transfusion. In the Transfusion recipe, they're actually looking at going beyond simply images and text, and it introduces what I think is the most promising recipe for multimodal generation as well as multimodal understanding.

So if we focus on the image and text part of this: the approach is to pre-train a single transformer on an even mix of image and text data, but during pre-training they're going to use a different training objective for each of them. Previous models in this series from Meta, when attempting this, had actually quantized the image tokens before they entered the model. Now, it's probably just worth saying very quickly:

we've talked before about how text tokens can obviously be looked up in the codebook, so they're quantized. But we've used the term image token very loosely, because we actually know they exist on a continuum. They don't have to, though. One way to quantize image tokens is using a vector quantization method. If you've come across the VQGAN architecture, this is one way it's done.

You pass the image through a variational autoencoder layer to get some latent vectors, and you then have a learnable codebook. You can imagine this, if you've ever studied how k-means clustering or something similar works:

you're learning where your cluster centroids are in your vector space as part of your training recipe. So the idea is you then quantize to the closest of your codebook vectors in this latent space, and maybe you've got 10,000 of them. So you encode your image token, quantize it to the nearest entry, and then you decode again using the variational autoencoder's decoder part.

And now you just compute a number of metrics to say: okay, I've decoded my image again, having encoded it, quantized it, and decoded it, and I'm going to compare it to the original image, run a number of metrics, and calculate a number of losses over that to see how close I was. And I keep updating my learnable codebook until I'm getting images that look very close to what I put in.
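The quantization step itself is just a nearest-neighbour lookup against the learnable codebook. Here is a minimal sketch of that lookup; the sizes are illustrative (the 8,192-entry codebook echoes the Chameleon figure mentioned later), and real VQ training adds further machinery such as commitment losses.

```python
import torch

def vector_quantize(latents, codebook):
    """latents: (N, D) continuous latents from the encoder.
    codebook: (K, D) learnable codebook entries.
    Returns (indices, quantized) with quantized[i] = codebook[indices[i]]."""
    distances = torch.cdist(latents, codebook)  # (N, K) pairwise Euclidean distances
    indices = distances.argmin(dim=-1)          # nearest codebook entry per latent
    quantized = codebook[indices]
    return indices, quantized

# Illustrative usage: 1,024 patch latents quantized against 8,192 entries.
latents = torch.randn(1024, 256)
codebook = torch.randn(8192, 256)
ids, quantized_latents = vector_quantize(latents, codebook)
```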

And now I've got a codebook that I can use to quantize any image tokens that I see. So that's how they were doing it previously. And they actually jettisoned this approach completely for Transfusion. They said: we're not going to quantize any of our image tokens at all.

We're simply going to pass them through the encoder part of a variational autoencoder, and this turns them into latent patches, if you like. They then pass those through either a multilayer perceptron or a U-Net downsampling block. This gets them their latent vectors, and they simply use those as tokens and inject them into their transformer. So they're continuous.

And then the text tokens are handled in the usual way: they're turned through a tokenizer into their vectors. Now they feed this stuff into the transformer, and the idea is that you're training using next-token prediction, but with different losses for text and images. The text side is easy: we handle it with a simple linear layer and use cross-entropy, the same way we always train a transformer.

But when we detect that we're outputting image tokens, we process them through the corresponding U-Net up path and then through the variational autoencoder's decoder to actually start generating an image, and then we use a diffusion loss objective there to train that part.

So we've got two different things going on here. And Transfusion really is, if you look at the architecture diagram, like a latent diffusion model that's been pulled in half. It's got this transformer in the middle, and then you've got both sides of a Stable Diffusion architecture, for example, pulled apart with a transformer stuck in the middle, with the text being handled in the usual way.

The very important thing here is how you do the attention masking when you train it. The training recipe uses causal attention masking for all the text. So if you're decoding a text token, doing next-token prediction, you can look at everything to the left.

For images, if you're generating an image token, you've got bidirectional attention, so you can look at all of the other tokens in the image when you're reconstructing one. This means every patch can attend to every other patch within the same image, but you can only attend to text, or to patches of other images, that have appeared previously in the sequence. Does that make sense?
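A sketch of that masking pattern: causal attention everywhere, with full bidirectional attention added inside each image's span of patch tokens. The function and its arguments are illustrative, not taken from the Transfusion code.

```python
import torch

def transfusion_style_mask(seq_len: int, image_spans):
    """image_spans: list of (start, end) index pairs, one per image.
    Patches within a span attend to each other freely; everything else
    (text, and attention across different images) stays causal."""
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()  # causal baseline
    for start, end in image_spans:
        mask[start:end, start:end] = True   # bidirectional within the image
    return mask

# Example: 4 text tokens, then a 16-patch image, then 12 more text tokens.
mask = transfusion_style_mask(seq_len=32, image_spans=[(4, 20)])
```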

More complicated masking to set up, more complicated decoding regime: you need to know whether you're decoding text or decoding images, and you've got the U-Net and VAE structure around it. You've properly got an amalgamation of the two architectures here. It sounds very complicated to set up, very complicated to make work. But what they discovered when reporting Transfusion results is really interesting.

Compared to the previous way of doing it, actually quantizing all of your image tokens as in their previous Chameleon series, they say they're producing images of a similar quality after training with only about a third of the compute,

which is truly impressive. So that really suggests they're onto something here. In the previous series of papers, the Chameleon papers, they described having real difficulty actually getting the pre-training to work stably, and they had to introduce a number of what they called architectural innovations, which feel like sticking plasters, and other changes to get it to train properly. Nothing like that was reported in the Transfusion paper, suggesting it had a smoother time of it.

And then, very interestingly, they found that on text-to-text tasks this new recipe was matching the training losses that they saw in their previous series, the Chameleon series, at half the FLOPs. So we're getting not only better images but also better text. It sounds like a really efficient recipe, just rather complicated to set up. And there's not much more to say about it, other than I think it's very similar to the one followed by DeepSeek.

So that's another recent model. It's fairly new: late 2024, the Transfusion paper. But this feels like something we're going to see a lot more exploration of, these kinds of hybrid architectures. And the team did some early experiments to ask if they can adapt the same recipe across new combinations of modality. Can they look at audio, for example? They did some small-scale experiments in the paper, and they suggest the recipe will work for those as well.

One thing I'm not very clear on is why you would ever quantize the image tokens in the first place. I recall reading about that some time ago and I was like, well, why? It just seems strange. Because if you do that, then you can simply train a single decoder transformer to produce both images and text. You've just got a larger codebook, or two codebooks. And that means you can use much simpler training objectives: you can, for example, use cross-entropy for both of them.

You don't need all this other machinery in there to do the autoencoding and then the U-Net downsampling and upsampling. I was surprised when they mentioned that they achieved the same training losses at a half or a third as many FLOPs; I thought it was really interesting, because it sounds to me like, with all the other machinery, it's actually going to be much more compute-intensive to build one of these hybrid architectures like Transfusion.

So I'm assuming that is all factored in when they talk about the losses seen at that number of FLOPs. It is a more complicated model. I'm scanning through both of these papers, and it is remarkable, on the Chameleon one, which is the one that has the discrete or quantized image tokens, that the image outputs do look pretty good. You would think... does it say quickly how big the vocabulary size is in the paper?

It's not that big. It's only a codebook size of 8,192. And a 512 by 512 image gets broken into 1,024 discrete tokens, which... Yeah, that's weird. I don't know. You've got roughly 250,000 pixels, right? Call it 500 times 500, four zeros with a 25 in front of it. So 250,000-odd pixels and 1,024 discrete tokens, which works out to roughly 250 pixels per token in this Chameleon thing. 16 by 16 patches, then, by the look of it: if you take 512 divided by 16, you get 32, square that, you get 1,024. So 16 by 16.
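The back-of-the-envelope arithmetic being done here, written out (using the exact 512-pixel figure rather than the rounded 500):

```python
# Chameleon-style tokenization numbers as discussed above.
image_size = 512                              # pixels per side
num_tokens = 1024                             # discrete tokens per image
patches_per_side = int(num_tokens ** 0.5)     # 32
patch_size = image_size // patches_per_side   # 16 -> 16x16-pixel patches
pixels_per_token = patch_size * patch_size    # 256 pixels per token
codebook_size = 8192                          # distinct token ids available
print(patches_per_side, patch_size, pixels_per_token, codebook_size)
```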

So when you think about it, 16 by 16 pixels, and then you've got roughly 8,000 tokens to cover that: how many different variations of a 16 by 16 patch do I need? I would still think a lot more. I don't know. Each pixel is three colors with a range of 256, so just the number of possible colors for a single pixel is 256 cubed, right? And from that alone I would think, boy, 8,192 tokens to represent

a 16 by 16 little patch just does not seem like enough. But it's hard to argue with the results. The images do look remarkably natural. I'm quite surprised that works. I would have intuited that it would be much more artifact-y, that you would be able to see where these tokens come together; I would expect you would see the seams of this.

Generally speaking, I don't see them. That's quite crazy. They certainly reported for Transfusion that they ran some evaluation benchmarks and compared it to DALL-E 2 and Stable Diffusion XL, and they found that they were outperforming both of those models, and that they were reaching Llama 1 performance on text-only tasks. That was their 7-billion-parameter Transfusion model.

And I don't know if this interpretation is warranted, but it's really interesting to see such fantastic performance. I know this more recent architecture benefits from a lot of tips and tricks used in training it which weren't used for SDXL or DALL-E 2. One also wonders what role this mixed-modality learning has played in improving the ability of the model to generate images.

I'm not sure, but it's an interesting question, given that we've certainly seen it happen the other way around. Yeah, the image editing on this is really impressive, too. I'm now switching over to scroll through the Transfusion paper, and I think the demos of this that are most compelling to me are basically instruct-style editing. There was a model,

geez, when was this? I guess it was almost two years ago now. InstructPix2Pix was the one where basically you could give an image and give a command, and it would attempt to edit that image according to your instruction. And I did one of an ultrasound image of my son, who's now a year and a half, and said, make it look like a newborn baby instead of an ultrasound. And it kind of did that. Not too bad. It was a funny experiment, anyway. But those were limited, to say the least,

in terms of what you could do, how good the quality was, and how much fidelity it actually had to your instructions. It was kind of all over the map. Here, I'm like, man, this is exactly what we need for the Waymark application, because we're talking very precise local edits that are not changing the overall composition, not making the image feel like it's a totally different image, but doing the sort of precise cleanup that you might see from

an actual pro image editor doing it in Photoshop or what have you. Yeah. A couple of examples here stand out to me most. One is: change the graffiti on the side of a truck into calligraphy writing. And the before and after is just amazing. I mean, the before is, like,

tagged, of course, with graffiti. And then it's just like, man, it's perfectly in situ calligraphy that's been printed onto the side of the truck. Examples of removing things, you know, replace one object with another, change the color of this thing. This is pretty impressive. Yeah, it is pretty impressive. This has never been released, right? Do you know why they haven't released this? No. To my knowledge, none of the series has been released. And I'm not sure why.

One does wonder what image sources they used for the different models. They do say, at least in some of the papers, that they used only open-source or publicly available image sets. I don't think that's true for all of them; I'm not sure about everything they used. They may also have some in-house datasets that they used, which for whatever reason they don't want to release, but I'm not a hundred percent sure.

But you're right, it certainly beats the spaghetti-web ComfyUI canvases that I end up building to do image editing, to just be able to write it in a single sentence. Yeah, you can really see the future here, I think, and in some of the GPT-4o demos as well. It's funny, people ask us all the time, with the Waymark application: do you use Stable Diffusion or DALL-E or whatever? And we've actually found that those things are not very useful for our users, because

they aren't realistic. A small business wants to present itself in a positive light, but in a realistic light. They don't want it to feel like, wait, this is nothing like what I saw on TV when you actually show up. The lack of control and the difficulty of grounding the purely generative models has been a real challenge. And you can of course do image prompting, but a lot of times those approaches also haven't worked super great, where they

noise your input image and then take it in kind of a different direction. And it's like, I actually wanted something more local and specific: maintain the integrity of this, but change it in this one very specific way. And that has been hard to do. I can also imagine...

There are other techniques popping up for this now too, but character consistency has always been a real challenge for people who are just trying to create original content. The Waymark creative team has done some really cool stuff. We did an episode, I hope we have another one soon, but we did an episode on, basically, a short film that they made with all DALL-E 2 images. That's been a while. Now there's a part two that uses newer models and all sorts of new techniques.

Character consistency was a huge problem in those early models. Scene consistency was another huge problem. I mean, they came up with elaborate prompting techniques and all sorts of ways to try to get around that, and I think they did a really remarkable job with what they had at the time. But when I look at this, I'm just like,

man, a lot of it falls out of this very quickly, where you can just say: change this, do this, put this guy in a different scene. Next thing you know, you're off to the races on a lot of the different things you want to do that have been hard. So, yeah, presumably that'll be coming at some point from an API provider, GPT-4o or otherwise, or maybe they'll finally get around to releasing this. But it is...

I noticed, too, that Lily Yu, who's a former guest (I had her on to talk about Megabyte, basically a byte-level transformer), shows up on another one of these continuous-space transformer projects. Interesting to see her name pop up there. Yeah. This has been a fantastic deep dive and walkthrough. Where does this leave us now? Well, maybe we could just wrap up with a quick summary of

what the frontier labs are offering and what's winning on the different benchmarks. I found I had to compile this data from a number of different sources because it wasn't all available in one place.

But if you're building with one of these models and you know you want something that does well on MMMU or, for example, DocVQA (that's extracting information from documents and images), or maybe you're interested in what does well on Blink, I tried to compile this into a simple table, and I can walk you through what I think is winning in each case. Cool. Yeah, this is really useful.

So what I looked at: I looked at the Grok 2 beta. I looked at the latest version of Claude 3.5 Sonnet, the "new" version. Gemini 1.5 Pro, because I was unable to find results yet for Gemini 2. For o1-preview I could only get one result, and that was MMMU. But for 4o I could get results on all of them. And then a couple of open-source models as well.

So we look at the leaderboard here and ask: okay, what is currently doing best on MMMU if I really care about reasoning over images? And o1 is standing head and shoulders above everything else with 78% there. If we look at what's next down from that, we find Claude 3.5 Sonnet (new) and the InternVL 2.5 model, which is currently available, I believe, through an API; we find those at 70-and-a-bit percent each.

So Gemini 1.5 Pro is coming a few percentage points below those two. That's where we're sitting on MMMU at the moment. I'd be interested to see what Gemini 2 does there. But the Grok 2 beta is sitting at 66% on MMMU. One quick note on that, too, because this InternVL, I keep forgetting as we've gone through this: who made this model? OpenGVLab.

Yeah. So, I think we touched on this a little bit earlier too, but it's another good reminder that the Chinese models, the Chinese labs, are not super far behind. In this case they are right there, head-to-head with Anthropic, five months earlier. The LLaVA-OneVision model is from ByteDance. The Qwen model is from Alibaba.

Yeah, they all perform very well. And the reason I mention the five-months-earlier thing is because sometimes people will say, oh, well, they're starting with an American thing, or they're starting with Llama and not acknowledging it, or they're training on the outputs of the

American models, whatever the rationalization often is, and I'm sure sometimes that stuff is happening. But here, as of now (and I guess we've got caveats around o3 and full o1, where not all the data points are available yet), it is striking that this Shanghai AI group is three tenths of a percent lower than Claude 3.5 Sonnet (new), a full five months ahead of time. So they're definitely not,

definitely not, training on Claude outputs to achieve that. And that is higher than GPT-4o by just a point. But still, I think it is very much worth keeping in mind that the gap here is basically zero. And arguably, if you squint at it, you could even say the Chinese labs are maybe a little bit ahead. But I would say,

yeah, overall, rounding, you should probably say they're about the same. Yeah. We should also note that the parameter counts in their models are all around the 70 to 80 billion mark, and presumably they're going to build larger and larger, and presumably they'll get better and better as a result.

But a lot of what we talked about earlier in the podcast, about how fast the training recipes and the datasets are developing, probably accounts for much of this: the more recent the model, the better it gets, because people are learning so much about how to do this effectively. I think it's also interesting to look at the DocVQA scores here. All of the models I mentioned are doing above 90% on that benchmark.

The best I can find is, again, the Qwen2-VL model, which is currently doing 96.5%, leading across open-source and proprietary models. I don't have results for the new Claude 3.5 Sonnet or for o1 on DocVQA, just to point that out. But that model is doing better than, for example, Grok 2, and better than Gemini 1.5 and GPT-4o on DocVQA, so it's kind of leading there. And Blink is very interesting, because we haven't got results for everything here, but

at the top of the leaderboard we've got 4o, which has about 63.2%. I have seen a result that's higher than that, but I've seen three results for it, and two of them were at 63.2% on Blink, so I'm going to go with that one. OpenGVLab's InternVL 2.5 is clocking in at 63.8%, so doing ever so slightly better. Gemini is at 61%, and Claude 3.5 Sonnet (new) is at 56.5%. So those are the best and most up-to-date scores I can find on those three benchmarks.

It's also worth looking at the mini class here. By this I mean the flash and mini versions of some of these models. They're obviously a little bit lower on most of the benchmarks, since they're smaller models, but it's interesting to know: if I were to pick one, what would I use? If we look at MMMU, we've got results for Gemini 2.0 Flash, which is clocking in at 70.7%. And that makes it stand head and shoulders above everything else in the mini class,

by which I'm including 4o mini and the InternVL 8-billion-parameter model. I haven't got a result for o1 mini. Grok 2 mini is at 63.2%. So your best bet if you want reasoning over images would seem to be Gemini 2.0 Flash at the moment. For DocVQA I've got very few results here, but the InternVL 2.5 8-billion-parameter model is clocking in at 95.1%, which seems pretty damn good to me.

Interestingly enough, that's exactly the same result as it gets in its full 78-billion-parameter form. And Blink, we'll just finish on Blink for the mini class of models. I couldn't find a score for Grok, for Gemini 2.0, or for o1 mini, but I could find one for GPT-4o mini, and that's clocking in at 51.9%, which means the InternVL 2.5 model is doing slightly better at 54.8%.

The surprise winner, coming in from left field in the mini class, is Microsoft's Phi 3.5 Vision model ("fi" or "fee", depending on how you want to pronounce it), which is a 4-billion-parameter-class model, scoring 58.3% on the Blink benchmark. It's not doing so well on MMMU, but it's doing really well on Blink, which is really interesting. So I combed the technical report for the Phi Vision model, and there are just a couple of interesting details from it, not enough to draw conclusions on.

For a 4-billion-parameter-class model, they used half a trillion pre-training tokens from their mixed pre-training dataset, and that's a big dataset for a small model. Think about the InternVL 2.5 model: they said they trained on 128 billion tokens. The Qwen2-VL model, 1.5 trillion tokens, but those models are many times the size. So it's a fairly large training dataset for a small model.

The other really interesting thing was that they mentioned their SFT dataset, which they described as a combination of datasets, a significant component of which was built in-house by Microsoft. So this is instruction fine-tuning data, and it's an extremely large dataset, especially for a small model. I can't recall seeing one that large anywhere else in the research.

And they also mention performing DPO, which only a couple of the other models, like Llama 3V, explicitly mention. So I'm not sure why it does so incredibly well, but anecdotally I have seen plenty of commentary online with people saying, wow, the Phi 3.5 model is really good at visual understanding. So, really interesting. Perhaps something we should have dwelt on a bit more in the podcast.

And that is about everything we wanted to cover today. We've kind of told the story of the last couple of years of VLMs. What are we going to see in the future? I think we'll see a lot more of these true multimodal models following the Transfusion recipe. I do expect to see the scale and parameter count of open-source VLMs increase further, particularly now that we know this kind of progressive upsizing of language model backbones works so well.

We should expect to see more innovation in the pre-training of vision transformers, or maybe even their replacement at some point, and a continuation of this production of new fine-tuning datasets that contain programmatic or human augmentations. So I would, for example, expect to see new fine-tuning datasets there.

What we haven't seen too much of as well (we just mentioned DPO at the very end there) is exploration of the role that alignment post-training can play in vision language models. So I expect to see that explored too. And that concludes our not-quite-so-whistle-stop tour through the last two years of vision language models. Amazing. Well, the depth of research that you put in to make this possible is

outstanding and definitely much appreciated. And I learned a lot from it. So as you know, that's how I tend to score these episodes for myself. And I come away definitely with a much better understanding of the various options and strengths and weaknesses and even a few prompting techniques along the way. How does this relate to what you typically do? And maybe in the last couple of minutes, just tell us a little bit about like your normal work and the sort of stuff you do commercially.

Yeah, so a lot of this is relevant to some of the work that I'm doing. Veratai is the small consultancy that my colleagues and I set up a couple of years back. We do a lot of AI strategy work. Before that, my background was in data science and analytics strategy, and we've moved on to thinking about

how you develop and build an AI strategy, especially if you're an SME. I think that's our sweet spot, although what we do does work, I think, in larger organizations if you're working at the department level. So we do a lot of that, but we also do a lot of prototyping work for people, proofs of concept. You have some ideas in your strategy, things you want to follow, things the company wants to try out. So we have a very structured way of

performing these experiments cheaply and finding out what's easy and what's difficult. Because as you probably know, a lot of working with AI today is your mileage may vary. So we kind of know what the best practice is, but it's very hard to know from the outset whether you're going to get great results given the data or compute constraints that your client may have.

And so we've kind of developed this rapid experimentation methodology. But in some of the domains we're working on, and we do some in medicine, particularly looking at medical terminologies and how they get used by language models and how language models can work with them.

But also, I've spent a lot of time looking at open-source intelligence for a few clients recently, and there, a lot of the data that we're trying to interpret is multimodal in nature. So there's lots of trying to understand: how does this image correspond with this claim that's being made? Can we infer who might be in the image, or what the image is about, given some surrounding text or context that we understand?

And so it's in that domain that we've certainly been working with some of the VLMs here. Sort of reflections, probably, of some common experiences that we've had. I don't do a ton of this sort of stuff; mostly I've done it just for my own company, which I've mentioned a dozen times already,

but occasionally I'll take on a project for somebody who asks me for help with something. And it's an interesting juxtaposition, often, and I'm sure you've found various versions of this: a lot of times you can answer somebody's question really quite quickly. I'm always reminded of Tyler Cowen's answer when he's asked how long it took him to read a given book,

because he's a famously fast reader. People ask him that question and he always answers with his age. And I think this tour through vision language models is a good reflection of how much depth, and obsessive quality of research, has to go into the ability to then

turn around quickly and say, I think I know what to do here for your particular situation, and I think we can put together a proof of concept on that pretty quickly. But that's because you've done this extensive exploration of everything that's out there and have an already fine-tuned intuition for which direction to go. Yeah, the way we do strategy projects relies on this a lot as well. You have to come with prepared minds. You have to come having

been immersed in, and soaked up, everything that's happening, so that when people say, this is the problem that exists here, you can pattern-match it to something. And I'm sure, like you, I've got a big searchable archive, like my own personal RAG index that I maintain at home, of just everything I ever come across: papers,

newsletters, Substacks, whatever it is, sits in there, and I just try to pattern-match it so that I have some grasp of, okay, this is how we're going to suggest the options for how you might solve this problem: A, B, or C. And yeah, you do need to be steeped in it, I think, in order to spot the opportunities to use the technique, the technology, the model, whatever it is, when you actually see it in practice. Yeah. Cool. You want to tell us where we can find you and the company online?

Yeah, you can find the company website at Veratai, which is V-E-R-A-T-A-I.co.uk, veratai.co.uk. Or you can find me on LinkedIn, Will Hardman. I'm writing a fair bit at the moment about AI strategy, and next year we'll be writing about various other things, and maybe about vision language models as well.

Well, I'll be sure to connect with you there and encourage the audience to do the same. This has been a fantastic walkthrough of vision language models. I know a lot of work has gone into it, and if you want to tackle another topic like this, I would love to do it. I will say thank you for this one. And officially, Will Hardman from Veratai, thank you for being part of the Cognitive Revolution. Thanks.

This transcript was generated by Metacast using AI and may contain inaccuracies.