Abstracts: May 6, 2024

May 6, 2024 (14 min)

Episode description

Researcher Michel Galley explores how he and fellow researchers combined new and existing data to create MathVista, an open-source benchmark for measuring the mathematical reasoning capabilities of foundation models in scenarios that involve text and images.

Read the paper

Get the code & dataset

Transcript

[MUSIC]

GRETCHEN HUIZINGA

Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

My guest today is Dr. Michel Galley, a senior principal researcher at Microsoft Research. Dr. Galley is the coauthor of a paper called “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts.” Michel, thanks for joining us on Abstracts today!

MICHEL GALLEY

Thank you for having me.

HUIZINGA

So I like to start with a distillation or sort of an elevator pitch of your research. Tell us in just a couple sentences what problem or issue your paper addresses and why we should care about it.

GALLEY

So this paper is about evaluating large foundation models. So it's a very important part of researching large language models because it's a good way to evaluate, kind of, the capabilities—what these models are good at and not good at. And a part of the focus of MathVista is to evaluate these large foundation models in a multimodal setup, so when the input to the model is actually not just text but both text and images. And then, an example of a task that such a model would perform is, like, the input is maybe a mathematical question, and then there's some visual support to that question, let's say, an image of a graph, and then the model has to respond to something related to that.

And why this is important … there has been a lot of work, of course, on large foundation models. Especially when it comes to reasoning tasks, like mathematical reasoning, a lot has focused more on written form.

HUIZINGA

Yeah …

GALLEY

So MathVista is one of the very first datasets that has input that is both images and text.

HUIZINGA

Yeah, yeah. Well, reading your paper, it seems like this is an area that hasn't been studied systematically. In fact, you actually say that! And say that the field is largely unexplored. But quickly tell us what has been done in this field, and then tell us how your research addresses the proverbial gap in the literature.

GALLEY

Well, there has been a lot of work on vision and language in other problems, like not just about reasoning. Maybe let me just mention why reasoning is important. So one reason I think it's very interesting to evaluate these large language models in terms of reasoning skill is that we evaluate their capabilities beyond just memorization. So as many of your listeners probably know, these large foundation models are trained on large amounts of text that is public data from various sources. So when you ask a question to a large foundation model, it could be the case, in many cases, that it has just memorized things it has seen in the data.

So what makes it interesting in terms of reasoning, the answer oftentimes is not there in the data. So it needs to develop this ability to connect the dots between various pieces of information to come up with a new answer. So the focus of our paper is really on mathematical reasoning, but it goes also a bit beyond that because the data also includes science questions and so on.

HUIZINGA

Yeah …

GALLEY

So this reasoning part has largely focused, until MathVista, on text-only modalities. So it's one of the very first ones that combines text and images in terms of evaluating these large foundation models. So you ask about what was done before. So, yes, there has been a lot of work, text only, on reasoning, for example, the mathematical question that's just based on text. And there has been a different stream of work that was much more focused on vision. A lot of work has been on tasks such as visual question answering …

HUIZINGA

Yeah …

GALLEY

… where basically, you have an image and the task is to answer a question about this image. So, yes, we’re trying to fuse the two lines of research here.

HUIZINGA

Right …

GALLEY

And that's one of the first works that does that.

HUIZINGA

Yeah. Well, let's talk about your methodology for a minute. Tell us how you went about conducting this research, and what methods did you use?

GALLEY

Yes, sure. So that's a bit different from a typical, kind of, machine learning paper because the focus of this work is really on benchmarking, on the dataset. So the methodology is more about how we collect the data, process it. So there are two components to doing that.

One was to look at existing data that already combines vision and text. And there are existing datasets that are actually already fairly big but that were not focused on reasoning. So we use those existing datasets and look for instances in the data that actually include some mathematical or science reasoning. And so that part is leveraging existing datasets, but the important part is, like, we really want to carve out the interesting pieces in terms of reasoning. And we had different stages of processing the data to identify the subset that was reasoning-based. So one first step was basically to apply some automatic filter to determine whether or not a given example, let's say something that is visual and text, is actually … involves some mathematical reasoning. So we have different strategies. For example, if the answer is numerical, it's likely that it might be something mathematically related. But that's just the first stage. And the second stage, we actually had humans, annotators, just certify that the selected data is actually of high quality. So we do have an example of, “Oh, this is mathematical, and that's either mathematical or scientific,” and so on.

And that's one part of the effort. The other part is that we realized while we collected the data, there are certain types of mathematical reasoning or related to mathematical reasoning that were not represented in the data. So we created three new datasets as part of MathVista. So when I said dataset, it's more like, think of MathVista as like an aggregate of different types of data, and we added three of them, three new types of data. One is what we call PaperQA, which is basically data that is collected from scientific papers on arXiv, and that had questions asking about that paper and that included some visual components from the paper, typically a plot or a figure.
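As a rough illustration of the first-stage automatic filter described above (a numerical answer suggests the example may be math-related), here is a minimal Python sketch. The record layout, the function names, and the regular expression are assumptions made for illustration; the transcript does not describe MathVista's actual filtering code, and the human annotation stage is not modeled here.

```python
import re

# Hypothetical pattern: an optional sign, an integer or decimal, an optional percent sign.
_NUMERIC = re.compile(r"^[-+]?(\d+(\.\d+)?|\.\d+)(\s*%)?$")


def looks_numerical(answer: str) -> bool:
    """Return True if the answer string is a plain number (optionally a percentage)."""
    cleaned = answer.strip().replace(",", "")  # tolerate thousands separators
    return bool(_NUMERIC.match(cleaned))


def first_stage_filter(examples):
    """Keep examples whose answer is numerical; annotators would review these next."""
    return [ex for ex in examples if looks_numerical(ex["answer"])]


# Toy records with an assumed {"question", "answer"} layout.
examples = [
    {"question": "What is the value of the tallest bar?", "answer": "42"},
    {"question": "What color is the car?", "answer": "red"},
    {"question": "What is the slope of the line?", "answer": "-1.5"},
]
candidates = first_stage_filter(examples)
```

A heuristic like this only narrows the pool of candidates; as Galley notes, the selected data was then certified by human annotators.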

HUIZINGA

Yeah …

GALLEY

And then we had IQTest, which is basically, I mean, it's vaguely related mathematically, but basically it also, kind of, tried to see maybe more abstract thinking about maybe some input that is both text and visual. And the final one is FunctionQA, that is basically algebraic reasoning and function plots and so on.

HUIZINGA

OK …

GALLEY

The important part was actually to identify among vast amounts of data what is actually very interesting in terms of mathematical reasoning.

HUIZINGA

Yeah …

GALLEY

So that part, I think, was quite a big part of doing that work—finding existing data but also creating new data.

HUIZINGA

Yeah, yeah. Well, my favorite part of a research paper is where it says, “and what we found was … ,” so talk a little bit about your results. What did you find?

GALLEY

So we evaluated a wide variety of models, including GPT-4, Claude 2, GPT-4V, multimodal Bard, and LLaVA, and we categorized them into three categories. So one is text only. So, basically, you take a model that is by default just text, and we give it the text part of the question and ask it to answer the question. Of course, that's, kind of, a bit of a, it’s a difficult task because oftentimes [LAUGHTER] we crucially build these questions so that you have to rely on the vision part. But that's for, you know, scientific investigation to know how well they can do, and so that's one category of model.

A different category is still text only but that is given the text detected in the image. So on the image, we do OCR. So we convert those words from images to text. It’s kind of an extension of the text-based model, except that what was images is translated into text, and then the input to the model is words only, and that's a different category of model. And the third one is basically truly multimodal models.

And what we found, I mean, not surprisingly, it’s, kind of, the one that was doing most poorly is the one that is text only. The second is text plus OCR. And then finally, the one that does best is the multimodal like GPT-4V. But while the ordering between these three categories makes sense, it was a bit surprising that maybe the gap between multimodal and text plus OCR was not bigger. Well, it’s big, but maybe not as big as we were expecting. So, for example, the best model using text detected in the images achieved like 35 percent accuracy while GPT-4V was 50 percent. So it's a substantial gap but not huge.
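The three model categories described above differ mainly in what input the model receives. The Python sketch below shows one way those inputs could be assembled for a single example. The `Example` fields, the setup names, and the prompt wording are all hypothetical, and the OCR text is assumed to have been extracted beforehand; none of this is taken from the MathVista evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str    # text part of the problem
    image_path: str  # path to the accompanying image (hypothetical field)
    ocr_text: str    # text detected in the image, assumed precomputed by an OCR tool


def build_input(example: Example, setup: str) -> dict:
    """Assemble the model input for one of the three evaluation setups."""
    if setup == "text_only":
        # The model sees only the question; the image is dropped entirely.
        return {"text": example.question, "image": None}
    if setup == "text_plus_ocr":
        # Still text only, but the words detected in the image are appended.
        text = f"{example.question}\nText detected in the image: {example.ocr_text}"
        return {"text": text, "image": None}
    if setup == "multimodal":
        # The model receives both the question and the image itself.
        return {"text": example.question, "image": example.image_path}
    raise ValueError(f"unknown setup: {setup}")


ex = Example(
    question="What is the highest value on the y-axis?",
    image_path="chart.png",
    ocr_text="0 10 20 30 Sales by year",
)
inputs = {s: build_input(ex, s) for s in ("text_only", "text_plus_ocr", "multimodal")}
```

With the figures quoted in the episode, roughly 35 percent for the best text-plus-OCR model and 50 percent for GPT-4V, the gap between the second and third setups is about 15 percentage points.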

HUIZINGA

Right. Just to clarify, you're saying OCR. What does that stand for?

GALLEY

[Optical] character recognition.

HUIZINGA

Gotcha.

GALLEY

So, basically, it's the task of taking text, sometimes typed, but sometimes written, and converting this into the actual text like you would have in a text file.

HUIZINGA

Right. Michel, does any of this have to do with the difficulty of the math problems that you present these models with? I mean, it seems to me, similar to humans, that the easier the problem, the easier it would be for the machine. So at what level of math are we talking for these tests?

GALLEY

What's nice about MathVista is there's a continuum [of] different difficulties. So the spectrum is quite broad, going from elementary school to more advanced concepts such as calculus. So it's quite broad. So in the paper, we do have this, kind of, broken down by level. So the number I gave you, like 50 percent, is an aggregate over all the difficulties. But …

HUIZINGA

Gotcha.

GALLEY

But the goal there was really, kind of, to compare different models, but we do have a fair amount of analysis in the appendix. Actually, we have 100 pages of appendices with plenty of analysis and so on. So if people, I mean …
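The distinction between an aggregate accuracy and a breakdown by difficulty level can be made concrete with a small Python sketch. The level labels and the records below are toy placeholders chosen for illustration, not MathVista results, and the function name is hypothetical.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Return (per-level accuracy, overall accuracy) from (level, is_correct) pairs."""
    if not records:
        raise ValueError("no records to score")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    per_level = {level: correct[level] / totals[level] for level in totals}
    # The aggregate pools every example, so larger levels weigh more.
    overall = sum(correct.values()) / sum(totals.values())
    return per_level, overall


# Toy placeholder records, not real benchmark results.
records = [
    ("elementary school", True),
    ("elementary school", True),
    ("high school", True),
    ("high school", False),
    ("college", False),
    ("college", False),
]
per_level, overall = accuracy_by_level(records)
```

In this toy case the aggregate is 0.5 even though the per-level accuracies range from 0.0 to 1.0, which is why a single headline number can hide how a model does at each level.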

HUIZINGA

I saw that. I saw the length of the paper, and I'm going, what? [LAUGHS] That’s a LONG paper! Well, research in the lab is one thing, I always like to say, but understanding real-world impact is important, too. So where's this work going to make the most difference, and who does it help most at this point?

GALLEY

Well, I think perhaps the main point of this kind of line of work in terms of reasoning is that when looking at these difficult problems that are mathematical, actually it's a way to, kind of, abstract away maybe more complex capabilities, and I think while thinking just about mathematics might seem a bit narrow, I don't think it really is. It's more about seeing whether this model has the ability to do, kind of, multistep kind of processing of your input and think maybe somewhat intelligently about a given problem. So we focus mostly on math. There is some science, but we would be very interested, especially in future work, to, kind of, go beyond that.

HUIZINGA

OK, well, let me press in a little bit there because … just say I'm a regular person using a GPT model. Is your work more addressed upstream from that to the research community to say, how do we get these models to be better so that downstream people like me can be more confident of the models?

GALLEY

Yes, I would say at the moment, I mean, this line of work is perhaps more geared towards the research community, but I think it could be some seed for researchers to think about some applications perhaps that also require some kind of step-by-step reasoning but perhaps not going beyond math.

HUIZINGA

Yeah. Michel, if there was one thing you wanted our listeners to take away from this research, kind of golden nugget, what would it be?

GALLEY

Well, I would say it’s the challenging part of these datasets. I think that's what makes MathVista stand out compared to other datasets. By now, there are a few other vision and language datasets, and of course, many that are more text-based. And we've seen, for example, some recent papers showing that actually MathVista remains one of the most challenging ones. So I think it's probably going to stay around for a while because of the difficulty it represents. So it's an open-source, available dataset that everybody can use, and I very much encourage people to use it.

HUIZINGA

Is it on GitHub?

GALLEY

Yes, it's on GitHub.

HUIZINGA

So what's next on the research agenda for helping LLMs get better at math, Michel? What are the big challenges in the field yet? I mean, you've alluded to many of them already, sort of, but what's next on your research agenda?

GALLEY

Well, I would say what we found so far is these models are very good at processing the textual part of the problems they're given, but the equivalent in images is actually harder somehow. So I think a lot more work needs to be done in terms of vision capabilities, in terms of reasoning over images, because the capabilities you will see in text are actually quite advanced, whereas the equivalent in images doesn't seem that good. I mean, a fair disclaimer: my background is more on the text side, [LAUGHTER] so some of my colleagues on the paper are more on the vision side, so maybe if listeners run into some of our coauthors at the conference, they might want to talk to these vision people because that's less of my background. [LAUGHS]

HUIZINGA

Well, and if you think about Venn diagrams, you know, you've got people that are doing text, people that are doing vision, and then the people that are trying to do both to see how the worlds collide.

[MUSIC]

HUIZINGA

Well, Michel Galley, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv. You can also read it on the website for the International Conference on Learning Representations, or ICLR. And if you happen to be at the ICLR conference this week, you can hear more about it there. See you next time on Abstracts!

[MUSIC FADES]
