Abstracts: July 18, 2024

Jul 18, 2024, 12 min

Episode description

Senior Researcher Arindam Mitra introduces AgentInstruct. Using raw data sources, the automated multi-agent framework can create diverse, high-quality synthetic data at scale for the post-training of small and large language models.

Read the paper

Transcript

GRETCHEN HUIZINGA

Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers.

[MUSIC FADES]

GRETCHEN HUIZINGA

I'm here today with Dr. Arindam Mitra, a senior researcher at Microsoft Research and the lead researcher for Microsoft's Orca project. Dr. Mitra is coauthor of a paper called “AgentInstruct: Toward Generative Teaching with Agentic Flows.” Arindam, it's a pleasure to have you on Abstracts today.

ARINDAM MITRA

Thank you, Gretchen.

HUIZINGA

So let's start with a brief overview of your paper. What problem does your research address, and why does it matter?

MITRA

So the post-training phase is very important for language models. You can really improve the model a lot by creating high-quality synthetic data. The problem is, however, though, high-quality synthetic data creation requires lots of human effort and expertise. The problem that we're trying to tackle is, how do you reduce human effort? How can you create high-quality data with really low amount of human effort?

When you have a language model and, let's say, you want to apply it somewhere, you might have to train a generic model before. Which could be small or big. Doesn’t matter. After that, you can specialize it on the domain that you are looking for, and when you want to do that, to make it really fast, this particular process, it's best if you go for synthetic data. If you have a way to, actually, generate very high-quality synthetic data, you can fast-track this part of specialization process.

Not only single model. So this year, you're going to see a lot more multi-agent models. And when you are trying to build these multi-agent models, you're fearing like, OK, it might increase the cost too much, the latency too much. So it's also very much important that you have a multi-agent system and you can, sort of, replace some of those agents with specialized small models. And when you're trying to address these goals, you want this process to be something which you know works fast. So that's why we are trying to make sure we have a very good way to create synthetic data for your specific need.

HUIZINGA

No research exists in a vacuum, and most of it fills some kind of a gap. So tell us what's already been done in this field and how this work is building on it.

MITRA

So previously, actually, we have seen that in post-training, the more data you have, the better the performance goes for the model you're training. So what we wanted to test is how much we can scale and what happens if we scale a lot and lot. But we didn't have the tools for it. So the other approaches people previously used was you had a small set of data and how do we expand this dataset into much larger and larger amount of data. That's where people were mostly focusing. But it's not that easy to create that initial seed set. [LAUGHTER] You need to be very expert.

The way that we're doing is, actually, rather you define what you want to create. Like, OK, you want to create tool-use data. So you say, OK, I have a bunch of tools, and I am looking for data in the scenarios where someone can just come give me a description and then maybe that person interact with the AI to figure out how to get the job done. It's not a one-step thing. And maybe you also have a setting where it's more like an app developer. You have a bunch of APIs in your phone. You just want to figure out which one is best for the user request, which came through voice command. So different scenarios could be there.

So what we're saying [is], OK, we are not going through the method where you have to come up with your initial own seed data and then we expand. It is more like you define what you want to do. It's much more abstract. And then, we are, sort of, automating the effort of data creation. So this setting actually of synthetic data creation, we are referring [to] it as generative teaching, and that's where we are, sort of, differing. So previously, it was more like expansion, and now we are trying from specification to the data that you need.

HUIZINGA

Gotcha. Well, talk a little bit more about your methodology and how you went about conducting this research.

MITRA

So first of all, what we are proposing actually is a multi-agent solution. So you start with first describing what you really need. So you describe in detail, like, I need data for this specific skill or this specific scenario. Then, what we do is like, OK, you have some unstructured data or raw data like text documents or code files that you gather from web with permissible license or use something that you own. We don't care much about what the content is really. So it's more like we got some random stuff, some random content. And then we'll guide you how to convert this random something which is not meaningful for you into something which is meaningful for your data creation.

For example, like, if you are creating data to teach how to use APIs, you might think about, you need lots of APIs and how do you get these APIs. So what we are saying is, like, we can take something like code and we'll have agents which will convert these raw code files into list of APIs which is more like a library. So you create automatically this input that is very meaningful for data creation.

And then once we have that, we have basically the seed instruction creation step based on your specification. Like, what do you want to create data for? So you have all these different scenarios, and we have multiple agents creating data for different scenarios. And then the last step is actually what we call refinement step. So it's more like whatever data you created, we’ll go through them and we’ll make them better and better: improve the quality, improve the complexity, improve the trickiness, we’ll teach when not to answer, etc., etc. So make sure we cover the whole space. So by changing the stochastic seed, we are trying to cover the entire possible data space.
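
[Editor's illustration] The three stages described above (turning raw content into usable input, creating seed instructions per scenario, then refining them) can be sketched roughly as follows. This is a minimal toy, not the paper's implementation: in AgentInstruct each stage is carried out by LLM-based agents, whereas here plain Python functions stand in for them. Every name (`transform_content`, `create_seed_instructions`, `refine`, `run_flow`), the prompt templates, and the list of refinement "twists" are hypothetical and exist only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instruction:
    scenario: str
    prompt: str


def transform_content(raw_code: str) -> List[str]:
    """Stage 1 (stand-in): turn a raw code file into a list of API names."""
    apis = []
    for line in raw_code.splitlines():
        line = line.strip()
        # Toy heuristic: treat every Python function definition as an "API".
        if line.startswith("def ") and "(" in line:
            apis.append(line[4:line.index("(")])
    return apis


def create_seed_instructions(apis: List[str], scenarios: List[str]) -> List[Instruction]:
    """Stage 2 (stand-in): one seed instruction per (API, scenario) pair."""
    return [
        Instruction(s, f"[{s}] Use `{api}` to complete the user's request.")
        for api in apis
        for s in scenarios
    ]


def refine(instr: Instruction, rng: random.Random) -> Instruction:
    """Stage 3 (stand-in): make a seed instruction harder or trickier."""
    twists = [
        "Add a constraint that makes the task harder.",
        "Leave one required argument ambiguous.",
        "Ask for something no listed API can do, so the right response is to decline.",
    ]
    return Instruction(instr.scenario, instr.prompt + " " + rng.choice(twists))


def run_flow(raw_code: str, scenarios: List[str], seed: int = 0) -> List[Instruction]:
    """Run the three stages; changing `seed` yields different refinements."""
    rng = random.Random(seed)
    apis = transform_content(raw_code)
    seeds = create_seed_instructions(apis, scenarios)
    return [refine(instr, rng) for instr in seeds]
```

Calling `run_flow(source_text, ["multi-turn", "voice-command"], seed=1)` on a file with two function definitions would produce four refined instructions, and a different `seed` would pick different refinements, loosely mirroring the point about varying the stochastic seed to cover more of the data space.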

HUIZINGA

Right.

MITRA

So that's the key thing. The way we, sort of, conducted this research is actually we defined 17 skills. Skills meaning reading comprehension, tool use, text modification, content creation, RAG (retrieval-augmented generation) ... we have, like, list of 17 skills … conversation … and then we created one multi-agent flow for each of the skills and we generate data.

So one key thing I want to highlight is, like, this work, compared to other work, it was not benchmark driven. We want to teach a skill. We don't care which benchmarks we're trying to evaluate it on. So we define the skill, like tool use means this to us, reading comprehension means this to us, text modification means this to us. And then we, sort of, generate the data to teach everything for that skill.

And then what we did, we created actually 22 million instructions. And we had previously in Orca series, we had 3 million, around, instructions. So the 25 million is what we, sort of, have at the end. And that's where we actually trained a Mistral model as of now. And we're going to measure, like, how much we improve the Mistral model by this post-training.

HUIZINGA

Moving from methods to findings, I always look forward to the part of the research paper that finishes the sentence “and what we found was … ,” so give us a quick overview of your results. What did you find?

MITRA

Yes, so the results were actually very exciting for us. So Mistral 7B was our main, sort of, baseline because that's where we’re trying to showcase, like, how much improvement we are getting. On the other side, we have, like, frontier models: ChatGPT, GPT-4. We want to also measure how far we are from those frontier models, so that's, sort of, our evaluation setup. So on average actually, we got like 20 percent performance gain over the Mistral, and we evaluated that across 14 benchmarks that test reasoning, content creation, instruction following, format following, etc.

But what was more important to us was to do a skill-specific evaluation because we are trying to teach certain skills, and we had, like, 17 skills as we mentioned earlier. So, for example, like, if you are focusing on reading comprehension as a skill, we took LSAT, SAT, and DROP, and many other benchmarks; we created a collection of reading comprehension-based benchmark. And there, we are observing, like, 20 percent improvement over Mistral, and what it means, like, we're actually achieving GPT-4–level performance.

Similarly, if I'm focusing on math skill, there are many datasets which test, like, elementary math, high school math, college-level math. And we improved actually across all these different levels of math. So we see from 40 percent to 150 percent of improvement on different benchmarks of math. So it was more like what we wanted to see. We're not optimizing for a particular benchmark. We wanted to optimize the skill, and that's what you're observing. So you're observing improvement in math across all these levels, from elementary to high school to college to middle school, etc., everything.

The same goes for RAG, as well. We’re observing on RAG skill 92 percent, around, improvement over Mistral. The format following numbers are pretty interesting to us. So format following is very important for SLMs (small language models). You want to make these models practical. You want to make sure that they follow the format so you can parse the result. And we were able to take Mistral beyond Gemini Pro. So that was a very strong performance from the post-training that we did. For summarization, actually we were able to reduce the hallucination rate by 31 percent while achieving the GPT-4–level quality. So overall, all these results were, sort of, highlighting that the methodology that we have, which we're calling AgentInstruct, is very promising.

HUIZINGA

I think it's important to get practical and talk about real-world impact. So tell us who you think this research will benefit most and why.

MITRA

Yeah, so again the model builders will, sort of, find it most beneficial. So the significance of our work actually lies in the way we are trying to revolutionize the language model development through scalable, low-effort synthetic creation. And the scalable and low effort is, sort of, the key thing, right. We have shown that we can create very high-quality data. That's what the numbers are telling us. We want to mention that this is very scalable and low effort, and that's what we think might help the most for model builders.

HUIZINGA

So, Arindam, let's borrow a phrase from the machine learning lexicon and go for a little one-shot learning here: if you had to boil down why your work is important, what's the one thing you want our listeners to take away from this research?

MITRA

The key takeaway would be, like, the AgentInstruct method enables the generation of vast, diverse, and high-quality synthetic data with very minimal human input. So that's one thing I would, like, to remember from this paper.

HUIZINGA

So as we close, talk briefly about the limitations that you encountered in this project and directions for future research. What are the outstanding challenges in this field, and what's on your research agenda to overcome them?

MITRA

Yes, so we're exploring further automation. But apart from making this data creation more automated and less human involvement needed, we're trying to focus on two other aspects. One is automated model debugging, and the other is automated model repairing. So now that we have the ability to generate data for a particular skill, let's say math, for model debugging, what we need is basically an error handler. Like something we can plug in which takes the question and the answer coming from a different model and verifies if the answer is correct or not. So that's the part we're working on right now, figuring out this error handler. And the second aspect is repairing. So once we have the error, we figure out, OK, this is where the model is struggling. How can we give feedback or how can we give more knowledge so it can basically correct those errors? So those are some things we're working on right now.

[MUSIC PLAYS]
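
[Editor's illustration] The debugging idea described above, a pluggable component that takes a question and another model's answer and checks whether the answer is correct, could look something like the following. This is only a sketch under stated assumptions: the speaker says this component is still being worked out, so nothing here reflects an actual design. `find_failures`, `toy_model`, and `arithmetic_verifier` are hypothetical names, and the "model" is a deliberately buggy toy function rather than a language model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Failure:
    question: str
    answer: str


def find_failures(
    questions: List[str],
    model: Callable[[str], str],
    verifier: Callable[[str, str], bool],
) -> List[Failure]:
    """Ask the model each question and keep the cases the verifier rejects."""
    failures = []
    for question in questions:
        answer = model(question)
        if not verifier(question, answer):
            failures.append(Failure(question, answer))
    return failures


def toy_model(question: str) -> str:
    """Hypothetical model under test: wrong whenever the first operand is >= 10."""
    a, b = map(int, question.split("+"))
    return str(a + b) if a < 10 else str(a + b + 1)


def arithmetic_verifier(question: str, answer: str) -> bool:
    """Hypothetical error handler for 'a+b' questions: recompute and compare."""
    a, b = map(int, question.split("+"))
    return answer.strip() == str(a + b)
```

Running `find_failures(["2+3", "12+5"], toy_model, arithmetic_verifier)` would flag only the second question. The collected failures are the kind of signal the repair step would then use to decide what feedback or extra data the model needs.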

HUIZINGA

Well, Arindam Mitra, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find a preprint on arXiv. See you next time on Abstracts!

[MUSIC FADES]

Transcript source: Provided by creator in RSS feed