Abstracts: NeurIPS 2024 with Weizhu Chen

Dec 06, 2024, 8 min

Episode description

Next-token prediction trains a language model on all tokens in a sequence. VP Weizhu Chen discusses his team’s 2024 NeurIPS paper on how distinguishing between useful and “noisy” tokens in pretraining can improve token efficiency and model performance.

Transcript

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. [MUSIC FADES] Our guest today is Weizhu Chen. He is vice president of Microsoft GenAI and coauthor of a paper called “Not All Tokens Are What You Need for Pretraining.” This paper is an oral presentation during the 38th annual Conference on Neural Information Processing Systems, also known as NeurIPS, which is happening this week in Vancouver. Weizhu, thank you for joining us today on Abstracts!

WEIZHU CHEN

Thank you for having me, Amber.

TINGLE

So let's start with a brief overview of your paper. In a couple sentences, tell us about the problem your research addresses and, more importantly, why the research community and beyond should know about this work.

CHEN

So my team basically in Microsoft GenAI, we are working on model training. So one of the things actually we do in the pretraining, we realize the importance of the data. And we found that actually when we do this kind of data for each of the tokens, some token is more important than the other. That's one. The other one actually is some token actually is very, very hard to be predicted during the pretraining. So, for example, just like if someone see the text of “Weizhu,” and what's the next token? It can be “Chen”; it can be any of the last name. So it's very hard to be predicted. And if we try to enforce a language model to focus on this, kind of, the hard-to-predict token, just like actually it's going to confuse the language model. There are so many different kinds of the example like this. Just like, for example, the serial number in your UPS. So the focus of this paper is try to identify which token actually is more important for the language model to learn. And actually the other token maybe is just the noise. And how can we try to discriminate the token—which is good token, which is noise token. Basically, you try to understand this kind of dynamic of the tokens.

TINGLE

How did you conduct this research?

CHEN

Actually we do a lot of work in the model training, including the pretraining and the post-training. So for the pretraining side, actually the most important thing to us is the data. We also try to understand, how can we leverage the existing data, and how can we create much more data, as well? And data basically is one of the most important thing to build a better foundation model. So we try to understand how much more we can get from the data. And the important thing for the data is about data filtering. So you think about actually in the previous literature, we do the data filtering, for example, just like we build a classifier to classify, OK, this page is more important than the other. And this page actually is a noise because there's so much noise data in the web. So we just keep the best data to get into the pretraining corpus. And further away, we think about, OK, yeah, so this is … maybe it's not fine grain enough, so can we try to understand even for the same page we want to keep? So some token is more important than the other. Maybe some token just some noise token. Actually you put this data into the pretraining, it's going to hurt the model quality. So there is the motivation actually we try to think about.

TINGLE

And what were your major findings?

CHEN

Our major finding is about basically, definitely this works so well. And it's so important that actually we are able to get the best token from the corpus and then make it available and try to ask the model during the pretraining to ignore the token we don't want to get into the model itself. So that is one. The second thing definitely data is the other very important thing. If you're able to figure out the better way to build a better data is most likely you’re able to build a much better foundation model. The third thing actually is also connected to a lot of other existing work, just like data synthesis, just like distillation, just like data filtering, and so a lot of things are really connected together. And actually, this work, basically, you can associate with also a lot of other work we are working on, just like distillation. You can think about, for example, for this work, we also try to build a model, a reference model—we call as the reference model—to try to identify actually this data, this token, is more important than the other and try to understand the discrepancy between the reference model and the running model, their prediction on each tokens. So you can think about also it's some kind of the try to distill from the reference model to the existing model, as well.
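
The idea Chen describes here (score each token by the discrepancy between a reference model and the model being trained, then train only on the tokens that matter) can be illustrated with a small sketch. It is not the paper's implementation. The conversation does not specify how the discrepancy is measured or how many tokens are kept, so the excess-loss score, the top-k selection, the `keep_ratio` parameter, and all function names and toy probabilities below are assumptions made for illustration.

```python
import math

def token_losses(target_probs):
    """Per-token cross-entropy: -log of the probability given to the true next token."""
    return [-math.log(p) for p in target_probs]

def select_tokens(train_losses, ref_losses, keep_ratio):
    """Return a 0/1 mask keeping the tokens with the largest excess loss.

    Excess loss = training-model loss minus reference-model loss (an assumed
    way to measure the discrepancy between the two models' predictions).
    """
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, int(keep_ratio * len(excess)))  # number of tokens to keep
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(excess))]

def selective_loss(train_losses, mask):
    """Average the training loss over the selected tokens only."""
    kept = [loss for loss, m in zip(train_losses, mask) if m]
    return sum(kept) / len(kept)

# Toy probabilities each model assigns to the true next token (hypothetical values).
# Token 2 is hard for both models, like a surname after a first name,
# so its excess loss is zero and it is ignored.
train_probs = [0.10, 0.50, 0.02, 0.60]
ref_probs = [0.60, 0.55, 0.02, 0.65]

train_l = token_losses(train_probs)
ref_l = token_losses(ref_probs)
mask = select_tokens(train_l, ref_l, keep_ratio=0.5)
loss = selective_loss(train_l, mask)
print(mask, loss)
```

In this toy setup, a token that both models predict poorly contributes nothing to the loss, while a token the reference model predicts well but the training model does not is kept. In a real training loop the same mask would be applied to the per-token loss tensor before averaging.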

TINGLE

Let's talk a little bit about real-world impact. Who benefits most from this work? And how significant is this within your discipline and even downstream for people using applications?

CHEN

This actually is very, very fundamental work because just like I share a little bit before, actually we build the data and this data is—build the data much better—is able to build a much better foundation model. If we're able to build a better model actually is able to benefit so many different kinds of application. This also is going to help us to build a much better small language model. And we can also serve this model even in the edge side, in the client side, in the coding scenario. So we are going to see actually huge impact from this kind of the foundation model if you are able to benefit from building much better training data.

TINGLE

Are there any unanswered questions or unsolved problems in this area? What's next on your research agenda?

CHEN

Yeah, I think that is a very good questions. And definitely there's a lot of things about how to build a better data [that] is unsolved yet in the literature. And especially because when you do the pretraining, the most important part is the data, but the data is very limited. And how can we make better use from the existing limited data is a big challenge. Because we can increase the model by 10x, but it’s super hard to increase the data by 10x, especially when we want to deal with the high quality of data. The other way, even given the data, how can you identify, especially for this work, the importance of each token to build a much better model? I think all these things are very connected together. To me, actually, data is the oxygen. So there are still so many things we are able to do in the data, including building for even the small language model or the large model.

TINGLE

Data is oxygen—I love that! So other than that being a key takeaway, is there any other one thing that you'd like our listeners to walk away from this conversation knowing?

CHEN

I would love to say actually focus more on this kind of data and focus more about how can I get more from the data actually; it is the very important thing. And the other thing actually, we are working on something that's very exciting. You can feel free to come to join us if you are very interested in this area.

[MUSIC]

TINGLE

Well, Weizhu Chen, thank you for joining us today. We really appreciate it.

CHEN

Thank you. Thank you for having me.

TINGLE

And thanks to our listeners for tuning in. If you’d like to read the full paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts!

[MUSIC FADES]

Transcript source: Provided by creator in RSS feed