Abstracts: NeurIPS 2024 with Weizhu Chen

00:02

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. [MUSIC FADES] Our guest today is Weizhu Chen. He is vice president of Microsoft GenAI and coauthor

00:34

of a paper called “Not All Tokens Are What You Need for Pretraining.” This paper is an oral presentation during the 38th annual Conference on Neural Information Processing Systems, also known as NeurIPS, which is happening this week in Vancouver. Weizhu, thank you for joining us today on Abstracts!

WEIZHU CHEN

00:55

Thank you for having me, Amber.

TINGLE

00:57

So let's start with a brief overview of your paper. In a couple sentences, tell us about the problem your research addresses and, more importantly, why the research community and beyond should know about this work.

CHEN

01:11

So my team basically in Microsoft GenAI, we are working on model training. So one of the things actually we do in the pretraining, we realize the importance of the data. And we found that actually when we do this kind of data for each of the tokens, some token is more important than the other. That's one. The other one actually is some token actually is very, very hard to be predicted during the pretraining. So, for example, just like if someone see the text

01:40

of “Weizhu,” and what's the next token? It can be “Chen”; it can be any of the last name. So it's very hard to be predicted. And if we try to enforce a language model to focus on this, kind of, the hard-to-predict token, just like actually it's going to confuse the language model. There are so many different kinds of the example like this. Just like, for example, the serial number in your UPS. So the focus of this paper is try to identify which token

02:04

actually is more important for the language model to learn. And actually the other token maybe is just the noise. And how can we try to discriminate the token—which is good token, which is noise token. Basically, you try to understand this kind of dynamic of the tokens.

TINGLE

02:20

How did you conduct this research?

CHEN

02:23

Actually we do a lot of work in the model training, including the pretraining and the post-training. So for the pretraining side, actually the most important thing to us is the data. We also try to understand, how can we leverage the existing data, and how can we create much more data, as well? And data basically is one of the most important things to build a better foundation model. So we try to understand how much more we can get from the data. And the important

02:56

thing for the data is about data filtering. So you think about actually in the previous literature, we do the data filtering, for example, just like we build a classifier to classify, OK, this page is more important than the other. And this page actually is a noise because there's so much noise data in the web. So we just keep the best data to get into the pretraining corpus. And further away, we think about, OK, yeah, so this is … maybe it's not fine grain enough, so can we

03:26

try to understand even for the same page we want to keep? So some token is more important than the other. Maybe some token just some noise token. Actually you put this data into the pretraining, it's going to hurt the model quality. So there is the motivation actually we try to think about.

TINGLE

03:43

And what were your major findings?

CHEN

03:46

Our major finding is about basically, definitely this works so well. And it's so important that actually we are able to get the best token from the corpus and then make it available and try to ask the model during the pretraining to ignore the token we don't want to get into the model itself. So that is one. The second thing definitely data is the other very important thing. If you're able to figure out the better way to build a better data is

04:18

most likely you’re able to build a much better foundation model. The third thing actually is also connected to a lot of other existing work, just like data synthesis, just like distillation, just like data filtering, and so a lot of things are really connected together. And actually, this work, basically, you can associate with also a lot of other work we are working on, just like

04:43

distillation. You can think about, for example, for this work, we also try to build a model, a reference model—we call it the reference model—to try to identify actually this data, this token, is more important than the other and try to understand the discrepancy between the reference model and the running model, their prediction on each token. So you can think about also it's some kind of the try to distill from the reference model to the existing model, as well.
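
The idea Chen describes here, scoring each token by the gap between the reference model's prediction and the training model's prediction, can be illustrated with a small sketch. This is not the paper's implementation. It assumes the gap is measured as the difference in per-token loss, and that only the highest-scoring share of tokens contributes to the training loss. The function names and the `keep_fraction` parameter are hypothetical, chosen for illustration.

```python
from typing import List


def select_tokens(
    train_losses: List[float],
    ref_losses: List[float],
    keep_fraction: float = 0.6,  # hypothetical value, not taken from the episode
) -> List[bool]:
    """Return a mask marking which tokens to keep in the training loss.

    Assumption: a token's score is its loss under the model being trained
    minus its loss under the reference model. A large gap suggests the token
    is learnable (the reference model handles it) but not yet learned.
    """
    if len(train_losses) != len(ref_losses):
        raise ValueError("loss lists must have the same length")
    if not train_losses:
        return []
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    # Keep at least one token so the loss is always defined.
    k = max(1, int(len(excess) * keep_fraction))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(excess))]


def selective_loss(train_losses: List[float], mask: List[bool]) -> float:
    """Average the training loss over the kept tokens only."""
    kept = [loss for loss, keep in zip(train_losses, mask) if keep]
    return sum(kept) / len(kept)


# Toy per-token losses. The second token is hard for both models, like
# guessing the surname after "Weizhu", so its gap is small and it is dropped.
train = [2.0, 5.0, 1.0, 4.0]
ref = [1.9, 4.9, 0.2, 1.0]
mask = select_tokens(train, ref, keep_fraction=0.5)
loss = selective_loss(train, mask)
```

In this toy example the token with the highest raw loss is excluded, because the reference model finds it just as hard. That matches the point made earlier in the conversation that forcing a model to focus on inherently hard-to-predict tokens may only confuse it.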

TINGLE

05:17

Let's talk a little bit about real-world impact. Who benefits most from this work? And how significant is this within your discipline and even downstream for people using applications?

CHEN

05:29

This actually is very, very fundamental work because just like I share a little bit before, actually we build the data and this data is—build the data much better—is able to build a much better foundation model. If we're able to build a better model actually is able to benefit so many different kinds of application. This also is going to help us to build a much better small language model. And we can also serve this model even in the edge side, in the client side, in the

05:59

coding scenario. So we are going to see actually huge impact from this kind of the foundation model if you are able to benefit from building much better training data.

TINGLE

06:10

Are there any unanswered questions or unsolved problems in this area? What's next on your research agenda?

CHEN

06:18

Yeah, I think that is a very good question. And definitely there's a lot of things about how to build a better data [that] is unsolved yet in the literature. And especially because when you do the pretraining, the most important part is the data, but the data is very limited. And how can we

06:42

make better use from the existing limited data is a big challenge. Because we can increase the model by 10x, but it’s super hard to increase the data by 10x, especially when we want to deal with the high quality of data. The other way, even given the data, how can you identify, especially for this work, the importance of each token to build a much better model? I think all these things are

07:08

very connected together. To me, actually, data is the oxygen. So there are still so many things we are able to do in the data, including building for even the small language model or the large model.

TINGLE

07:20

Data is oxygen—I love that! So other than that being a key takeaway, is there any other one thing that you'd like our listeners to walk away from this conversation knowing?

CHEN

07:32

I would love to say actually focus more on this kind of data and focus more about how can I get more from the data actually; it is the very important thing. And the other thing actually, we are working on something that's very exciting. You can feel free to come to join us if you are very interested in this area.

07:54

[MUSIC]

TINGLE

07:54

Well, Weizhu Chen, thank you for joining us today. We really appreciate it.

CHEN

07:58

Thank you. Thank you for having me.

TINGLE

08:00

And thanks to our listeners for tuning in. If you’d like to read the full paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts!

08:27

[MUSIC FADES]
