Abstracts: July 29, 2024

GRETCHEN HUIZINGA

00:02

Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers.

00:25

[MUSIC FADES]

GRETCHEN HUIZINGA

00:25

My guest today is Dr. Li Lyna Zhang, a senior researcher at Microsoft Research. Dr. Zhang is coauthor of a paper called “LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens.” This paper was featured at this year's International Conference on Machine Learning, or ICML. Li, thanks so much for joining us today on Abstracts!

LI LYNA ZHANG

00:47

Thank you for having me.

HUIZINGA

00:49

So let's start with a brief overview of your paper. Tell us about the issue your research addresses and why it matters.

ZHANG

00:57

OK, so this paper is about how to effectively extend the context window of large language models beyond 2 million tokens. Why is this important? Because enabling longer input contexts can improve LLM capabilities. Right now, some LLMs can only handle a limited context window of 4K tokens, which is about 10 pages in a book. With our method, we can push the LLM context window to over 2 million tokens. That means you can put all seven Harry Potter books into

01:36

the LLM and ask any question about this story! Another important thing is that our method is super efficient. It requires minimal changes to the LLM architectures, and most existing optimizations can be reused. Therefore, our method can be easily applied in real production.

HUIZINGA

01:59

So it sounds like what you're working on is improving the memory span of artificial intelligence or large language models. So what's already been done in this field, and what unique contributions does your work bring?

ZHANG

02:12

Well, there has been a lot of work in building long-context LLMs. For example, pretraining with an efficient model architecture, using RAG (retrieval-augmented generation), and extending the context window with RoPE positional interpolation. Our approach uses the last technique. Let me briefly explain it. RoPE stands for rotary positional embedding, which encodes token position information for transformer models. When we pretrain an LLM, we

02:46

set a context window size, and all token positions have a predefined range of RoPE values. Extending to a longer context window introduces new token positions that can be out of this predefined range, thus leading to out-of-distribution issues and making fine-tuning difficult. RoPE positional interpolation solves this by downscaling positional embeddings to fit within the pretrained range. However, positional embeddings like RoPE exhibit non-uniform information entropy

03:25

in transformer models. Existing approaches do not effectively handle these non-uniformities during RoPE interpolation, leading to information loss and limiting the context window size. Our method addresses this challenge; therefore, it can achieve the longest context window size.
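
The out-of-range problem and the uniform interpolation fix described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `rope_angles`, the head dimension of 64, the base of 10000, and the 16K target length are all assumptions chosen for the example.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for each RoPE frequency pair at one token position.

    scale > 1 applies uniform positional interpolation: the position is
    divided by scale before the angles are computed.
    """
    return [
        (position / scale) * base ** (-2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


pretrained_len = 4096   # 4K context window, as mentioned in the transcript
target_len = 16384      # hypothetical extended window, for illustration only
scale = target_len / pretrained_len

last = target_len - 1   # last token position in the extended window
extrapolated = rope_angles(last, head_dim=64)               # no interpolation
interpolated = rope_angles(last, head_dim=64, scale=scale)  # uniform interpolation

print(max(extrapolated))  # well beyond any position seen in pretraining
print(max(interpolated))  # squeezed back below pretrained_len
```

Uniform interpolation divides every dimension by the same factor, which is exactly the behavior the non-uniformity argument above questions.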

HUIZINGA

03:46

OK, so, Li, how would you describe the methodology you used for this work, and how did you go about conducting the research?

ZHANG

03:54

OK. So our method is to interpolate the RoPE positional embedding. It has three main steps. First, we introduce an efficient evolution search algorithm to perform non-uniform RoPE positional interpolation. Second, we propose a progressive context window extension strategy. It begins by searching for a 256K length on the pretrained LLM and fine-tuning it at this length. Then, based on the fine-tuned 256K LLM, we did a second search for new RoPE

04:33

interpolations to achieve a 2048K context window size. Finally, since a long-context LLM loses performance at its original context window, we readjusted the non-uniform positional interpolation at a 4K length to recover the short-context-window performance.
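
The first step, searching for per-dimension interpolation factors, can be illustrated with a generic evolutionary loop. This is a toy sketch under stated assumptions, not the algorithm from the paper: `evolution_search`, its parameters, and the stand-in objective `toy_score` are hypothetical. A real search would presumably score each candidate by evaluating the model itself, which this sketch does not attempt.

```python
import random


def evolution_search(evaluate, num_dims, max_scale,
                     pop_size=16, generations=30, seed=0):
    """Toy evolutionary search over per-dimension rescale factors.

    evaluate(candidate) returns a score where lower is better.
    """
    rng = random.Random(seed)
    # Random candidates plus the uniform-interpolation baseline.
    population = [
        [rng.uniform(1.0, max_scale) for _ in range(num_dims)]
        for _ in range(pop_size - 1)
    ]
    population.append([float(max_scale)] * num_dims)

    for _ in range(generations):
        population.sort(key=evaluate)
        parents = population[: pop_size // 4]  # keep the best quarter unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: take each factor from one of the two parents.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Mutation: small multiplicative noise, clipped to the valid range.
            child = [min(max_scale, max(1.0, v * rng.gauss(1.0, 0.05)))
                     for v in child]
            children.append(child)
        population = parents + children

    return min(population, key=evaluate)


# Hypothetical stand-in objective: distance to an arbitrary target profile.
target = [1.0 + 7.0 * i / 31 for i in range(32)]


def toy_score(factors):
    return sum((f - t) ** 2 for f, t in zip(factors, target)) / len(target)


best = evolution_search(toy_score, num_dims=32, max_scale=8.0)
uniform = [8.0] * 32
print(toy_score(best) <= toy_score(uniform))
```

Because the best candidates are carried over unchanged each generation, the result can never score worse than the uniform baseline that seeds the population.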

HUIZINGA

04:52

Let's talk about findings. Tell us how things worked out for you and what you found as a result of your experiments.

ZHANG

04:59

Yeah. Our study verified two important non-uniformities in LLM context window extension. We identified that lower RoPE dimensions and initial token positions require less interpolation because they contain crucial and high-frequency information. Higher RoPE dimensions require more interpolation because they carry sparse, low-frequency information.
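
Both non-uniformities can be expressed in a small sketch: each RoPE dimension gets its own interpolation factor, and the first few token positions are left uninterpolated. The names `nonuniform_rope_angles`, `factors`, and `keep_first`, and the linear ramp from 1 to 8, are assumptions made for illustration; in the work described here the factors come from a search, not from a fixed ramp.

```python
def nonuniform_rope_angles(position, head_dim, factors,
                           keep_first=0, base=10000.0):
    """RoPE angles with a separate interpolation factor per dimension.

    Positions below keep_first are left uninterpolated.
    """
    angles = []
    for i in range(head_dim // 2):
        angle = position * base ** (-2 * i / head_dim)
        if position >= keep_first:
            angle /= factors[i]  # larger factor means more interpolation
        angles.append(angle)
    return angles


head_dim = 64
pairs = head_dim // 2
# Hypothetical factors: none for the lowest dimension, 8x for the highest.
factors = [1.0 + 7.0 * i / (pairs - 1) for i in range(pairs)]

early = nonuniform_rope_angles(2, head_dim, factors, keep_first=4)
late = nonuniform_rope_angles(1000, head_dim, factors, keep_first=4)

print(late[0])   # lowest dimension: factor 1.0, angle unchanged
print(late[-1])  # highest dimension: angle divided by 8
```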

HUIZINGA

05:27

So work in the lab is always interesting, but deployment in real-world settings is often another story. If everything is successful, Li, who benefits most from your LongRoPE research?

ZHANG

05:40

Well, our work significantly improves LLMs' ability to handle long context in real-world applications, such as long-context retrieval, code debugging, and even multi-modality LLM applications. Moreover, our method achieves this with minimal modifications to the RoPE positional embedding. Therefore, it can be widely applied to production. We have integrated LongRoPE into the Microsoft Phi-3 128K family, which are

06:13

the first long-context LLMs in their class. Before LongRoPE, Phi models had only a 2K context window.

HUIZINGA

06:22

So who is your primary user?

ZHANG

06:25

I think any users who want to use long-context LLMs can be our audience.

HUIZINGA

06:32

So it's a wide audience.

ZHANG

06:34

Yeah, it’s a wide audience.

HUIZINGA

06:35

It's about now that I always ask the “golden nugget” question. If you wanted to leave our listeners with one key takeaway from this research, what would it be?

ZHANG

06:45

Well, if there's one key takeaway from our work, it must be our key finding that non-uniformities in rotary positional embedding are crucial for LLM context window extension. And if you want to build a high-quality long-context LLM, LongRoPE is all you need to know!

HUIZINGA

07:06

Talk about what's left to do in this field in terms of open questions and outstanding challenges. What's next on your research agenda, Li?

ZHANG

07:16

So far, there are still a couple of big questions in this field. First, it's challenging to achieve both strong long-context and short-context capabilities at the same time. Although we have managed to recover some of the short-context performance for long-context LLMs,

07:33

it has not recovered 100 percent. We are trying different approaches to close these gaps. Second, we want to figure out how we can use these long-context LLMs to solve more challenging tasks, and then we can push these models to work harder and smarter for us.

07:53

[MUSIC]

HUIZINGA

07:53

Well, Li Lyna Zhang, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv. See you next time on Abstracts!

08:18

[MUSIC FADES]
