
Abstracts: July 29, 2024

Jul 29, 2024, 8 min

Episode description

A lack of appropriate data, decreased model performance, and other obstacles have made it difficult to expand the input language models can receive. Li Lyna Zhang introduces LongRoPE, a method capable of extending context windows to more than 2 million tokens.

Read the paper

Get the code

Transcript

GRETCHEN HUIZINGA

Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers.

[MUSIC FADES]

GRETCHEN HUIZINGA

My guest today is Dr. Li Lyna Zhang, a senior researcher at Microsoft Research. Dr. Zhang is coauthor of a paper called “LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens.” This paper was featured at this year's International Conference on Machine Learning, or ICML. Li, thanks so much for joining us today on Abstracts!

LI LYNA ZHANG

Thank you for having me.

HUIZINGA

So let's start with a brief overview of your paper. Tell us about the issue your research addresses and why it matters.

ZHANG

OK, so this paper is about how to effectively extend the context window of large language models beyond 2 million tokens. Why is this important? Because enabling longer input contexts can improve LLM capabilities. Right now, some LLMs can only handle a limited context window of 4K tokens, which is about 10 pages in a book. With our method, we can push the LLM context window to over 2 million tokens. That means you can put all seven Harry Potter books into the LLM and ask any question about the story! Another important thing is that our method is super efficient. It requires minimal changes to the LLM architecture, and most existing optimizations can be reused. Therefore, our method can be easily applied in real production.

HUIZINGA

So it sounds like what you're working on is improving the memory span of artificial intelligence or large language models. So what's already been done in this field, and what unique contributions does your work bring?

ZHANG

Well, there has been a lot of work in building long-context LLMs. For example, pretraining with an efficient model architecture, using RAG (retrieval-augmented generation), and extending the context window with RoPE positional interpolation. Our approach uses the last technique. Let me briefly explain it. RoPE stands for rotary positional embedding, which encodes token position information for transformer models. When we pretrain an LLM, we set a context window size, and all token positions have a predefined range of RoPE values. Extending to a longer context window introduces new token positions that can fall outside this predefined range, leading to out-of-distribution issues and making fine-tuning difficult. RoPE positional interpolation solves this by downscaling positional embeddings to fit within the pretrained range. However, positional embeddings like RoPE exhibit non-uniform information entropy in transformer models. Existing approaches do not effectively handle these non-uniformities during RoPE interpolation, leading to information loss and limiting the context window size. Our method addresses this challenge; therefore, it can achieve the longest context window size.
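[Editor's illustration, not part of the conversation.] The out-of-range problem and the uniform interpolation fix described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it uses the common RoPE angle formula with base 10000, and the names `rope_angles`, `train_len`, and `target_len`, along with the tiny `dim`, are hypothetical choices for illustration rather than anything taken from the LongRoPE code.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotation angles, assuming the standard RoPE form:
    angle[p, i] = (p / scale) * base ** (-2i / dim)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per dimension pair
    return np.outer(np.asarray(positions, dtype=float) / scale, inv_freq)

# Assumed sizes for illustration only.
train_len, target_len, dim = 4096, 16384, 8
scale = target_len / train_len  # uniform interpolation factor (4x here)

last_pos = [target_len - 1]
plain = rope_angles(last_pos, dim)                 # no interpolation
interp = rope_angles(last_pos, dim, scale=scale)   # uniform positional interpolation

# In the first dimension pair the frequency is 1, so the angle equals the
# effective position. Without interpolation it lies beyond the trained range;
# with interpolation it is squeezed back inside it.
print("without interpolation:", plain[0, 0], "trained range ends at", train_len)
print("with interpolation:   ", interp[0, 0])
```

Every dimension is divided by the same factor here, which is exactly the uniform treatment that the non-uniformity argument above calls into question.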

HUIZINGA

OK, so, Li, how would you describe the methodology you used for this work, and how did you go about conducting the research?

ZHANG

OK. So our method is to interpolate the RoPE positional embedding. It has three main steps. First, we introduce an efficient evolution search algorithm to perform non-uniform RoPE positional interpolation. Second, we propose a progressive context window extension strategy. It begins by searching for a 256K length on the pretrained LLM and fine-tuning it at this length. Then, based on the fine-tuned 256K LLM, we did a second search for new RoPE interpolations to achieve a 2048K context window size. Finally, since long-context LLMs lose performance at their original context window, we readjusted the non-uniform positional interpolation at a 4K length to recover the short-context-window performance.
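[Editor's illustration, not part of the conversation.] The first step, an evolution search over non-uniform interpolation factors, can be pictured with a toy example. This is a generic evolutionary loop, not the authors' algorithm: the function `evolution_search`, its parameters, and especially the `toy_fitness` objective are hypothetical stand-ins (a real search would score each candidate by evaluating the model), and the 64x bound assumes a 4K to 256K extension.

```python
import random

def evolution_search(fitness, num_dims, max_scale,
                     pop_size=16, generations=30, seed=0):
    """Toy evolutionary search for one rescale factor per RoPE dimension pair.
    Lower fitness is better. Each factor stays within [1, max_scale]."""
    rng = random.Random(seed)

    def random_candidate():
        return [rng.uniform(1.0, max_scale) for _ in range(num_dims)]

    def mutate(cand):
        # Perturb every factor multiplicatively, then clamp to the valid range.
        return [min(max_scale, max(1.0, s * rng.uniform(0.8, 1.25))) for s in cand]

    def crossover(a, b):
        # Take each factor from one of the two parents at random.
        return [rng.choice(pair) for pair in zip(a, b)]

    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 4]      # keep the best quarter
        children = []
        while len(parents) + len(children) < pop_size:
            if rng.random() < 0.5:
                children.append(mutate(rng.choice(parents)))
            else:
                children.append(crossover(*rng.sample(parents, 2)))
        population = parents + children
    return min(population, key=fitness)

# Hypothetical placeholder objective: distance to a made-up profile in which
# the factor grows with the dimension index. It only exercises the loop.
num_dims, max_scale = 4, 64.0
target = [1.0 + (max_scale - 1.0) * i / (num_dims - 1) for i in range(num_dims)]

def toy_fitness(cand):
    return sum((s - t) ** 2 for s, t in zip(cand, target))

best = evolution_search(toy_fitness, num_dims, max_scale)
print(best)
```

Under the progressive strategy described above, a search like this would run once to reach 256K, the model would be fine-tuned, and a second search would then start from the fine-tuned model to reach 2048K.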

HUIZINGA

Let's talk about findings. Tell us how things worked out for you and what you found as a result of your experiments.

ZHANG

Yeah. Our study verified two important non-uniformities in LLM context window extension. We identified that lower RoPE dimensions and initial token positions require less interpolation because they contain crucial, high-frequency information. Higher RoPE dimensions require more interpolation because they carry sparse, low-frequency information.
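[Editor's illustration, not part of the conversation.] The two non-uniformities just described can be sketched as a variant of RoPE in which each dimension pair has its own interpolation factor and the first few token positions are left uninterpolated. This is a minimal sketch under assumptions: the function name `nonuniform_rope_angles`, the parameters `scales` and `keep_first`, and the example values are all hypothetical, and the standard RoPE angle formula with base 10000 is assumed.

```python
import numpy as np

def nonuniform_rope_angles(positions, dim, scales, keep_first=0, base=10000.0):
    """RoPE angles with a separate interpolation factor per dimension pair.
    Positions below `keep_first` are not interpolated at all."""
    positions = np.asarray(positions, dtype=float)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)    # high frequency first
    scales = np.asarray(scales, dtype=float)            # one factor per pair
    angles = np.outer(positions, inv_freq / scales)     # interpolated angles
    untouched = np.outer(positions, inv_freq)           # original angles
    initial = positions < keep_first                    # mask of initial tokens
    angles[initial] = untouched[initial]
    return angles

# Example values chosen for illustration: small factors for the lower
# (high-frequency) dimensions, larger factors for the higher ones.
dim = 8
scales = [1.0, 2.0, 4.0, 8.0]
positions = [0, 2, 100]
angles = nonuniform_rope_angles(positions, dim, scales, keep_first=4)
print(angles)
```

In this example, position 2 keeps its original angles because it falls below `keep_first`, while position 100 is compressed by a different factor in each dimension pair.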

HUIZINGA

So work in the lab is always interesting, but deployment in real-world settings is often another story. If everything is successful, Li, who benefits most from your LongRoPE research?

ZHANG

Well, our work significantly improves LLMs' capabilities to handle long context in real-world applications, such as long-context retrieval, code debugging, and even multi-modality LLM applications. Moreover, our method achieves this with minimal modifications to the RoPE positional embedding. Therefore, it can be widely applied to production. We have integrated LongRoPE into the Microsoft Phi-3 128K family, which are the first long-context LLMs in their class. Before LongRoPE, Phi models had only a 2K context window.

HUIZINGA

So who is your primary user?

ZHANG

I think any users who want to use long-context LLMs can be our audience.

HUIZINGA

So it's a wide audience.

ZHANG

Yeah, it’s a wide audience.

HUIZINGA

It's about now that I always ask the “golden nugget” question. If you wanted to leave our listeners with one key takeaway from this research, what would it be?

ZHANG

Well, if there's one key takeaway from our work, it must be our key finding that non-uniformities in rotary positional embedding are crucial for LLM context window extension. And if you want to build a high-quality long-context LLM, LongRoPE is all you need to know!

HUIZINGA

Talk about what's left to do in this field in terms of open questions and outstanding challenges. What's next on your research agenda, Li?

ZHANG

So far, there are still a couple of big questions in this field. First, it's challenging to achieve both strong long-context and short-context capabilities at the same time. Although we have managed to recover some of the short-context performance for the long-context LLM, it has not recovered 100 percent. We are trying different approaches to close these gaps. Second, we want to figure out how we can use these long-context LLMs to solve more challenging tasks, and then we can push this model to work harder and smarter for us.

[MUSIC]

HUIZINGA

Well, Li Lyna Zhang, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv. See you next time on Abstracts!

[MUSIC FADES]
