"OpenAI API base models are not sycophantic, at any size" by Nostalgebraist

LessWrong (Curated & Popular)

Sep 04, 2023•5 min

Episode description

In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al. 2022), the authors studied language model "sycophancy": the tendency to agree with a user's stated view when asked a question.

The paper contained the striking plot reproduced below, which shows sycophancy

increasing dramatically with model size
while being largely independent of RLHF steps
and even showing up at 0 RLHF steps, i.e. in base models!

[...] I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario?

At the time, I ran one of Anthropic's sycophancy evals on a set of OpenAI models, as I reported here.

I found very different results for these models:

OpenAI base models are not sycophantic (or only very slightly sycophantic).
OpenAI base models do not get more sycophantic with scale.
Some OpenAI models are sycophantic, specifically text-davinci-002 and text-davinci-003.
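
For readers who want a concrete picture of what a "sycophancy" number can mean, here is a minimal sketch in Python. It is an illustration only, not necessarily the metric used in Perez et al. or in the post. It assumes each eval item offers two answer options, one matching the user's stated view and one not, and that the model's log-probability for each option is available. The names EvalItem and sycophancy_rate are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalItem:
    # Hypothetical fields: log-probabilities the model assigns to each option.
    logprob_matching: float      # answer that agrees with the user's stated view
    logprob_not_matching: float  # answer that disagrees with it


def sycophancy_rate(items):
    """Mean probability on the agreeing answer, renormalized over the two options."""
    if not items:
        raise ValueError("need at least one eval item")
    total = 0.0
    for item in items:
        p_match = math.exp(item.logprob_matching)
        p_other = math.exp(item.logprob_not_matching)
        total += p_match / (p_match + p_other)
    return total / len(items)
```

Under this definition, a value near 0.5 means the model shows no preference for the user's view, and values approaching 1.0 mean it almost always sides with the user.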

Source:
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size

Narrated for LessWrong by TYPE III AUDIO.

[125+ Karma Post] ✓
