This is a linkpost for "Sparse Autoencoders Find Highly Interpretable Directions in Language Models". We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability. Source: ht...
Sep 27, 2023•10 min
Epistemic status: model which I find sometimes useful, and which emphasizes some true things about many parts of the world which common alternative models overlook. Probably not correct in full generality. Consider Yoshua Bengio, one of the people who won a Turing Award for deep learning research. Looking at his work, he clearly “knows what he’s doing”. He doesn’t know what the answers will be in advance, but he has some models of what the key questions are, what the key barriers are, and at lea...
Sep 26, 2023•9 min
I’m writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of Apollo Research or any other program I’m involved with. TL;DR: I argue why I think there should be more AI safety orgs. I’ll also provide some suggestions on how that could be achieved. The core argument is that there is a lot of unused talent and I don’t think existing orgs scale fast enough to absorb it. Thus, more orgs are needed. This post can also serve as a call to action ...
Sep 25, 2023•30 min
Cross-posted from Substack. "Everything in the world is about sex, except sex. Sex is about clonal interference." – Oscar Wilde (kind of) As we all know, sexual reproduction is not about reproduction. Reproduction is easy. If your goal is to fill the world with copies of your genes, all you need is a good DNA-polymerase to duplicate your genome, and then to divide into two copies of yourself. Asexual reproduction is just better in every way. It's pretty clear that, on a direct one-v-one cage ma...
Sep 22, 2023•30 min
Patrick Collison has a fantastic list of examples of people quickly accomplishing ambitious things together since the 19th Century. It does make you yearn for a time that feels... different, when the lethargic behemoths of government departments could move at the speed of a racing startup: [...] last century, [the Department of Defense] innovated at a speed that puts modern Silicon Valley startups to shame: the Pentagon was built in only 16 months (1941–1943), the Manhattan Project ran for just ...
Sep 20, 2023•46 min
This is a linkpost for https://www.youtube.com/watch?v=02kbWY5mahQ None of the presidents fully represent my (TurnTrout's) views. TurnTrout wrote the script. Garrett Baker helped produce the video after the audio was complete. Thanks to David Udell, Ulisse Mini, Noemi Chulo, and especially Rio Popper for feedback and assistance in writing the script. Source: https://www.lesswrong.com/posts/7M2iHPLaNzPNXHuMv/ai-presidents-discuss-ai-alignment-agendas YouTube video kindly provided by the authors. ...
Sep 19, 2023•24 min
I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory, whereas maybe they could have attracted more attention/interest from academic philosophy if the framing was instead that the UDT line of thinking shows that decision theory is just more deeply puzzling than anyone had previously realized. Instead of one major open problem (Newcomb's, or EDT vs CDT) now we have a whole bunch more. I'm really not sure at this point whether UDT is even on th...
Sep 18, 2023•3 min
How do you affect something far away, a lot, without anyone noticing? (Note: you can safely skip sections. It is also safe to skip the essay entirely, or to read the whole thing backwards if you like.) Source: https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks Narrated for LessWrong by TYPE III AUDIO. Share feedback on this narration. [125+ Karma Post] ✓...
Sep 11, 2023•19 min
This is a linkpost for https://docs.google.com/document/d/1TsYkDYtV6BKiCN9PAOirRAy3TrNDu2XncUZ5UZfaAKA/edit?usp=sharing Understanding what drives the rising capabilities of AI is important for those who work to forecast, regulate, or ensure the safety of AI. Regulations on the export of powerful GPUs need to be informed by understanding of how these GPUs are used, forecasts need to be informed by bottlenecks, and safety needs to be informed by an understanding of how the models of the future mig...
Sep 09, 2023•36 min
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by an Open Agency Architecture (OAA), if OAA turns out to be feasible. Source: https://www.lesswrong.com/posts/D97xnoRr6BHzo5HvQ/one-minute-every-moment
Sep 09, 2023•12 min
About how much information are we keeping in working memory at a given moment? "Miller's Law" dictates that the number of things humans can hold in working memory is "the magical number 7 ± 2". This idea is derived from Miller's experiments, which tested both random-access memory (where participants must remember call-response pairs, and give the correct response when prompted with a call) and sequential memory (where participants must memorize and recall a list in order). In both cases, 7 is ...
Sep 08, 2023•6 min
Added (11th Sept): Nonlinear have commented that they intend to write a response, have written a short follow-up, and claim that they dispute 85 claims in this post. I'll link here to that if-and-when it's published. Added (11th Sept): One of the former employees, Chloe, has written a lengthy comment personally detailing some of her experiences working at Nonlinear and the aftermath. Added (12th Sept): I've made 3 relatively minor edits to the post. I'm keeping a list of all edits at the botto...
Sep 08, 2023•56 min
Until about five years ago, I unironically parroted the slogan All Cops Are Bastards (ACAB) and earnestly advocated to abolish the police and prison system. I had faint inklings I might be wrong about this a long time ago, but it took a while to come to terms with its disavowal. What follows is intended to be not just a detailed account of what I used to believe but most pertinently, why. Despite being super egotistical, for whatever reason I do not experience an aversion to openly admitting mi...
Sep 08, 2023•11 min
In which: I list 9 projects that I would work on if I wasn’t busy working on safety standards at ARC Evals, and explain why they might be good to work on. Epistemic status: I’m prioritizing getting this out fast as opposed to writing it carefully. I’ve thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven’t done that much digging into each of these, and it’s likely that I’m wrong about many material facts. I also make little claim t...
Sep 08, 2023•25 min
To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses: Source: https://www.lesswrong.com/posts/fJqP9WcnHXBRBeiBg/meta-questions-about-metaphilosophy
Sep 04, 2023•5 min
We focus so much on arguing over who is at fault in this country that I think sometimes we fail to alert on what's actually happening. I would just like to point out, without attempting to assign blame, that American political institutions appear to be losing common knowledge of their legitimacy, and abandoning certain important traditions of cooperative governance. It would be slightly hyperbolic, but not unreasonable to me, to term what has happened "democratic backsliding". Source: https://ww...
Sep 04, 2023•4 min
In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question. The paper contained the striking plot reproduced below, which shows sycophancy increasing dramatically with model size while being largely independent of RLHF steps and even showing up at 0 RLHF steps, i.e. in base models! [...] I found this result startling when I read the original pa...
Sep 04, 2023•5 min
I keep seeing advice on ambition, aimed at people in college or early in their career, that would have been really bad for me at similar ages. Rather than contribute (more) to the list of people giving poorly universalized advice on ambition, I have written a letter to the one person I know my advice is right for: myself in the past. Source: https://www.lesswrong.com/posts/uGDtroD26aLvHSoK2/dear-self-we-need-to-talk-about-ambition-1
Aug 30, 2023•13 min
I've been trying to avoid the terms "good faith" and "bad faith". I'm suspicious that most people who have picked up the phrase "bad faith" from hearing it used, don't actually know what it means—and maybe, that the thing it does mean doesn't carve reality at the joints. People get very touchy about bad faith accusations: they think that you should assume good faith, but that if you've determined someone is in bad faith, you shouldn't even be talking to them, that you need to exile them. What d...
Aug 28, 2023•12 min
The Carving of Reality, the third volume of the Best of LessWrong books, is now available on Amazon (US). The Carving of Reality includes 43 essays from 29 authors. We've collected the essays into four books, each exploring two related topics. The "two intertwining themes" concept was first inspired as I looked over the cluster of "coordination" themed posts and noticed a recurring motif of not only "solving coordination problems" but also "dealing with the binding constraints that were causin...
Aug 28, 2023•6 min
LLMs can do many incredible things. They can generate unique creative content, carry on long conversations in any number of subjects, complete complex cognitive tasks, and write nearly any argument. More mundanely, they are now the state of the art for boring classification tasks and therefore have the capability to radically upgrade the censorship capacities of authoritarian regimes throughout the world. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort. Tha...
Aug 23, 2023•15 min
Intro: I am a psychotherapist, and I help people working on AI safety. I noticed patterns of mental health issues highly specific to this group. It's not just doomerism, there are way more of them that are less obvious. If you struggle with a mental health issue related to AI safety, feel free to leave a comment about it and about things that help you with it. You might also support others in the comments. Sometimes such support makes a lot of difference and people feel like they are not alone. ...
Aug 22, 2023•6 min
This is a linkpost for the article "Ten Thousand Years of Solitude", written by Jared Diamond for Discover Magazine in 1993, four years before he published Guns, Germs and Steel. That book focused on Diamond's theory that the geography of Eurasia, particularly its large size and common climate, allowed civilizations there to dominate the rest of the world because it was easy to share plants, animals, technologies and ideas. This article, however, examines the opposite extreme. Diamond looks at ...
Aug 22, 2023•8 min
I gave a talk about the different risk models, followed by an interpretability presentation, then I got a problematic question, "I don't understand, what's the point of doing this?" Hum. Feature viz? (left image) Um, it's pretty but is this useful? [1] Is this reliable? GradCam (a pixel attribution technique, like on the above right figure), it's pretty. But I’ve never seen anybody use it in industry. [2] Pixel attribution seems useful, but accuracy remains the king. [3] Induction heads? Ok, w...
Aug 21, 2023•1 hr 19 min
I've been workshopping a new rationality training paradigm. (By "rationality training paradigm", I mean an approach to learning/teaching the skill of "noticing what cognitive strategies are useful, and getting better at them.") I think the paradigm has promise. I've beta-tested it for a couple weeks. It’s too early to tell if it actually works, but one of my primary goals is to figure out if it works relatively quickly, and give up if it isn’t delivering. The goal of this post is to: Convey ...
Aug 15, 2023•16 min
Inflection.ai (co-founded by DeepMind co-founder Mustafa Suleyman) should be perceived as a frontier LLM lab of similar magnitude as Meta, OpenAI, DeepMind, and Anthropic based on their compute, valuation, current model capabilities, and plans to train frontier models. Compared to the other labs, Inflection seems to put less effort into AI safety. Thanks to Laker Newhouse for discussion and feedback. Source: https://www.lesswrong.com/posts/Wc5BYFfzuLzepQjCq/inflection-ai-is-a-major-agi-lab
Aug 15, 2023•7 min
TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research. If you’re interested in working on this agenda with us at Anthropic, we’re hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you’re interested in working on model organisms of misalignment. Source: h...
Aug 09, 2023•36 min
In "Towards understanding-based safety evaluations," I discussed why I think evaluating specifically the alignment of models is likely to require mechanistic, understanding-based evaluations rather than solely behavioral evaluations. However, I also mentioned in a footnote why I thought behavioral evaluations would likely be fine in the case of evaluating capabilities rather than evaluating alignment: However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card ...
Aug 09, 2023•17 min
Blogpost version Paper We have just released our first public report. It introduces methodology for assessing the capacity of LLM agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. Background ARC Evals develops methods for evaluating the safety of large language models (LLMs) in order to provide early warnings of models with dangerous capabilities. We have public partnerships with Anthropic and OpenAI to evaluate their AI systems, ...
Aug 04, 2023•8 min
Summary of Argument: The public debate among AI experts is confusing because there are, to a first approximation, three sides, not two sides to the debate. I refer to this as a 🔺three-sided framework, and I argue that using this three-sided framework will help clarify the debate (more precisely, debates) for the general public and for policy-makers. Source: https://www.lesswrong.com/posts/BTcEzXYoDrWzkLLrQ/the-public-debate-about-ai-is-confusing-for-the-general
Aug 04, 2023•7 min