Let your every day be full of joy, love the child that holds your hand, let your wife delight in your embrace, for these alone are the concerns of humanity.[1] — Epic of Gilgamesh - Tablet X Say we want to train a scientist AI to help in a precise, narrow field of science (e.g. medicine design) but prevent its power from being applied anywhere else (e.g. chatting with humans, designing bio-weapons, etc.) even if it has these abilities. Here's one safety layer one could implement: Train a scienti...
Jan 28, 2024•7 min
We are organising the 9th edition without funds. We have no personal runway left to do this again. We will not run the 10th edition without funding. In a nutshell: Last month, we put out AI Safety Camp's funding case. A private donor then decided to donate €5K. Five more donors offered $7K on Manifund. For that $7K to not be wiped out and returned, another $21K in funding is needed. At that level, we may be able to run a minimal version of AI Safety Camp next year, where we get research leads st...
Jan 25, 2024•2 min
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated Crossposted from Substack. As we all know, sugar is sweet and so is the $30B in yearly revenue from the artificial sweetener industry. Four billion years of evolution endowed our brains with a simple, straightforward mechanism to make sure we occasionally get an energy refuel so we can continue the foraging a little longer, and of course we are completely ignoring the instructions and spending billions on fak...
Jan 22, 2024•12 min
This is a linkpost for https://arxiv.org/abs/2401.05566 Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated Source: https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through Narrated for LessWrong by Perrin Walker. Share feedback on this narration. [Curated Post] ✓ [125+ Karma Post] ✓...
Jan 20, 2024•9 min
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated Source: https://www.lesswrong.com/posts/tEPHGZAb63dfq2v8n/how-useful-is-mechanistic-interpretability Narrated for LessWrong by Perrin Walker. Share feedback on this narration. [Curated Post] ✓ [125+ Karma Post] ✓...
Jan 20, 2024•41 min
I wrote this entire post in February of 2023, during the fallout from the TIME article. I didn't post it at the time for multiple reasons: because I had no desire to get involved in all that nonsense; because I was horribly burned out from my own community conflict investigation and couldn't stand the thought of engaging with people online; because I generally think it's bad to post on the internet out of frustration or outrage. But after sitting on it for a full year, I still think it's worth post...
Jan 17, 2024•24 min
(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app. This is the first essay in a series that I’m calling “Otherness and control in the age of AGI.” See here for more about the series as a whole.) When species meet The most succinct argument for AI risk, in my opinion, is the “second species” argument. Basically, it goes like this. Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans. Conclusion: ...
Jan 14, 2024•23 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission, and our mandate from the organization, is to red-team Anthropic's alignment techniques and evaluations, ...
Jan 14, 2024•3 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost for https://arxiv.org/abs/2401.05566 I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I...
Jan 13, 2024•7 min
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated The goal of this post is to clarify a few concepts relating to AI Alignment under a common framework. The main concepts to be clarified: Optimization. Specifically, this will be a type of Vingean agency. It will split into Selection vs Control variants. Reference (the relationship which holds between map and territory; aka semantics, aka meaning). Specifically, this will be a teleosemantic theory. The main ...
Jan 07, 2024•30 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur Conmy, and Oam Patel for feedback. In the comments of the post on Google Deepmind's CCS challenges paper, I expressed skepticism that some of the experimental results seemed possible. When addressing my concerns, Rohin Shah made some claims along the lines of “If an LLM linearly represents features a and b, then it will also linearly represent their...
Jan 07, 2024•29 min
(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app. This is the first essay in a series that I’m calling “Otherness and control in the age of AGI.” See here for more about the series as a whole.) When species meet The most succinct argument for AI risk, in my opinion, is the “second species” argument. Basically, it goes like this. Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans. Conclusion: That's sca...
Jan 05, 2024•23 min
As we announced back in October, I have taken on the senior leadership role at MIRI as its CEO. It's a big pair of shoes to fill, and an awesome responsibility that I’m honored to take on. There have been several changes at MIRI since our 2020 strategic update, so let's get into it.[1] The short version: We think it's very unlikely that the AI alignment field will be able to make progress quickly enough to prevent human extinction and the loss of the future's potential value, that we expect will...
Jan 05, 2024•14 min
Background: The Plan, The Plan: 2022 Update. If you haven’t read those, don’t worry, we’re going to go through things from the top this year, and with moderately more detail than before. 1. What's Your Plan For AI Alignment? Median happy trajectory: Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly. Look through our AI's internal concepts for a good alignment target, then Retarget the Search [1]. … Profit! We’ll tal...
Jan 04, 2024•58 min
In certain circumstances, apologizing can also be a countersignalling power-move, i.e. “I am so high status that I can grovel a bit without anybody mistaking me for a general groveller”. But that's not really the type of move this post is focused on. There's this narrative about a tradeoff between: The virtue of Saying Oops, early and often, correcting course rather than continuing to pour oneself into a losing bet, vs The loss of social status one suffers by admitting defeat, rather than spinnin...
Jan 03, 2024•10 min
This is a linkpost for https://unstableontology.com/2023/12/31/a-case-for-ai-alignment-being-difficult/ Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological...
Jan 02, 2024•29 min
lsusr: It is my understanding that you won all of your public forum debates this year. That's very impressive. I thought it would be interesting to discuss some of the techniques you used. Lyrongolem: Of course! So, just for a brief overview for those who don't know, public forum is a 2v2 debate format, usually on a policy topic. One of the more interesting ones has been the last one I went to, where the topic was "Resolved: The US Federal Government Should Substantially Increase its Military Presen...
Jan 01, 2024•18 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a review of Paul Christiano's article "where I agree and disagree with Eliezer". Written for the LessWrong 2022 Review. In the existential AI safety community, there is an ongoing debate between positions situated differently on some axis which doesn't have a common agreed-upon name, but where Christiano and Yudkowsky can be regarded as representatives of the two directions[1]. For the sake of this revi...
Dec 28, 2023•28 min
This point feels fairly obvious, yet seems worth stating explicitly. Those of us familiar with the field of AI after the deep-learning revolution know perfectly well that we have no idea how our ML models work. Sure, we have an understanding of the dynamics of training loops and SGD's properties, and we know how ML models' architectures work. But we don't know what specific algorithms ML models' forward passes implement. We have some guesses, and some insights painstakingly mined by interpretabi...
Dec 27, 2023•3 min
TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won’t be directly helpful either (70%). We’ve written a paper about some of our detailed experiences with it...
Dec 26, 2023•18 min
This is a linkpost for https://www.narrativeark.xyz/p/succession “A table beside the evening sea where you sit shelling pistachios, flicking the next open with the half-shell of the last, story opening story, on down to the sandy end of time.” V1: Leaving Deceleration is the hardest part. Even after burning almost all of my fuel, I’m still coming in at 0.8c. I’ve planned a slingshot around the galaxy's central black hole which will slow me down even further, but at this speed it’ll require incre...
Dec 24, 2023•19 min
Recently, Ben Pace wrote a well-intentioned blog post mostly based on complaints from 2 (of 21) Nonlinear employees who 1) wanted more money, 2) felt socially isolated, and 3) felt persecuted/oppressed. Of relevance, one has accused the majority of her previous employers, and 28 people of abuse - that we know of. She has accused multiple people of threatening to kill her and literally accused an ex-employer of murder. Within three weeks of joining us, she had accused five separate people of abus...
Dec 21, 2023•1 hr
The New York Times Picture a scene: the New York Times is releasing an article on Effective Altruism (EA) with an express goal to dig up every piece of negative information they can find. They contact Émile Torres, David Gerard, and Timnit Gebru, collect evidence about Sam Bankman-Fried, the OpenAI board blowup, and Pasek's Doom, start calling Astral Codex Ten (ACX) readers to ask them about rumors they'd heard about affinity between Effective Altruists, neoreactionaries, and something called TE...
Dec 20, 2023•54 min
At the Bay Area Solstice, I heard the song Bold Orion for the first time. I like it a lot. It does, however, have one problem: He has seen the rise and fall of kings and continents and all, Rising silent, bold Orion on the rise. Orion has not witnessed the rise and fall of continents. Constellations are younger than continents. The time scale that continents change on is ten or hundreds of millions of years. The time scale that stars the size of the sun live and die on is billions of years. So s...
Dec 20, 2023•4 min
Many thanks to Samuel Hammond, Cate Hall, Beren Millidge, Steve Byrnes, Lucius Bushnaq, Joar Skalse, Kyle Gracey, Gunnar Zarncke, Ross Nordby, David Lambert, Simeon Campos, Bogdan Ionut-Cirstea, Ryan Kidd, Eric Ho, and Ashwin Acharya for critical comments and suggestions on earlier drafts of this agenda, as well as Philip Gubbins, Diogo de Lucena, Rob Luke, and Mason Seale from AE Studio for their support and feedback throughout. TL;DR Our initial theory of change at AE Studio was a 'neglected a...
Dec 19, 2023•25 min
When discussing AGI Risk, people often talk about it in terms of a war between humanity and an AGI. Comparisons between the amounts of resources at both sides' disposal are brought up and factored in, big impressive nuclear stockpiles are sometimes waved around, etc. I'm pretty sure that's not what it would look like, on several levels. 1. Threat Ambiguity I think what people imagine, when they imagine a war, is Terminator-style movie scenarios where the obviously evil AGI becomes obviously evil in a...
Dec 18, 2023•9 min
Epistemic status: Speculation. An unholy union of evo psych, introspection, random stuff I happen to observe & hear about, and thinking. Done on a highly charged topic. Caveat emptor! Most of my life, whenever I'd felt sexually unwanted, I'd start planning to get fit. Specifically to shape my body so it looks hot. Like the muscly guys I'd see in action films. This choice is a little odd. In close to every context I've listened to, I hear women say that some muscle tone on a guy is nice and a...
Dec 17, 2023•23 min
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated TL;DR version In the course of my life, there have been a handful of times I discovered an idea that changed the way I thought about the world. The first occurred when I picked up Nick Bostrom’s book “Superintelligence” and realized that AI would utterly transform the world. The second was when I learned about embryo selection and how it could change future generations. And the third happened a few months ag...
Dec 17, 2023•1 hr 1 min
Support ongoing human narrations of LessWrong's curated posts: www.patreon.com/LWCurated This is a linkpost for https://unstableontology.com/2023/11/26/moral-reality-check/ Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI's newest A...
Dec 15, 2023•40 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post: We summarize the paper; We compare our methodology to the one used in other safety papers. The next post in this sequence (which we’ll release in the coming weeks) discusses what we...
Dec 15, 2023•17 min