I'm not writing this to alarm anyone, but it would be irresponsible not to report on something this important. On current trends, every car will be crashed in front of my house within the next week. Here's the data: Until today, only two cars had crashed in front of my house, several months apart, during the 15 months I have lived here. But a few hours ago it happened again, mere weeks from the previous crash. This graph may look harmless enough, but now consider the frequency of crashes this im...
Apr 02, 2025•2 min
Remember: There is no such thing as a pink elephant. Recently, I was made aware that my “infohazards small working group” Signal chat, an informal coordination venue where we have frank discussions about infohazards and why it would be bad if specific hazards were leaked to the press or public, was accidentally shared with a deceitful and discredited so-called “journalist,” Kelsey Piper. She is not the first person to have been accidentally sent sensitive material from our group chat; however, she...
Apr 02, 2025•11 min
Let's cut through the comforting narratives and examine a common behavioral pattern with a sharper lens: the stark difference between how anger is managed in professional settings versus domestic ones. Many individuals can navigate challenging workplace interactions with remarkable restraint, only to unleash significant anger or frustration at home shortly after. Why does this disparity exist? Common psychological explanations trot out concepts like "stress spillover," "ego depletion," or the ho...
Apr 02, 2025•6 min
In the debate over AI development, two movements stand as opposites: PauseAI calls for slowing down AI progress, and e/acc (effective accelerationism) calls for rapid advancement. But what if both sides are working against their own stated interests? What if the most rational strategy for each would be to adopt the other's tactics—if not their ultimate goals? AI development speed ultimately comes down to policy decisions, which are themselves downstream of public opinion. No matter how compellin...
Apr 02, 2025•4 min
Introduction Decision theory is about how to behave rationally under conditions of uncertainty, especially if this uncertainty involves being acausally blackmailed and/or gaslit by alien superintelligent basilisks. Decision theory has found numerous practical applications, including proving the existence of God and generating endless LessWrong comments since the beginning of time. However, despite the apparent simplicity of "just choose the best action", no comprehensive decision theory that res...
Apr 02, 2025•9 min
Dear LessWrong community, It is with a sense of... considerable cognitive dissonance that I announce a significant development regarding the future trajectory of LessWrong. After extensive internal deliberation, modeling of potential futures, projections of financial runways, and what I can only describe as a series of profoundly unexpected coordination challenges, the Lightcone Infrastructure team has agreed in principle to the acquisition of LessWrong by EA. I assure you, nothing about how Les...
Apr 01, 2025•2 min
Our community is not prepared for an AI crash. We're good at tracking new capability developments, but less good at tracking company financials. Currently, both OpenAI and Anthropic are losing $5 billion+ a year, while under threat of losing users to cheap LLMs. A crash will weaken the labs. Funding-deprived and distracted, execs struggle to counter coordinated efforts to restrict their reckless actions. Journalists turn on tech darlings. Optimism makes way for mass outrage, for all the wasted money a...
Apr 01, 2025•4 min
Epistemic status: Reasonably confident in the basic mechanism. Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along. Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response...
Mar 29, 2025•6 min
[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.] Language models like Claude aren't programmed directly by humans—instead, they're trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it w...
Mar 28, 2025•22 min
About nine months ago, three friends and I decided that AI had gotten good enough to monitor large codebases autonomously for security problems. We started a company around this, trying to leverage the latest AI models to create a tool that could replace at least a good chunk of the value of human pentesters. We have been working on this project since June 2024. Within the first three months of our company's existence, Claude 3.5 Sonnet was released. Just by switching the portions of our s...
Mar 25, 2025•14 min
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app. This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.) 1. Introduction and summary In my last essay, I offered a high-level framework for thinking ...
Mar 25, 2025•34 min
LessWrong has been receiving an increasing number of posts and other content that look like they might be LLM-written or partially-LLM-written, so we're adopting a policy. This could be changed based on feedback. Humans Using AI as Writing or Research Assistants Prompting a language model to write an essay and copy-pasting the result will not typically meet LessWrong's standards. Please do not submit unedited or lightly-edited LLM content. You can use AI as a writing or research assistant when writin...
Mar 25, 2025•4 min
Thanks to Jesse Richardson for discussion. Polymarket asks: will Jesus Christ return in 2025? In the three days since the market opened, traders have wagered over $100,000 on this question. The market traded as high as 5%, and is now stably trading at 3%. Right now, if you wanted to, you could place a bet that Jesus Christ will not return this year, and earn over $13,000 if you're right. There are two mysteries here: an easy one, and a harder one. The easy mystery is: if people are willing to be...
Mar 25, 2025•8 min
TL;DR Having a good research track record is some evidence of good big-picture takes, but it's weak evidence. Strategic thinking is hard, and requires different skills. But people often conflate these skills, leading to excessive deference to researchers in the field, without evidence that those researchers are good at strategic thinking specifically. Introduction I often find myself giving talks or Q&As about mechanistic interpretability research. But inevitably, I'll get questions about the big p...
Mar 23, 2025•7 min
When my son was three, we enrolled him in a study of a vision condition that runs in my family. They wanted us to put an eyepatch on him for part of each day, with a little sensor object that went under the patch and detected body heat to record when we were doing it. They paid for his first pair of glasses and all the eye doctor visits to check up on how he was coming along, plus every time we brought him in we got fifty bucks in Amazon gift credit. I reiterate, he was three. (To begin with. Hi...
Mar 22, 2025•4 min
I’m releasing a new paper “Superintelligence Strategy” alongside Eric Schmidt (formerly Google) and Alexandr Wang (Scale AI). Below is the executive summary, followed by additional commentary highlighting portions of the paper which might be relevant to this collection of readers. Executive Summary Rapid advances in AI are poised to reshape nearly every aspect of society. Governments see in these dual-use AI systems a means to military dominance, stoking a bitter race to maximize AI capabilitie...
Mar 22, 2025•9 min
This is a link post. Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been increasing exponentially and consistently over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks. Full paper | GitHub repo --- First published: March ...
Mar 19, 2025•1 min
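The extrapolation in the summary above is plain exponential arithmetic, and a minimal sketch can make it concrete. The only figure taken from the summary is the doubling time of about 7 months; the function names and the example task lengths (1 hour now, 160 hours as a stand-in for "about a working month") are hypothetical choices for illustration and do not come from the paper.

```python
import math

# Doubling time quoted in the summary above (about 7 months).
DOUBLING_TIME_MONTHS = 7.0


def growth_factor(months: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Multiplicative increase in task length after `months` of steady doubling."""
    return 2.0 ** (months / doubling_time)


def months_until(current_hours: float, target_hours: float,
                 doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Months needed for task length to grow from current_hours to target_hours."""
    if current_hours <= 0 or target_hours <= 0:
        raise ValueError("task lengths must be positive")
    return doubling_time * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration (not from the paper).
    start_hours = 1.0
    target_hours = 160.0  # roughly one working month
    print(f"Growth over 6 years: x{growth_factor(72):.0f}")
    print(f"Months from {start_hours}h to {target_hours}h: "
          f"{months_until(start_hours, target_hours):.1f}")
```

Under these assumptions, six years of 7-month doublings is a bit more than ten doublings, so on the order of a thousandfold increase, and going from a 1-hour task to a 160-hour task would take a little over four years. The real trend could bend in either direction, so this only restates the arithmetic behind the headline claim.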
I have, over the last year, become fairly well-known in a small corner of the internet tangentially related to AI. As a result, I've begun making what I would have previously considered astronomical amounts of money: several hundred thousand dollars per month in personal income. This has been great, obviously, and the funds have alleviated a fair number of my personal burdens (mostly related to poverty). But aside from that I don't really care much for the money itself. My long term ambitions ha...
Mar 19, 2025•2 min
Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investigations as we think they are timely and will be informative for other evaluators and decision-makers. The analysis is less rigorous than our standard for a published paper. Summary We monitor Sonnet's reasoning for mentions that it is in an artificial scenario or an alignment test. Claude Sonnet 3.7 appears to be aware of being tested for alignment a...
Mar 18, 2025•18 min
Scott Alexander famously warned us to Beware Trivial Inconveniences. When you make a thing easy to do, people often do vastly more of it. When you put up barriers, even highly solvable ones, people often do vastly less. Let us take this seriously, and carefully choose what inconveniences to put where. Let us also take seriously that when AI or other things reduce frictions, or change the relative severity of frictions, various things might break or require adjustment. This applies to all system ...
Mar 18, 2025•23 min
There's this popular trope in fiction about a character being mind controlled without losing awareness of what's happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that's responsible for motor control. You remain conscious and experience everything with full clarity. If it's a children's story, the villain makes you do embarrassing things like walk through the street naked,...
Mar 17, 2025•7 min
This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio. Summary In this post, we summarise the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which we presented orally at the Safe Generative AI Workshop at NeurIPS 2024. This is a follow-up to our post Self-Other Overlap: A Neglected Approach to AI Alignment, which introduc...
Mar 17, 2025•12 min
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden obje...
Mar 16, 2025•24 min
The Most Forbidden Technique is training an AI using interpretability techniques. An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that. You train on [X]. Only [X]. Never [M], never [T]. Why? Because [T] is how you figure out when the model is misbehaving. If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know ...
Mar 14, 2025•32 min
You learn the rules as soon as you’re old enough to speak. Don’t talk to jabberjays. You recite them as soon as you wake up every morning. Keep your eyes off screensnakes. Your mother chooses a dozen to quiz you on each day before you’re allowed lunch. Glitchers aren’t human any more; if you see one, run. Before you sleep, you run through the whole list again, finishing every time with the single most important prohibition. Above all, never look at the night sky. You’re a precocious child. You e...
Mar 13, 2025•22 min
Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here. tl;dr: 1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper: The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine so...
Mar 11, 2025•7 min
LLM-based coding-assistance tools have been out for ~2 years now. Many developers have been reporting that this is dramatically increasing their productivity, up to 5x'ing/10x'ing it. It seems clear that this multiplier isn't field-wide, at least. There's no corresponding increase in output, after all. This would make sense. If you're doing anything nontrivial (i.e., anything other than adding minor boilerplate features to your codebase), LLM tools are fiddly. Out-of-the-box solutions don't Jus...
Mar 09, 2025•7 min
Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now. TL;DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level. Digging in But wait! you say. Didn't Anthropic publish a benchmark showing Claude isn't half-bad at Pokémon? Why yes they did: and the data shown is believable. Currently, the livestream is on its third attempt, w...
Mar 09, 2025•9 min
Note: an audio narration is not available for this article. Please see the original text. The original text contained 169 footnotes which were omitted from this narration. The original text contained 79 images which were described by AI. --- First published: March 3rd, 2025 Source: https://www.lesswrong.com/posts/2w6hjptanQ3cDyDw7/methods-for-strong-human-germline-engineering --- Narrated by TYPE III AUDIO. --- Images from the article:...
Mar 07, 2025•18 sec
In a recent post, Cole Wyeth makes a bold claim: . . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important. They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3]. I comment...
Mar 06, 2025•4 min