This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end. Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution). Abstract We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting in...
Jan 28, 2025•14 min
The Cake Imagine that I want to bake a chocolate cake, and my sole goal in my entire lightcone and extended mathematical universe is to bake that cake. I care about nothing else. If the oven ends up a molten pile of metal ten minutes after the cake is done, if the leftover eggs are shattered and the leftover milk spilled, that's fine. Baking that cake is my terminal goal. In the process of baking the cake, I check my fridge and cupboard for ingredients. I have milk and eggs and flour, but no coc...
Jan 27, 2025•10 min
This post offers an accessible model of the psychology of character-trained LLMs like Claude. Epistemic Status This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions. Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in detail, but a rough sketch with evocative names which hope...
Jan 26, 2025•18 min
This is a link post. This is a blog post reporting some preliminary work from the Anthropic Alignment Science team, which might be of interest to researchers working actively in this space. We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments at a lab meeting, rather than a mature paper. We report a demonstration of a form of Out-of-Context Reasoning where training on documents which discuss (but don’t demonstrate) Claude's tendency to rew...
Jan 24, 2025•5 min
One hope for keeping existential risks low is to get AI companies to (successfully) make high-assurance safety cases: structured and auditable arguments that an AI system is very unlikely to result in existential risks given how it will be deployed.[1] Concretely, once AIs are quite powerful, high-assurance safety cases would require making a thorough argument that the level of (existential) risk caused by the company is very low; perhaps they would require that the total chance of existential r...
Jan 24, 2025•25 min
Cross-posted from Telescopic Turnip As we all know, humans are terrible at building butterflies. We can make a lot of objectively cool things like nuclear reactors and microchips, but we still can't create a proper artificial insect that flies, feeds, and lays eggs that turn into more butterflies. That seems like evidence that butterflies are incredibly complex machines – certainly more complex than a nuclear power facility. Likewise, when you google "most complex object in the universe", the fi...
Jan 24, 2025•29 min
This is a link post. A story I wrote about living through the transition to utopia. This is the one story that I've put the most time and effort into; it charts a course from the near future all the way to the distant stars. --- First published: January 19th, 2025 Source: https://www.lesswrong.com/posts/Rz4ijbeKgPAaedg3n/the-gentle-romance --- Narrated by TYPE III AUDIO ....
Jan 22, 2025•34 sec
This is a link post. Present alongside President Trump: Sam Altman, Larry Ellison (Oracle executive chairman and CTO), and Masayoshi Son (Softbank CEO who believes he was born to realize ASI). President Trump: What we want to do is we want to keep [AI datacenters] in this country. China is a competitor and others are competitors. President Trump: I'm going to help a lot through emergency declarations because we have an emergency. We have to get this stuff built. So they have to produce a lot of electric...
Jan 22, 2025•3 min
The AI Control Agenda, in its own words: … we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs...
Jan 21, 2025•13 min
I think a lot of people have heard so much about internalized prejudice and bias that they think they should ignore any bad vibes they get about a person that they can’t rationally explain. But if a person gives you a bad feeling, don’t ignore that. Both I and several others I know have generally come to regret it when we got a bad feeling about somebody and ignored it or rationalized it away. I’m not saying to endorse prejudice. But my experience is that many types of prejudice feel m...
Jan 20, 2025•3 min
(Both characters are fictional, loosely inspired by various traits from various real people. Be careful about combining kratom and alcohol.) The original text contained 24 images which were described by AI. --- First published: January 7th, 2025 Source: https://www.lesswrong.com/posts/KfZ4H9EBLt8kbBARZ/fiction-comic-effective-altruism-and-rationality-meet-at-a --- Narrated by TYPE III AUDIO.
Jan 19, 2025•5 min
From AI scientist to AI research fleet Research automation is here (1, 2, 3). We saw it coming and planned ahead, which puts us ahead of most (4, 5, 6). But that foresight also comes with a set of outdated expectations that are holding us back. In particular, research automation is not just about “aligning the first AI scientist”, it's also about the institution-building problem of coordinating the first AI research fleets. Research automation is not about developing a plug-and-play “AI scientis...
Jan 18, 2025•10 min
So we want to align future AGIs. Ultimately we’d like to align them to human values, but in the shorter term we might start with other targets, like e.g. corrigibility. That problem description all makes sense on a hand-wavy intuitive level, but once we get concrete and dig into technical details… wait, what exactly is the goal again? When we say we want to “align AGI”, what does that mean? And what about these “human values” - it's easy to list things which are importantly not human values (lik...
Jan 17, 2025•46 min
Traditional economics thinking has two strong principles, each based on abundant historical data: Principle (A): No “lump of labor”: If human population goes up, there might be some wage drop in the very short term, because the demand curve for labor slopes down. But in the longer term, people will find new productive things to do, such that human labor will retain high value. Indeed, if anything, the value of labor will go up, not down—for example, dense cities are engines of economic growth! P...
Jan 14, 2025•6 min
All quotes, unless otherwise marked, are Tolkien's words as printed in The Letters of J.R.R.Tolkien: Revised and Expanded Edition. All emphases mine. Machinery is Power is Evil Writing to his son Michael in the RAF: [here is] the tragedy and despair of all machinery laid bare. Unlike art which is content to create a new secondary world in the mind, it attempts to actualize desire, and so to create power in this World; and that cannot really be done with any real satisfaction. Labour-saving machi...
Jan 14, 2025•57 min
The anonymous review of The Anti-Politics Machine published on Astral Codex X focuses on a case study of a World Bank intervention in Lesotho, and tells a story about it: The World Bank staff drew reasonable-seeming conclusions from sparse data, and made well-intentioned recommendations on that basis. However, the recommended programs failed, due to factors that would have been revealed by a careful historical and ethnographic investigation of the area in question. Therefore, we should spend mor...
Jan 13, 2025•15 min
Crossposted from my personal blog. I was inspired to cross-post this here given the discussion that this post on the role of capital in an AI future elicited. When discussing the future of AI, I semi-often hear an argument along the lines that in a slow takeoff world, despite AIs automating increasingly more of the economy, humanity will remain in the driving seat because of its ownership of capital. This argument posits a world where humanity effectively becomes a rentier class living well off the vas...
Jan 11, 2025•25 min
TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: It seems likely to find features of the activations - features that help explain the statistical structure of activation spaces, rather than features of the model - the features the model's own computations make use of. Written at Apollo Research Introduction Claim: Activation space interpretability is likely to give us f...
Jan 10, 2025•16 min
Funding for $150bn training systems just turned less speculative, with OpenAI o3 reaching 25% on FrontierMath, 70% on SWE-Verified, 2700 on Codeforces, and 80% on ARC-AGI. These systems will be built in 2026-2027 and enable pretraining models for 5e28 FLOPs, while o3 itself is plausibly based on an LLM pretrained only for 8e25-4e26 FLOPs. The natural text data wall won't seriously interfere until 6e27 FLOPs, and might be possible to push until 5e28 FLOPs. Scaling of pretraining won't end just ye...
Jan 09, 2025•9 min
(Cross-post from https://amistrongeryet.substack.com/p/are-we-on-the-brink-of-agi, lightly edited for LessWrong. The original has a lengthier introduction and a bit more explanation of jargon.) No one seems to know whether transformational AGI is coming within a few short years. Or rather, everyone seems to know, but they all have conflicting opinions. Have we entered into what will in hindsight be not even the early stages, but actually the middle stage, of the mad tumbling rush into singularit...
Jan 09, 2025•25 min
I mostly work on risks from scheming (that is, misaligned, power-seeking AIs that plot against their creators such as by faking alignment). Recently, I (and co-authors) released "Alignment Faking in Large Language Models", which provides empirical evidence for some components of the scheming threat model. One question that's really important is how likely scheming is. But it's also really important to know how much we expect this uncertainty to be resolved by various key points in the future. I ...
Jan 08, 2025•1 hr 19 min
This week, Altman offers a post called Reflections, and he has an interview in Bloomberg. There's a bunch of good and interesting answers in the interview about past events that I won’t mention or have to condense a lot here, such as his going over his calendar and all the meetings he constantly has, so consider reading the whole thing. Table of Contents The Battle of the Board. Altman Lashes Out. Inconsistently Candid. On Various People Leaving OpenAI. The Pitch. Great Expectations. Accusations...
Jan 08, 2025•20 min
As someone who writes for fun, I don't need to get people onto my site: If I write a post and some people are able to get the core idea just from the title or a tweet-length summary, great! I can include the full contents of my posts in my RSS feed and on FB, because so what if people read the whole post there and never click through to my site? It would be different if I funded my writing through ads (maximize time on site to maximize impressions) or subscriptions (get the chance to pitch, probably w...
Jan 07, 2025•2 min
This is a low-effort post. I mostly want to get other people's takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I’d like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion. I think short timelines, e.g. AIs that can replace a top researcher at an AG...
Jan 02, 2025•44 min
from aisafety.world The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense that 1) we are not specialists in almost any of it and that 2) we only spent about an hour on each entry. We also only use public information, so we are bound to be off by some additional factor. The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (som...
Dec 30, 2024•1 hr 57 min
I've heard many people say something like "money won't matter post-AGI". This has always struck me as odd, and as most likely completely incorrect. First: labour means human mental and physical effort that produces something of value. Capital goods are things like factories, data centres, and software—things humans have built that are used in the production of goods and services. I'll use "capital" to refer to both the stock of capital goods and to the money that can pay for them. I'll say "mone...
Dec 29, 2024•29 min
Take a stereotypical fantasy novel, a textbook on mathematical logic, and Fifty Shades of Grey. Mix them all together and add extra weirdness for spice. The result might look a lot like Planecrash (AKA: Project Lawful), a work of fiction co-written by "Iarwain" (a pen-name of Eliezer Yudkowsky) and "lintamande". (image from Planecrash) Yudkowsky is not afraid to be verbose and self-indulgent in his writing. He previously wrote a Harry Potter fanfic that includes what's essentially an extended En...
Dec 28, 2024•39 min
A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is". Over the past few years, a major source of my relative optimism on AI has been the ho...
Dec 26, 2024•14 min
TL;DR: If you want to know whether getting insurance is worth it, use the Kelly Insurance Calculator. If you want to know why or how, read on. Note to LW readers: this is almost the entire article, except some additional maths that I couldn't figure out how to get right in the LW editor, and margin notes. If you're very curious, read the original article! Misunderstandings about insurance People online sometimes ask if they should get some insurance, and then other people say incorrect things, l...
Dec 23, 2024•11 min
My median expectation is that AGI[1] will be created 3 years from now. This has implications for how to behave, and I will share some useful thoughts I and others have had on how to orient to short timelines. I’ve led multiple small workshops on orienting to short AGI timelines and compiled the wisdom of around 50 participants (but mostly my thoughts) here. I’ve also participated in multiple short-timelines AGI wargames and co-led one wargame. This post will assume median AGI timelines of 2027 an...
Dec 23, 2024•15 min