Epistemic status: these are my own opinions on AI risk communication, based primarily on my own instincts on the subject and discussions with people less involved with rationality than myself. Communication is highly subjective and I have not rigorously A/B tested messaging. I am even less confident in the quality of my responses than in the correctness of my critique. If they turn out to be true, these thoughts can probably be applied to all sorts of communication beyond AI risk. Lots of work h...
Sep 11, 2024•9 min
In my last post, I wrote that no resource out there exactly captured my model of epistemology, which is why I wanted to share a half-baked version of it. But I do have one book which I always recommend to people who want to learn more about epistemology: Inventing Temperature by Hasok Chang. To be very clear, my recommendation is not just to get the good ideas from this book (of which there are many) from a book review or summary — it's to actually read the book, the old-school way, one word at ...
Sep 10, 2024•5 min
Our new video is an adaptation of That Alien Message, by @Eliezer Yudkowsky. This time, the text has been significantly adapted, so I include it below. Part 1 Picture a world just like ours, except the people are a fair bit smarter: in this world, Einstein isn’t one in a million, he's one in a thousand. In fact, here he is now. He's made all the same discoveries, but they’re not quite as unusual: there have been lots of other discoveries. Anyway, he's out one night with a friend looking up at th...
Sep 09, 2024•15 min
Personally, I suspect the alignment problem is hard. But even if it turns out to be easy, survival may still require getting at least the absolute basics right; currently, I think we're mostly failing even at that. Early discussion of AI risk often focused on debating the viability of various elaborate safety schemes humanity might someday devise—designing AI systems to be more like “tools” than “agents,” for example, or as purely question-answering oracles locked within some kryptonite-style bo...
Sep 07, 2024•2 min
Intro In April 2024, my colleague and I (both affiliated with Peking University) conducted a survey involving 510 students from Tsinghua University and 518 students from Peking University—China's two top academic institutions. Our focus was on their perspectives regarding the frontier risks of artificial intelligence. In the People's Republic of China (PRC), publicly accessible survey data on AI is relatively rare, so we hope this report provides some valuable insights into how people in the PRC...
Sep 07, 2024•24 min
Paging Gwern or anyone else who can shed light on the current state of the AI market—I have several questions. Since the release of ChatGPT, at least 17 companies, according to the LMSYS Chatbot Arena Leaderboard, have developed AI models that outperform it. These companies include Anthropic, NexusFlow, Microsoft, Mistral, Alibaba, Hugging Face, Google, Reka AI, Cohere, Meta, 01 AI, AI21 Labs, Zhipu AI, Nvidia, DeepSeek, and xAI. Since GPT-4's launch, 15 different companies have reportedly creat...
Sep 02, 2024•4 min
If you ask the internet if breastfeeding is good, you will soon learn that YOU MUST BREASTFEED because BREAST MILK = OPTIMAL FOOD FOR BABY. But if you look for evidence, you’ll discover two disturbing facts. First, there's no consensus about why breastfeeding is good. I’ve seen experts suggest at least eight possible mechanisms: Formula can’t fully reproduce the complex blend of fats, proteins and sugars in breast milk. Formula lacks various bio-active things in breast milk, like antibodies, whi...
Sep 01, 2024•18 min
Crossposted from https://williamrsaunders.substack.com/p/principles-for-the-agi-race Why form principles for the AGI Race? I worked at OpenAI for 3 years, on the Alignment and Superalignment teams. Our goal was to prepare for the possibility that OpenAI succeeded in its stated mission of building AGI (Artificial General Intelligence, roughly able to do most things a human can do), and then proceeded to make systems smarter than most humans. This will predictably face novel problems in controlli...
Aug 31, 2024•31 min
Two new The Information articles with insider information on OpenAI's next models and moves. They are paywalled, but here are the new bits of information: Strawberry is more expensive and slow at inference time, but can solve complex problems on the first try without hallucinations. It seems to be an application or extension of process supervision. Its main purpose is to produce synthetic data for Orion, their next big LLM. But now they are also pushing to get a distillation of Strawberry into Cha...
Aug 29, 2024•6 min
People often talk about “solving the alignment problem.” But what is it to do such a thing? I wanted to clarify my thinking about this topic, so I wrote up some notes. In brief, I’ll say that you’ve solved the alignment problem if you’ve: avoided a bad form of AI takeover, built the dangerous kind of superintelligent AI agents, gained access to the main benefits of superintelligence, and become able to elicit some significant portion of those benefits from some of the superintelligent AI agents ...
Aug 28, 2024•1 hr 39 min
In the past two years there has been increased interest in formal verification-based approaches to AI safety. Formal verification is a sub-field of computer science that studies how guarantees may be derived by deduction on fully-specified rule-sets and symbol systems. By contrast, the real world is a messy place that can rarely be straightforwardly represented in a reductionist way. In particular, physics, chemistry and biology are all complex sciences which do not have anything like complete s...
Aug 27, 2024•42 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I’ll argue this could be hard even if you caught your AI red-handed trying to escape. Imagine you're running an AI lab at ...
Aug 27, 2024•7 min
For many products, we face a choice of who to hold liable for harms that would not have occurred if not for the existence of the product. For instance, if a person uses a gun in a school shooting that kills a dozen people, there are many legal persons who in principle could be held liable for the harm: The shooter themselves, for obvious reasons. The shop that sold the shooter the weapon. The company that designs and manufactures the weapon. Which one of these is the best? I'll offer a brief and...
Aug 23, 2024•8 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours. Who are we? We’re the main team at Google DeepMind working on technical approaches to existential risk...
Aug 21, 2024•19 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a link post. Is AI takeover like a nuclear meltdown? A coup? A plane crash? My day job is thinking about safety measures that aim to reduce catastrophic risks from AI (especially risks from egregious misalignment). The two main themes of this work are the design of such measures (what's the space of techniques we might expect to be affordable and effective) and their evaluation (how do we decide which sa...
Aug 15, 2024•20 min
[This article was originally published on Dan Elton's blog, More is Different.] Cerebrolysin is an unregulated medical product made from enzymatically digested pig brain tissue. Hundreds of scientific papers claim that it boosts BDNF, stimulates neurogenesis, and can help treat numerous neural diseases. It is widely used by doctors around the world, especially in Russia and China. A recent video of Bryan Johnson injecting Cerebrolysin has over a million views on X and 570,000 views on YouTube. T...
Aug 13, 2024•38 min
This work was produced at Apollo Research, based on initial research done at MATS. LayerNorm is annoying for mechanistic interpretability research (“[...] reason #78 for why interpretability researchers hate LayerNorm” – Anthropic, 2023). Here's a Hugging Face link to a GPT2-small model without any LayerNorm. The final model is only slightly worse than a GPT2 with LayerNorm[1]: Dataset | Original GPT2 | Fine-tuned GPT2 with LayerNorm | Fine-tuned GPT without LayerNorm; OpenWebText (ce_loss) | 3.095 | 2.989 | 3.014 (...
Aug 10, 2024•23 min
This is slightly old news at this point, but: as part of MIRI's recent strategy pivot, they've eliminated the Agent Foundations research team. I've been out of a job for a little over a month now. Much of my research time in the first half of the year was eaten up by engaging with the decision process that resulted in this, and later, applying to grants and looking for jobs. I haven't secured funding yet, but for my own sanity & happiness, I am (mostly) taking a break from worrying about tha...
Aug 09, 2024•4 min
This is a story about a flawed Manifold market, about how easy it is to buy significant objective-sounding publicity for your preferred politics, and about why I've downgraded my respect for all but the largest prediction markets. I've had a Manifold account for a while, but I didn't use it much until I saw and became irked by this market on the conditional probabilities of a Harris victory, split by VP pick. Jeb Bush? Really? That's not even a fun kind of wishful thinking for anyone. Please cla...
Aug 08, 2024•4 min
Cross-posted from Substack. 1. And the sky opened, and from the celestial firmament descended a cube of ivory the size of a skyscraper, lifted by ten thousand cherubim and seraphim. And the cube slowly landed among the children of men, crushing the frail metal beams of the Golden Gate Bridge under its supernatural weight. On its surface were inscribed the secret instructions that would allow humanity to escape the imminent AI apocalypse. And these instructions were… On July 30th, 2024: print a p...
Aug 07, 2024•16 min
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. What the heck is up with “corrigibility”? For most of my career, I had a sense that it was a grab-bag of properties that seemed nice in theory but hard to get in practice, perhaps due to being incompatible with agency. Then, last year, I spent some time revisiting my perspective, and I concluded that I had been deeply confused by what corrigibility even was. I now think that corrigibility is a single, intuitive...
Aug 07, 2024•20 min
Figure 1. Image generated by DALL-E 3 to represent the concept of self-other overlap. Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. Summary In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting th...
Aug 07, 2024•23 min
TL;DR: Your discernment in a subject often improves as you dedicate time and attention to that subject. The space of possible subjects is huge, so on average your discernment is terrible, relative to what it could be. This is a serious problem if you create a machine that does everyone's job for them. See also: Reality has a surprising amount of detail. (You lack awareness of how bad your staircase is and precisely how your staircase is bad.) You don't know what you don't know. You forget your o...
Aug 07, 2024•9 min
This is a link post. Content warning: About an IRL death. Today's post isn’t so much an essay as a recommendation for two bodies of work on the same topic: Tom Mahood's blog posts and Adam “KarmaFrog1” Marsland's videos on the 2010 disappearance of Bill Ewasko, who went for a day hike in Joshua Tree National Park and dropped out of contact. 2010 – Bill Ewasko goes missing Tom Mahood's writeups on the search [Blog post, website goes down sometimes so if the site doesn’t work, check the internet ar...
Aug 07, 2024•22 min
NB. I am on the Google DeepMind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position. “It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout”: Elhage et...
Aug 07, 2024•30 min
This is a link post. Google DeepMind reports on a system for solving mathematical problems that allegedly is able to give complete solutions to four of the six problems on the 2024 IMO, putting it near the top of the silver-medal category. Well, actually, two systems for solving mathematical problems: AlphaProof, which is more general-purpose, and AlphaGeometry, which is specifically for geometry problems. (This is AlphaGeometry 2; they reported earlier this year on a previous version of AlphaGeo...
Jul 30, 2024•4 min
This is a link post. What is an agent? It's a slippery concept with no commonly accepted formal definition, but informally the concept seems to be useful. One angle on it is Dennett's Intentional Stance: we think of an entity as being an agent if we can more easily predict it by treating it as having some beliefs and desires which guide its actions. Examples include cats and countries, but the central case is humans. The world is shaped significantly by the choices agents make. What might agents ...
Jul 29, 2024•24 min
(Crossposted from Twitter) I'm skeptical that Universal Basic Income can get rid of grinding poverty, since somehow humanity's 100-fold productivity increase (since the days of agriculture) didn't eliminate poverty. Some of my friends reply, "What do you mean, poverty is still around? 'Poor' people today, in Western countries, have a lot to legitimately be miserable about, don't get me wrong; but they also have amounts of clothing and fabric that only rich merchants could afford a thousand years...
Jul 27, 2024•16 min
Eliezer Yudkowsky periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either: Saying "well, if this assumption doesn't hold, we're doomed, so we might as well assume it's true." Worse: coming up with cope-y reasons to assume that the assumption isn't even questionable at all. It's just a pretty reasonable worldview. Sometimes the questionable plan is "an alignment scheme, which Eliezer thinks avoids the hard part of the pr...
Jul 19, 2024•13 min
This post was inspired by some talks at the recent LessOnline conference including one by LessWrong user “Gene Smith”. Let's say you want to have a “designer baby”. Genetically extraordinary in some way — super athletic, super beautiful, whatever. 6’5”, blue eyes, with a trust fund. Ethics aside[1], what would be necessary to actually do this? Fundamentally, any kind of “superbaby” or “designer baby” project depends on two steps: 1.) figure out what genes you ideally want; 2.) create an embryo w...
Jul 15, 2024•19 min