https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
Mar 08, 2023•41 min
https://www.lesswrong.com/posts/3RSq3bfnzuL3sp46J/acausal-normalcy Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This post is also available on the EA Forum. Summary: Having thought a bunch about acausal trade, and proven some theorems relevant to its feasibility, I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic. I say this to be comforting rather than dismissive; if it...
Mar 06, 2023•16 min
https://www.lesswrong.com/posts/RryyWNmJNnLowbhfC/please-don-t-throw-your-mind-away [Warning: the following dialogue contains an incidental spoiler for "Music in Human Evolution" by Kevin Simler. That post is short, good, and worth reading without spoilers, and this post will still be here if you come back later. It's also possible to get the point of this post by skipping the dialogue and reading the other sections.] Pretty often, talking to someone who's arriving to the existential risk / AGI...
Mar 01, 2023•33 min
https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism There is a lot of disagreement and confusion about the feasibility and risks associated with automating alignment research. Some see it as the default path toward building aligned AI, while others expect limited benefit from near term systems, expecting the ability to significantly speed up progress to appear well after misalignment and deception. Furthermore, progress in this area may directly shorten timelines or enable the creation o...
Feb 15, 2023•1 hr 17 min
https://www.lesswrong.com/posts/CYN7swrefEss4e3Qe/childhoods-of-exceptional-people This is a linkpost for https://escapingflatland.substack.com/p/childhoods Let’s start with one of those insights that are as obvious as they are easy to forget: if you want to master something, you should study the highest achievements of your field. If you want to learn writing, read great writers, etc. But this is not what parents usually do when they think about how to educate their kids. The default for a pare...
Feb 14, 2023•28 min
https://www.lesswrong.com/posts/NJYmovr9ZZAyyTBwM/what-i-mean-by-alignment-is-in-large-part-about-making Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. (Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.) I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at al...
Feb 13, 2023•5 min
https://www.lesswrong.com/posts/NRrbJJWnaSorrqvtZ/on-not-getting-contaminated-by-the-wrong-obesity-ideas A Chemical Hunger, a series by the authors of the blog Slime Mold Time Mold (SMTM), argues that the obesity epidemic is entirely caused by environmental contaminants. In my last post, I investigated SMTM’s main suspect (lithium).[1] This post collects other observations I have made about SMTM’s work, not narrowly related to lithium, but rather focused on the broader thesis of their bl...
Feb 10, 2023•1 hr 13 min
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins. TL;DR Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew) We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.) Many of th...
Feb 08, 2023•34 min
https://www.lesswrong.com/posts/Zp6wG5eQFLGWwcG6j/focus-on-the-places-where-you-feel-shocked-everyone-s Writing down something I’ve found myself repeating in different conversations: If you're looking for ways to help with the whole “the world looks pretty doomed” business, here's my advice: look around for places where we're all being total idiots. Look for places where everyone's fretting about a problem that some part of you thinks it could obviously just solve. Look around for places where s...
Feb 03, 2023•7 min
https://www.lesswrong.com/posts/XPv4sYrKnPzeJASuk/basics-of-rationalist-discourse-1 Introduction This post is meant to be a linkable resource. Its core is a short list of guidelines (you can link directly to the list) that are intended to be fairly straightforward and uncontroversial, for the purpose of nurturing and strengthening a culture of clear thinking, clear communication, and collaborative truth-seeking. "Alas," said Dumbledore, "we all know that what should be, and what is, are two di...
Feb 02, 2023•1 hr 7 min
https://www.lesswrong.com/posts/PCrTQDbciG4oLgmQ5/sapir-whorf-for-rationalists Casus Belli: As I was scanning over my (rather long) list of essays-to-write, I realized that roughly a fifth of them were of the form "here's a useful standalone concept I'd like to reify," à la cup-stacking skills, fabricated options, split and commit, and sazen. Some notable entries on that list (which I name here mostly in the hope of someday coming back and turning them into links) include: red vs. white, wal...
Jan 31, 2023•39 min
https://www.lesswrong.com/posts/pDzdb4smpzT3Lwbym/my-model-of-ea-burnout (Probably somebody else has said most of this. But I personally haven't read it, and felt like writing it down myself, so here we go.) I think that EA [editor note: "Effective Altruism"] burnout usually results from prolonged dedication to satisfying the values you think you should have, while neglecting the values you actually have. Setting aside for the moment what “values” are and what it means to “actually” have one, su...
Jan 31, 2023•9 min
https://www.lesswrong.com/posts/Xo7qmDakxiizG7B9c/the-social-recession-by-the-numbers This is a linkpost for https://novum.substack.com/p/social-recession-by-the-numbers Fewer friends, relationships on the decline, delayed adulthood, trust at an all-time low, and many diseases of despair. The prognosis is not great. One of the most discussed topics online recently has been friendships and loneliness. Ever since the infamous chart showing more people are not having sex than ever before first made...
Jan 25, 2023•23 min
https://www.lesswrong.com/posts/pHfPvb4JMhGDr4B7n/recursive-middle-manager-hell I think Zvi's Immoral Mazes sequence is really important, but comes with more worldview-assumptions than are necessary to make the points actionable. I conceptualize Zvi as arguing for multiple hypotheses. In this post I want to articulate one sub-hypothesis, which I call "Recursive Middle Manager Hell". I'm deliberately not covering some other components of his model.[1] tl;dr: Something weird and kinda horrifying...
Jan 24, 2023•21 min
https://www.lesswrong.com/posts/mfPHTWsFhzmcXw8ta/the-feeling-of-idea-scarcity Here’s a story you may recognize. There's a bright up-and-coming young person - let's call her Alice. Alice has a cool idea. It seems like maybe an important idea, a big idea, an idea which might matter. A new and valuable idea. It’s the first time Alice has come up with a high-potential idea herself, something which she’s never heard in a class or read in a book or what have you. So Alice goes all-in pursuing this id...
Jan 12, 2023•9 min
https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight. When thinking about deception and RLHF training, a simplified threat model is something like this: A model takes some ...
Jan 12, 2023•10 min
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Introduction A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread. In this post I will describe how I think the results and methods in our paper fit into a...
Jan 12, 2023•34 min
https://www.lesswrong.com/posts/qRtD4WqKRYEtT5pi3/the-next-decades-might-be-wild Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. I’d like to thank Simon Grimm and Tamay Besiroglu for feedback and discussions. This post is inspired by What 2026 looks like and an AI vignette workshop guided by Tamay Besiroglu. I think of this post as “what would I expect the world to look like if these timelines (median compute for transformative AI ~2036) were true” or “wha...
Dec 21, 2022•1 hr 19 min
https://www.lesswrong.com/posts/SqjQFhn5KTarfW8v7/lessons-learned-from-talking-to-greater-than-100-academics Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. I’d like to thank MH, Jaime Sevilla and Tamay Besiroglu for their feedback. During my Master's and Ph.D. (still ongoing), I have spoken with many academics about AI safety. These conversations include chats with individual PhDs, poster presentations and talks about AI safety. I think I have learned a l...
Nov 17, 2022•26 min
https://www.lesswrong.com/posts/6LzKRP88mhL9NKNrS/how-my-team-at-lightcone-sometimes-gets-stuff-done Disclaimer: I originally wrote this as a private doc for the Lightcone team. I then showed it to John and he said he would pay me to post it here. That sounded awfully compelling. However, I wanted to note that I’m an early founder who hasn't built anything truly great yet. I’m writing this doc because as Lightcone is growing, I have to take a stance on these questions. I need to design our org t...
Nov 10, 2022•14 min
https://www.lesswrong.com/posts/rP66bz34crvDudzcJ/decision-theory-does-not-imply-that-we-get-to-have-nice Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. (Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.) A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.), is that people think LDT agents are genial and friendly for each other. [1] ...
Nov 08, 2022•57 min
https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like#2022 Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of ...
Nov 07, 2022•37 min
Nov 04, 2022•1 hr 15 min
https://www.lesswrong.com/posts/REA49tL5jsh69X3aM/introduction-to-abstract-entropy#fnrefpi8b39u5hd7 This post, and much of the following sequence, was greatly aided by feedback from the following people (among others): Lawrence Chan, Joanna Morningstar, John Wentworth, Samira Nedungadi, Aysja Johnson, Cody Wild, Jeremy Gillen, Ryan Kidd, Justis Mills and Jonathan Mustin. Illustrations by Anne Ore. Introduction & motivation In the course of researching optimization, I decided that I ...
Oct 29, 2022•46 min
https://www.lesswrong.com/posts/8vesjeKybhRggaEpT/consider-your-appetite-for-disagreements Poker There was a time about five years ago when I was trying to get good at poker. If you want to get good at poker, one thing you have to do is review hands. Preferably with other people. For example, suppose you have ace king offsuit on the button. Someone in the hijack opens to 3 big blinds preflop. You call. Everyone else folds. The flop is dealt. It's a rainbow Q75. You don't have any flush draws....
Oct 25, 2022•11 min
https://www.lesswrong.com/posts/fFY2HeC9i2Tx8FEnK/my-resentful-story-of-becoming-a-medical-miracle This is a linkpost for https://acesounderglass.com/2022/10/13/my-resentful-story-of-becoming-a-medical-miracle/ You know those health books with “miracle cure” in the subtitle? The ones that always start with a preface about a particular patient who was completely hopeless until they tried the supplement/meditation technique/healing crystal that the book is based on? These people always start broke...
Oct 21, 2022•24 min
https://www.lesswrong.com/posts/CKgPFHoWFkviYz7CB/the-redaction-machine On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam. In the heart of the machine was Jane, a person of the early 21st century. From her perspective there was no transition. One moment she had been in the yea...
Oct 02, 2022•59 min
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all ti...
Sep 27, 2022•3 hr 8 min
https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values TL;DR: We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry....
Sep 22, 2022•1 hr 9 min
https://www.lesswrong.com/posts/AfH2oPHCApdKicM4m/two-year-update-on-my-personal-ai-timelines#fnref-fwwPpQFdWM6hJqwuY-12 Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. I worked on my draft report on biological anchors for forecasting AI timelines mainly between ~May 2019 (three months after the release of GPT-2) and ~Jul 2020 (a month after the release of GPT-3), and posted it on LessWrong in Sep 2020 after an internal review process. At the time, my bott...
Sep 22, 2022•39 min