In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservativism? Listen to find out. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp...
Jun 15, 2025•1 hr 41 min
Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axr...
Jun 06, 2025•2 hr 14 min
What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html Topics we discuss, and timestamps: 0:00:41 APD basics 0:07:57 Faithf...
Jun 03, 2025•2 hr 16 min
How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-in...
Mar 28, 2025•2 hr 36 min
In this episode, I chat with David Duvenaud about two topics he's been thinking about: firstly, a paper he wrote about evaluating whether or not frontier models can sabotage human decision-making or monitoring of the same models; and secondly, the difficult situation humans find themselves in in a post-AGI future, even if AI is aligned with human intentions. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/03/01/episode-38_...
Mar 01, 2025•21 min
The Future of Life Institute is one of the oldest and most prominent organizations in the AI existential safety space, working on such topics as the AI pause open letter and how the EU AI Act can be improved. Metaculus is one of the premier forecasting sites on the internet. Behind both of them lies one man: Anthony Aguirre, who I talk with in this episode. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/02/09/episode-38_7-...
Feb 09, 2025•23 min
Typically this podcast talks about how to avert destruction from AI. But what would it take to ensure AI promotes human flourishing as well as it can? Is alignment to individuals enough, and if not, where do we go from here? In this episode, I talk with Joel Lehman about these questions. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/01/24/episode-38_6-joel-lehman-positive-visions-of-ai.html FAR.AI: https://far.ai/ FAR.AI...
Jan 24, 2025•15 min
Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode Adrià Garriga-Alonso talks about his work trying to answer this question. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/01/20/episode-38_5-adria-garriga-alonso-detecting-ai-scheming.html FAR.AI: https://far.ai/ FAR.A...
Jan 20, 2025•28 min
AI researchers often complain about the poor coverage of their work in the news media. But why is this happening, and how can it be fixed? In this episode, I speak with Shakeel Hashim about the resource constraints facing AI journalism, the disconnect between journalists' and AI researchers' views on transformative AI, and efforts to improve the state of AI journalism, such as Tarbell and Shakeel's newsletter, Transformer. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrp...
Jan 05, 2025•24 min
Lots of people in the AI safety space worry about models being able to make deliberate, multi-step plans. But can we already see this in existing neural nets? In this episode, I talk with Erik Jenner about his work looking at internal look-ahead within chess-playing neural networks. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/12/12/episode-38_3-erik-jenner-learned-look-ahead.html FAR.AI: https://far.ai/ FAR.AI on X...
Dec 12, 2024•24 min
The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge". Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/12...
Dec 01, 2024•1 hr 46 min
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/11/27/38_2-jesse-hoogland-singular-learning-theory.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter)...
Nov 27, 2024•18 min
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/11/16/episode-38_1-alan-chan-agent-infrastructure.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farai...
Nov 16, 2024•25 min
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/11/14/episode-38_0-zhijing-jin-llms-causality-multi-agent-systems.html FAR.AI: ...
Nov 14, 2024•23 min
Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when the above trends might hit an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://...
Oct 04, 2024•1 hr 44 min
Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/09/29/episode-36-...
Sep 29, 2024•1 hr 48 min
Patreon: https://www.patreon.com/axrpodcast MATS: https://www.matsprogram.org Note: I'm employed by MATS, but they're not paying me to make this video.
Sep 28, 2024•6 min
How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat with Peter Hase about his research into these questions. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/...
Aug 24, 2024•2 hr 17 min
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html Topics we discuss, and timestamps: 0:00:37 - What is METR? 0:02:44 - W...
Jul 28, 2024•2 hr 14 min
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/06/1...
Jun 12, 2024•1 hr 41 min
What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: axrp.net/episode/2024/05/30/episode-32-understanding-agency-jan-kulveit.html Topics we discuss, and timestamps: 0:00:47 - What is active inference? 0:15:14 - Preferences in active i...
May 30, 2024•2 hr 22 min
What's going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:26 - What is singular learning theory? 0:16:00...
May 07, 2024•2 hr 32 min
Top labs use various forms of "safety training" on models before their release to make sure they don't do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don't get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:38 - Fine-tuning away safety training 0:13:50 - Dangers of open LLMs vs i...
Apr 30, 2024•2 hr 16 min
In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that gen...
Apr 25, 2024•2 hr 14 min
How should the law govern AI? Those concerned about existential risks often push either for bans or for regulations meant to ensure that AI is developed safely - but another approach is possible. In this episode, Gabriel Weil talks about his proposal to modify tort law to enable people to sue AI companies for disasters that are "nearly catastrophic". Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:35 - The basic idea 0:20:36 - Tort law vs regul...
Apr 17, 2024•1 hr 58 min
A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:31 - What is AI contro...
Apr 11, 2024•2 hr 56 min
The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democratize AI? And how should we balance benefits and dangers of open-sourcing powerful AI systems such as large language models? In this episode, I speak with Elizabeth Seger about her research on these questions. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: - 0:00:40 - What kinds of AI? - 0:01:30 - De...
Nov 26, 2023•1 hr 57 min
Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage their militaries, or simply that many powerful AIs have their own wills. At any rate, it seems valuable for them to be able to cooperatively work together and minimize pointless conflict. How do we ensure that AIs behave this way - and what do we need to learn about how rational agents interact to make that more clear? In this episode, I'll be speaki...
Oct 03, 2023•3 hr 2 min
Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers, attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces. Patreon: patreon.com/...
Jul 27, 2023•2 hr 8 min
Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that that will alert us to potential treacherous turns. We talk about the core problems of relating these mechanistic anomalies to bad behaviour, as well as the paper "Formalizing the presumption o...
Jul 27, 2023•2 hr 6 min