AXRP - the AI X-risk Research Podcast

Daniel Filan•axrp.net

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

Last refreshed: June 27th, 2025 at 12:02 PM

Episodes

Survey, store closing, Patreon

Very brief survey: bit.ly/axrpsurvey2023 Store is closing in a week! Link: store.axrp.net/ Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast

Jun 28, 2023•4 min

22 - Shard Theory with Quintin Pope

What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are f...

Jun 15, 2023•3 hr 28 min

21 - Interpretability for Engineers with Stephen Casper

Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural networks. But how do we know if meaningful progress is actually being made? What should we want out of these tools? In this episode, I speak to Stephen Casper about these questions, as well as about a benchmark he's co-developed to evaluate whether interpretability tools can find 'Trojan horses' hidden inside neural nets. Patreon: patreon.com/axrpodcast...

May 02, 2023•1 hr 56 min

20 - 'Reform' AI Alignment with Scott Aaronson

How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with Scott Aaronson about his views on how to make progress in AI alignment, as well as his work on watermarking the output of language models, and how he moved from a background in quantum complexity theory to working on AI. Note: this episode was recorded before this story ( vice.com/en/article/pkadgm/man-dies-by-suicide-after-talking-with-ai-chatbot...

Apr 12, 2023•2 hr 28 min

Store, Patreon, Video

Store: https://store.axrp.net/ Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Video: https://www.youtube.com/watch?v=kmPFjpEibu0

Feb 07, 2023•3 min

19 - Mechanistic Interpretability with Neel Nanda

How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope at getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking. Topics we discuss, and timestamps: - 00:01:05 - What is mechanistic interpretability? - 00:24:16 - Types of AI cognition - 00:54:27 - Automating mecha...

Feb 04, 2023•3 hr 53 min

New podcast - The Filan Cabinet

I have a new podcast, where I interview whoever I want about whatever I want. It's called "The Filan Cabinet", and you can find it wherever you listen to podcasts. The first three episodes are about pandemic preparedness, God, and cryptocurrency. For more details, check out the podcast website ( thefilancabinet.com ), or search "The Filan Cabinet" in your podcast app.

Oct 13, 2022•1 min

18 - Concept Extrapolation with Stuart Armstrong

Concept extrapolation is the idea of taking concepts an AI has about the world - say, "mass" or "does this picture contain a hot dog" - and extending them sensibly to situations where things are different - like learning that the world works via special relativity, or seeing a picture of a novel sausage-bread combination. For a while, Stuart Armstrong has been thinking about concept extrapolation and how it relates to AI alignment. In this episode, we discuss where his thoughts are at on this to...

Sep 03, 2022•1 hr 46 min

17 - Training for Very High Reliability with Daniel Ziegler

Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually do this well, especially when we can't use a computer program to say what a 'failure' is? In this episode, I speak with Daniel Ziegler about his research group's efforts to try doing this with present-day language models, and what they learned. Listeners beware: this episode contains a spoiler for the Animorphs franchise around minute 41 (in the 'F...

Aug 21, 2022•1 hr 1 min

16 - Preparing for Debate AI with Geoffrey Irving

Many people in the AI alignment space have heard of AI safety via debate - check out AXRP episode 6 ( axrp.net/episode/2021/04/08/episode-6-debate-beth-barnes.html ) if you need a primer. But how do we get language models to the stage where they can usefully implement debate? In this episode, I talk to Geoffrey Irving about the role of language models in AI safety, as well as three projects he's done that get us closer to making debate happen: using language models to find flaws in themselves, g...

Jul 01, 2022•1 hr 5 min

15 - Natural Abstractions with John Wentworth

Why does anybody care about natural abstractions? Do they somehow relate to math, or value learning? How do E. coli bacteria find sources of sugar? All these questions and more will be answered in this interview with John Wentworth, where we talk about his research plan of understanding agency via natural abstractions. Topics we discuss, and timestamps: - 00:00:31 - Agency in E. Coli - 00:04:59 - Agency in financial markets - 00:08:44 - Inferring agency in real-world systems - 00:16:11 - Selecti...

May 23, 2022•1 hr 37 min

14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Late last year, Vanessa Kosoy and Alexander Appel published some research under the heading of "Infra-Bayesian physicalism". But wait - what was infra-Bayesianism again? Why should we care? And what does any of this have to do with physicalism? In this episode, I talk with Vanessa Kosoy about these questions, and get a technical overview of how infra-Bayesian physicalism works and what its implications are. Topics we discuss, and timestamps: - 00:00:48 - The basics of infra-Bayes - 00:08:32 - An...

Apr 05, 2022•1 hr 48 min

13 - First Principles of AGI Safety with Richard Ngo

How should we think about artificial general intelligence (AGI), and the risks it might pose? What constraints exist on technical solutions to the problem of aligning superhuman AI systems with human intentions? In this episode, I talk to Richard Ngo about his report analyzing AGI safety from first principles, and recent conversations he had with Eliezer Yudkowsky about the difficulty of AI alignment. Topics we discuss, and timestamps: - 00:00:40 - The nature of intelligence and AGI - 00:01:18 -...

Mar 31, 2022•1 hr 34 min

12 - AI Existential Risk with Paul Christiano

Why would advanced AI systems pose an existential risk, and what would it look like to develop safer systems? In this episode, I interview Paul Christiano about his views of how AI could be so dangerous, what bad AI scenarios could look like, and what he thinks about various techniques to reduce this risk. Topics we discuss, and timestamps: - 00:00:38 - How AI may pose an existential threat - 00:13:36 - AI timelines - 00:24:49 - Why we might build risky AI - 00:33:58 - Takeoff speeds - 00:51:33 ...

Dec 02, 2021•2 hr 50 min

11 - Attainable Utility and Power with Alex Turner

Many scary stories about AI involve an AI system deceiving and subjugating humans in order to gain the ability to achieve its goals without us stopping it. This episode's guest, Alex Turner, will tell us about his research analyzing the notions of "attainable utility" and "power" that underlie these stories, so that we can better evaluate how likely they are and how to prevent them. Topics we discuss: - Side effects minimization - Attainable Utility Preservation (AUP) - AUP and alignment - Power...

Sep 25, 2021•1 hr 28 min

10 - AI's Future and Impacts with Katja Grace

When going about trying to ensure that AI does not cause an existential catastrophe, it's likely important to understand how AI will develop in the future, and why exactly it might or might not cause such a catastrophe. In this episode, I interview Katja Grace, researcher at AI Impacts, who's done work surveying AI researchers about when they expect superhuman AI to be reached, collecting data about how rapidly AI tends to progress, and thinking about the weak points in arguments that AI could b...

Jul 23, 2021•2 hr 3 min

9 - Finite Factored Sets with Scott Garrabrant

Being an agent can get loopy quickly. For instance, imagine that we're playing chess and I'm trying to decide what move to make. Your next move influences the outcome of the game, and my guess of that influences my move, which influences your next move, which influences the outcome of the game. How can we model these dependencies in a general way, without baking in primitive notions of 'belief' or 'agency'? Today, I talk with Scott Garrabrant about his recent work on finite factored sets that ai...

Jun 24, 2021•1 hr 39 min

8 - Assistance Games with Dylan Hadfield-Menell

How should we think about the technical problem of building smarter-than-human AI that does what we want? When and how should AI systems defer to us? Should they have their own goals, and how should those goals be managed? In this episode, Dylan Hadfield-Menell talks about his work on assistance games that formalizes these questions. The first couple years of my PhD program included many long conversations with Dylan that helped shape how I view AI x-risk research, so it was great to have anothe...

Jun 08, 2021•2 hr 23 min

7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra

If you want to shape the development and forecast the consequences of powerful AI technology, it's important to know when it might appear. In this episode, I talk to Ajeya Cotra about her draft report "Forecasting Transformative AI from Biological Anchors" which aims to build a probabilistic model to answer this question. We talk about a variety of topics, including the structure of the model, what the most important parts are to get right, how the estimates should shape our behaviour, and Ajeya...

May 28, 2021•1 min

7 - Side Effects with Victoria Krakovna

One way of thinking about how AI might pose an existential threat is by taking drastic actions to maximize its achievement of some objective function, such as taking control of the power supply or the world's computers. This might suggest a mitigation strategy of minimizing the degree to which AI systems have large effects on the world that are not absolutely necessary for achieving their objective. In this episode, Victoria Krakovna talks about her research on quantifying and minimizing side ef...

May 14, 2021•1 hr 19 min

6 - Debate and Imitative Generalization with Beth Barnes

One proposal to train AIs that can be useful is to have ML models debate each other about the answer to a human-provided question, where the human judges which side has won. In this episode, I talk with Beth Barnes about her thoughts on the pros and cons of this strategy, what she learned from seeing how humans behaved in debate protocols, and how a technique called imitative generalization can augment debate. Those who are already quite familiar with the basic proposal might want to skip past t...

Apr 08, 2021•1 hr 59 min

5 - Infra-Bayesianism with Vanessa Kosoy

The theory of sequential decision-making has a problem: how can we deal with situations where we have some hypotheses about the environment we're acting in, but its exact form might be outside the range of possibilities we can consider? Relatedly, how do we deal with situations where the environment can simulate what we'll do in the future, and put us in better or worse situations now depending on what we'll do then? Today's episode features Vanessa Kosoy talking about infra-Bayesianism...

Mar 10, 2021•1 hr 24 min

4 - Risks from Learned Optimization with Evan Hubinger

In machine learning, typically optimization is done to produce a model that performs well according to some metric. Today's episode features Evan Hubinger talking about what happens when the learned model itself is doing optimization in order to perform well, how the goals of the learned model could differ from the goals we used to select the learned model, and what would happen if they did differ. Link to the paper - Risks from Learned Optimization in Advanced Machine Learning Systems: arxiv.or...

Feb 17, 2021•2 hr 14 min

3 - Negotiable Reinforcement Learning with Andrew Critch

In this episode, I talk with Andrew Critch about negotiable reinforcement learning: what happens when two people (or organizations, or what have you) who have different beliefs and preferences jointly build some agent that will take actions in the real world. In the paper we discuss, it's proven that the only way to make such an agent Pareto optimal - that is, have it not be the case that there's a different agent that both people would prefer to use instead - is to have it preferentially optimi...

Dec 11, 2020•58 min

2 - Learning Human Biases with Rohin Shah

One approach to creating useful AI systems is to watch humans doing a task, infer what they're trying to do, and then try to do that well. The simplest way to infer what the humans are trying to do is to assume there's one goal that they share, and that they're optimally achieving the goal. This has the problem that humans aren't actually optimal at achieving the goals they pursue. We could instead code in the exact way in which humans behave suboptimally, except that we don't know that either. ...

Dec 11, 2020•1 hr 9 min

1 - Adversarial Policies with Adam Gleave

In this episode, Adam Gleave and I talk about adversarial policies. Basically, in current reinforcement learning, people train agents that act in some kind of environment, sometimes an environment that contains other agents. For instance, you might train agents that play sumo with each other, with the objective of making them generally good at sumo. Adam's research looks at the case where all you're trying to do is make an agent that defeats one specific other agent: how easy is it, and what ha...

Dec 11, 2020•59 min

Hosted on Libsyn