AXRP - the AI X-risk Research Podcast

Daniel Filan · 62 episodes

Science · Technology

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

Episodes

2 hr 32 min
Feb 18, 2026 · Episode 49
Caspar Oesterheld on Program Equilibrium

How does game theory work when everyone is a computer program who can read everyone else's source code? This is the problem of 'program equilibria'. In this episode, I talk with Caspar Oesterheld on work he's done on equilibria of programs that simulate each other, and how robust these equilibria are. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2026/02/18/episode-49-caspar-oesterheld-program-equilibrium.html Note from Caspar on 2:00:06: At least given my current interpretation of what you say here, my answer is wrong. What actually happens is that we're just back in the uncorrelated case. Basically my simulations will be a simulated repeated game in which everything is correlated _because I feed you my random sequence_ and your simulations will be a repeated game where everything is correlated. Halting works the same as usual. But of course what we end up actually playing will be uncorrelated. We discuss something like this later in the episode. Topics we discuss, and timestamps: 0:00:44 Program equilibrium basics 0:14:20 Desiderata for program equilibria 0:24:35 Why program equilibrium matters 0:33:35 Prior work: reachable equilibria and proof-based approaches 0:53:26 The basic idea of Robust Program Equilibrium 1:07:47 Are ϵGroundedπBots inefficient? 
1:15:06 Compatibility of proof-based and simulation-based program equilibria 1:18:32 Cooperating against CooperateBot, and how to avoid it 1:44:43 Making better simulation-based bots 2:01:22 Characterizing simulation-based program equilibria 2:21:24 Follow-up work 2:29:49 Following Caspar's research Links for Caspar: Academic website: https://www.andrew.cmu.edu/user/coesterh/ Google Scholar: https://scholar.google.com/citations?user=xeEcRjkAAAAJ&hl=en Blog: https://casparoesterheld.com/ X / Twitter: https://x.com/c_oesterheld Research we discuss: Robust program equilibrium: https://link.springer.com/article/10.1007/s11238-018-9679-3 Characterising Simulation-Based Program Equilibria: https://arxiv.org/abs/2412.14570 Manifold open-source prisoner's dilemma tournament: https://manifold.markets/IsaacKing/whi

2 hr 5 min
Feb 15, 2026 · Episode 48
Guive Assadi on AI Property Rights

In this episode, Guive Assadi argues that we should give AIs property rights, so that they are integrated in our system of property and come to rely on it. The claim is that this means that AIs would not kill or steal from humans, because that would undermine the whole property system, which would be extremely valuable to them. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2026/02/15/episode-48-guive-assadi-ai-property-rights.html   Topics we discuss, and timestamps: 0:00:28 AI property rights 0:08:01 Why not steal from and kill humans 0:15:25 Why AIs may fear it could be them next 0:20:56 AI retirement 0:23:28 Could humans be upgraded to stay useful? 0:26:41 Will AI progress continue? 0:30:00 Why non-obsoletable AIs may still not end human property rights 0:38:35 Why make AIs with property rights? 0:48:01 Do property rights incentivize alignment? 0:50:09 Humans and non-human property rights 1:02:18 Humans and non-human bodily autonomy 1:16:59 Step changes in coordination ability 1:24:39 Acausal coordination 1:32:37 AI, humans, and civilizations with different technology levels 1:41:39 The case of British settlers and Tasmanians 1:47:22 Non-total expropriation 1:53:47 How Guive thinks x-risk could happen, and other loose ends 2:03:46 Following Guive's work   Guive on Substack: https://guive.substack.com/ Guive on X/Twitter: https://x.com/GuiveAssadi   Research we discuss: The Case for AI Property Rights: https://guive.substack.com/p/the-case-for-ai-property-rights AXRP Episode 44 - Peter Salib on AI Rights for Human Safety: https://axrp.net/episode/2025/06/28/episode-44-peter-salib-ai-rights-human-safety.html AI Rights for Human Safety (by Salib and Goldstein): https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4913167 We don't trade with ants: https://worldspiritsockpuppet.substack.com/p/we-dont-trade-with-ants Alignment Fine-tuning is Character Writing (on Claude as a techy philosophy 
SF-dwelling type): https://guive.substack.com/p/alignment-fine-tuning-is-character

1 hr 47 min
Jan 2, 2026 · Episode 47
David Rein on METR Time Horizons

When METR says something like "Claude Opus 4.5 has a 50% time horizon of 4 hours and 50 minutes", what does that mean? In this episode David Rein, METR researcher and co-author of the paper "Measuring AI ability to complete long tasks", talks about METR's work on measuring time horizons, the methodology behind those numbers, and what work remains to be done in this domain. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2026/01/03/episode-47-david-rein-metr-time-horizons.html Topics we discuss, and timestamps: 0:00:32 Measuring AI Ability to Complete Long Tasks 0:10:54 The meaning of "task length" 0:19:27 Examples of intermediate and hard tasks 0:25:12 Why the software engineering focus 0:32:17 Why task length as difficulty measure 0:46:32 Is AI progress going superexponential? 0:50:58 Is AI progress due to increased cost to run models? 0:54:45 Why METR measures model capabilities 1:04:10 How time horizons relate to recursive self-improvement 1:12:58 Cost of estimating time horizons 1:16:23 Task realism vs mimicking important task features 1:19:50 Excursus on "Inventing Temperature" 1:25:46 Return to task realism discussion 1:33:53 Open questions on time horizons Links for METR: Main website: https://metr.org/ X/Twitter account: https://x.com/METR_Evals/ Research we discuss: Measuring AI Ability to Complete Long Tasks: https://arxiv.org/abs/2503.14499 RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts: https://arxiv.org/abs/2411.15114 HCAST: Human-Calibrated Autonomy Software Tasks: https://arxiv.org/abs/2503.17354 Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity: https://arxiv.org/abs/2507.09089 Anthropic Economic Index: Tracking AI's role in the US and global economy: https://www.anthropic.com/research/anthropic-economic-index-september-2025-report Bridging RL Theory and Practice with the Effective Horizon (i.e. 
the Cassidy Laidlaw paper): https://arxiv.org/abs/2304.09853 How Does Time Horizon Vary Across Domains?: https://metr.org/blog/2025-07-14-how-does-time-h

2 hr 5 min
Aug 7, 2025 · Episode 46
Tom Davidson on AI-enabled Coups

Could AI enable a small group to gain power over a large country, and lock in their power permanently? Often, people worried about catastrophic risks from AI have been concerned with misalignment risks. In this episode, Tom Davidson talks about a risk that could be comparably important: that of AI-enabled coups. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/08/07/episode-46-tom-davidson-ai-enabled-coups.html   Topics we discuss, and timestamps: 0:00:35 How to stage a coup without AI 0:16:17 Why AI might enable coups 0:33:29 How bad AI-enabled coups are 0:37:28 Executive coups with singularly loyal AIs 0:48:35 Executive coups with exclusive access to AI 0:54:41 Corporate AI-enabled coups 0:57:56 Secret loyalty and misalignment in corporate coups 1:11:39 Likelihood of different types of AI-enabled coups 1:25:52 How to prevent AI-enabled coups 1:33:43 Downsides of AIs loyal to the law 1:41:06 Cultural shifts vs individual action 1:45:53 Technical research to prevent AI-enabled coups 1:51:40 Non-technical research to prevent AI-enabled coups 1:58:17 Forethought 2:03:03 Following Tom's and Forethought's research   Links for Tom and Forethought: Tom on X / Twitter: https://x.com/tomdavidsonx Tom on LessWrong: https://www.lesswrong.com/users/tom-davidson-1 Forethought Substack: https://newsletter.forethought.org/ Will MacAskill on X / Twitter: https://x.com/willmacaskill Will MacAskill on LessWrong: https://www.lesswrong.com/users/wdmacaskill   Research we discuss: AI-Enabled Coups: How a Small Group Could Use AI to Seize Power: https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power Seizing Power: The Strategic Logic of Military Coups, by Naunihal Singh: https://muse.jhu.edu/book/31450 Experiment using AI-generated posts on Reddit draws fire for ethics concerns: 
https://retractionwatch.com/2025/04/28/experiment-using-ai-generated-posts-on-reddit-draws-fire-for-ethics-concerns/

1 hr 15 min
Jul 6, 2025 · Episode 45
Samuel Albanie on DeepMind's AGI Safety Approach

In this episode, I chat with Samuel Albanie about the Google DeepMind paper he co-authored called "An Approach to Technical AGI Safety and Security". It covers the assumptions made by the approach, as well as the types of mitigations it outlines. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/07/06/episode-45-samuel-albanie-deepminds-agi-safety-approach.html   Topics we discuss, and timestamps: 0:00:37 DeepMind's Approach to Technical AGI Safety and Security 0:04:29 Current paradigm continuation 0:19:13 No human ceiling 0:21:22 Uncertain timelines 0:23:36 Approximate continuity and the potential for accelerating capability improvement 0:34:29 Misuse and misalignment 0:39:34 Societal readiness 0:43:58 Misuse mitigations 0:52:57 Misalignment mitigations 1:05:20 Samuel's thinking about technical AGI safety 1:14:02 Following Samuel's work   Samuel on Twitter/X: x.com/samuelalbanie   Research we discuss: An Approach to Technical AGI Safety and Security: https://arxiv.org/abs/2504.01849 Levels of AGI for Operationalizing Progress on the Path to AGI: https://arxiv.org/abs/2311.02462 The Checklist: What Succeeding at AI Safety Will Involve: https://sleepinyourhat.github.io/checklist/ Measuring AI Ability to Complete Long Tasks: https://arxiv.org/abs/2503.14499   Episode art by Hamish Doodles: hamishdoodles.com

3 hr 21 min
Jun 28, 2025 · Episode 44
Peter Salib on AI Rights for Human Safety

In this episode, I talk with Peter Salib about his paper "AI Rights for Human Safety", arguing that giving AIs the right to contract, hold property, and sue people will reduce the risk of their trying to attack humanity and take over. He also tells me how law reviews work, in the face of my incredulity. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/06/28/episode-44-peter-salib-ai-rights-human-safety.html   Topics we discuss, and timestamps: 0:00:40 Why AI rights 0:18:34 Why not reputation 0:27:10 Do AI rights lead to AI war? 0:36:42 Scope for human-AI trade 0:44:25 Concerns with comparative advantage 0:53:42 Proxy AI wars 0:57:56 Can companies profitably make AIs with rights? 1:09:43 Can we have AI rights and AI safety measures? 1:24:31 Liability for AIs with rights 1:38:29 Which AIs get rights? 1:43:36 AI rights and stochastic gradient descent 1:54:54 Individuating "AIs" 2:03:28 Social institutions for AI safety 2:08:20 Outer misalignment and trading with AIs 2:15:27 Why statutes of limitations should exist 2:18:39 Starting AI x-risk research in legal academia 2:24:18 How law reviews and AI conferences work 2:41:49 More on Peter moving to AI x-risk research 2:45:37 Reception of the paper 2:53:24 What publishing in law reviews does 3:04:48 Which parts of legal academia focus on AI 3:18:03 Following Peter's research   Links for Peter: Personal website: https://www.peternsalib.com/ Writings at Lawfare: https://www.lawfaremedia.org/contributors/psalib CLAIR: https://clair-ai.org/   Research we discuss: AI Rights for Human Safety: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4913167 Will humans and AIs go to war? https://philpapers.org/rec/GOLWAA Infrastructure for AI agents: https://arxiv.org/abs/2501.10114 Governing AI Agents: https://arxiv.org/abs/2501.07913   Episode art by Hamish Doodles: hamishdoodles.com

1 hr 40 min
Jun 15, 2025 · Episode 43
David Lindner on Myopic Optimization with Non-myopic Approval

In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservativism? Listen to find out. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html   Topics we discuss, and timestamps: 0:00:29 What MONA is 0:06:33 How MONA deals with reward hacking 0:23:15 Failure cases for MONA 0:36:25 MONA's capability 0:55:40 MONA vs other approaches 1:05:03 Follow-up work 1:10:17 Other MONA test cases 1:33:47 When increasing time horizon doesn't increase capability 1:39:04 Following David's research   Links for David: Website: https://www.davidlindner.me Twitter / X: https://x.com/davlindner DeepMind Medium: https://deepmindsafetyresearch.medium.com David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner   Research we discuss: MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011 Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training   Episode art by Hamish Doodles: hamishdoodles.com

2 hr 14 min
Jun 6, 2025 · Episode 42
Owain Evans on LLM Psychology

Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html   Topics we discuss, and timestamps: 0:00:37 Why introspection? 0:06:24 Experiments in "Looking Inward" 0:15:11 Why fine-tune for introspection? 0:22:32 Does "Looking Inward" test introspection, or something else? 0:34:14 Interpreting the results of "Looking Inward" 0:44:56 Limitations to introspection? 0:49:54 "Tell me about yourself", and its relation to other papers 1:05:45 Backdoor results 1:12:01 Emergent Misalignment 1:22:13 Why so hammy, and so infrequently evil? 1:36:31 Why emergent misalignment? 1:46:45 Emergent misalignment and other types of misalignment 1:53:57 Is emergent misalignment good news? 
2:00:01 Follow-up work to "Emergent Misalignment" 2:03:10 Reception of "Emergent Misalignment" vs other papers 2:07:43 Evil numbers 2:12:20 Following Owain's research   Links for Owain: Truthful AI: https://www.truthfulai.org Owain's website: https://owainevans.github.io/ Owain's twitter/X account: https://twitter.com/OwainEvans_UK   Research we discuss: Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787 Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120 Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546 Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424 X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852 Taken out of context: On me

2 hr 16 min
Jun 3, 2025 · Episode 41
Lee Sharkey on Attribution-based Parameter Decomposition

What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html   Topics we discuss, and timestamps: 0:00:41 APD basics 0:07:57 Faithfulness 0:11:10 Minimality 0:28:44 Simplicity 0:34:50 Concrete-ish examples of APD 0:52:00 Which parts of APD are canonical 0:58:10 Hyperparameter selection 1:06:40 APD in toy models of superposition 1:14:40 APD and compressed computation 1:25:43 Mechanisms vs representations 1:34:41 Future applications of APD? 1:44:19 How costly is APD? 1:49:14 More on minimality training 1:51:49 Follow-up work 2:05:24 APD on giant chain-of-thought models? 2:11:27 APD and "features" 2:14:11 Following Lee's work   Lee links (Leenks): X/Twitter: https://twitter.com/leedsharkey Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey   Research we discuss: Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926 Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476 Feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis   Episode art by Hamish Doodles: hamishdoodles.com

2 hr 36 min
Mar 28, 2025 · Episode 40
Jason Gross on Compact Proofs and Interpretability

How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-interpretability.html   Topics we discuss, and timestamps: 0:00:40 - Why compact proofs 0:07:25 - Compact Proofs of Model Performance via Mechanistic Interpretability 0:14:19 - What compact proofs look like 0:32:43 - Structureless noise, and why proofs 0:48:23 - What we've learned about compact proofs in general 0:59:02 - Generalizing 'symmetry' 1:11:24 - Grading mechanistic interpretability 1:43:34 - What helps compact proofs 1:51:08 - The limits of compact proofs 2:07:33 - Guaranteed safe AI, and AI for guaranteed safety 2:27:44 - Jason and Rajashree's start-up 2:34:19 - Following Jason's work   Links to Jason: Github: https://github.com/jasongross Website: https://jasongross.github.io Alignment Forum: https://www.alignmentforum.org/users/jason-gross   Links to work we discuss: Compact Proofs of Model Performance via Mechanistic Interpretability: https://arxiv.org/abs/2406.11779 Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations: https://arxiv.org/abs/2410.07476 Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration: https://arxiv.org/abs/2412.03773 Stage-Wise Model Diffing: https://transformer-circuits.pub/2024/model-diffing/index.html Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing Interpretability in Parameter Space: Minimizing Mechanistic Description Length 
with Attribution-based Parameter Decomposition (aka the Apollo paper on APD): htt

20 min
Mar 1, 2025
38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

In this episode, I chat with David Duvenaud about two topics he's been thinking about: firstly, a paper he wrote about evaluating whether or not frontier models can sabotage human decision-making or monitoring of the same models; and secondly, the difficult situation humans find themselves in in a post-AGI future, even if AI is aligned with human intentions.   Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/03/01/episode-38_8-david-duvenaud-sabotage-evaluations-post-agi-future.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch FAR.AI on YouTube: @FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 01:42 - The difficulty of sabotage evaluations 05:23 - Types of sabotage evaluation 08:45 - The state of sabotage evaluations 12:26 - What happens after AGI?   Links: Sabotage Evaluations for Frontier Models: https://arxiv.org/abs/2410.21514 Gradual Disempowerment: https://gradual-disempowerment.ai/   Episode art by Hamish Doodles: hamishdoodles.com

22 min
Feb 9, 2025
38.7 - Anthony Aguirre on the Future of Life Institute

The Future of Life Institute is one of the oldest and most prominent organizations in the AI existential safety space, working on such topics as the AI pause open letter and how the EU AI Act can be improved. Metaculus is one of the premier forecasting sites on the internet. Behind both of them lies one man: Anthony Aguirre, who I talk with in this episode. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/02/09/episode-38_7-anthony-aguirre-future-of-life-institute.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 00:33 - Anthony, FLI, and Metaculus 06:46 - The Alignment Workshop 07:15 - FLI's current activity 11:04 - AI policy 17:09 - Work FLI funds   Links: Future of Life Institute: https://futureoflife.org/ Metaculus: https://www.metaculus.com/ Future of Life Foundation: https://www.flf.org/   Episode art by Hamish Doodles: hamishdoodles.com

15 min
Jan 24, 2025
38.6 - Joel Lehman on Positive Visions of AI

Typically this podcast talks about how to avert destruction from AI. But what would it take to ensure AI promotes human flourishing as well as it can? Is alignment to individuals enough, and if not, where do we go from here? In this episode, I talk with Joel Lehman about these questions. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/01/24/episode-38_6-joel-lehman-positive-visions-of-ai.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 01:12 - Why aligned AI might not be enough 04:05 - Positive visions of AI 08:27 - Improving recommendation systems   Links: Why Greatness Cannot Be Planned: https://www.amazon.com/Why-Greatness-Cannot-Planned-Objective/dp/3319155237 We Need Positive Visions of AI Grounded in Wellbeing: https://thegradientpub.substack.com/p/beneficial-ai-wellbeing-lehman-ngo Machine Love: https://arxiv.org/abs/2302.09248 AI Alignment with Changing and Influenceable Reward Functions: https://arxiv.org/abs/2405.17713   Episode art by Hamish Doodles: hamishdoodles.com

27 min
Jan 20, 2025
38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode Adrià Garriga-Alonso talks about his work trying to answer this question. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/01/20/episode-38_5-adria-garriga-alonso-detecting-ai-scheming.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 01:04 - The Alignment Workshop 02:49 - How to detect scheming AIs 05:29 - Sokoban-solving networks taking time to think 12:18 - Model organisms of long-term planning 19:44 - How and why to study planning in networks   Links: Adrià's website: https://agarri.ga/ An investigation of model-free planning: https://arxiv.org/abs/1901.03559 Model-Free Planning: https://tuphs28.github.io/projects/interpplanning/ Planning in a recurrent neural network that plays Sokoban: https://arxiv.org/abs/2407.15421   Episode art by Hamish Doodles: hamishdoodles.com

24 min
Jan 5, 2025
38.4 - Shakeel Hashim on AI Journalism

AI researchers often complain about the poor coverage of their work in the news media. But why is this happening, and how can it be fixed? In this episode, I speak with Shakeel Hashim about the resource constraints facing AI journalism, the disconnect between journalists' and AI researchers' views on transformative AI, and efforts to improve the state of AI journalism, such as Tarbell and Shakeel's newsletter, Transformer. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2025/01/05/episode-38_4-shakeel-hashim-ai-journalism.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 01:31 - The AI media ecosystem 02:34 - Why not more AI news? 07:18 - Disconnects between journalists and the AI field 12:42 - Tarbell 18:44 - The Transformer newsletter   Links: Transformer (Shakeel's substack): https://www.transformernews.ai/ Tarbell: https://www.tarbellfellowship.org/   Episode art by Hamish Doodles: hamishdoodles.com

20 min
Dec 14, 2024
38.4 - Peter Barnett on Technical Governance at MIRI

The Machine Intelligence Research Institute has recently shifted its focus to "technical governance". But what is that actually, and what are they doing? In this episode, I chat with Peter Barnett about his team's work on studying what evaluations can and cannot do, as well as verifying international agreements on AI development. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/12/14/episode-38_4-peter-barnett-technical-governance-at-miri.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch  FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 01:25 - MIRI's technical governance team 03:43 - Misuse evaluations 06:37 - Misalignment evaluations 11:29 - Verifying international agreements 13:30 - Difficulties in compute monitoring 16:44 - More on MIRI's technical governance team Links: MIRI Technical Governance Team: https://techgov.intelligence.org/ What AI evaluations for preventing catastrophic risks can and cannot do: https://techgov.intelligence.org/research/what-ai-evaluations-for-preventing-catastrophic-risks-can-and-cannot-do Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation: https://arxiv.org/abs/2411.12820 Mechanisms to Verify International Agreements About AI Development: https://techgov.intelligence.org/research/mechanisms-to-verify-international-agreements-about-ai-development Revisiting algorithmic progress: https://epoch.ai/blog/revisiting-algorithmic-progress   Episode art by Hamish Doodles: hamishdoodles.com

23 min
Dec 12, 2024
38.3 - Erik Jenner on Learned Look-Ahead

Lots of people in the AI safety space worry about models being able to make deliberate, multi-step plans. But can we already see this in existing neural nets? In this episode, I talk with Erik Jenner about his work looking at internal look-ahead within chess-playing neural networks. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/12/12/episode-38_3-erik-jenner-learned-look-ahead.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 00:57 - How chess neural nets look into the future 04:29 - The dataset and basic methodology 05:23 - Testing for branching futures? 07:57 - Which experiments demonstrate what 10:43 - How the ablation experiments work 12:38 - Effect sizes 15:23 - X-risk relevance 18:08 - Follow-up work 21:29 - How much planning does the network do?   Research we mention: Evidence of Learned Look-Ahead in a Chess-Playing Neural Network: https://arxiv.org/abs/2406.00877 Understanding the learned look-ahead behavior of chess neural networks (a development of the follow-up research Erik mentioned): https://openreview.net/forum?id=Tl8EzmgsEp Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT: https://arxiv.org/abs/2310.07582   Episode art by Hamish Doodles: hamishdoodles.com

1 hr 45 min
Dec 1, 2024 · Episode 39
Evan Hubinger on Model Organisms of Misalignment

The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge". Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html   Topics we discuss, and timestamps: 0:00:36 - Model organisms and stress-testing 0:07:38 - Sleeper Agents 0:22:32 - Do 'sleeper agents' properly model deceptive alignment? 0:38:32 - Surprising results in "Sleeper Agents" 0:57:25 - Sycophancy to Subterfuge 1:09:21 - How models generalize from sycophancy to subterfuge 1:16:37 - Is the reward editing task valid? 1:21:46 - Training away sycophancy and subterfuge 1:29:22 - Model organisms, AI control, and evaluations 1:33:45 - Other model organisms research 1:35:27 - Alignment stress-testing at Anthropic 1:43:32 - Following Evan's work   Main papers: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566 Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models: https://arxiv.org/abs/2406.10162   Anthropic links: Anthropic's newsroom: https://www.anthropic.com/news Careers at Anthropic: https://www.anthropic.com/careers   Other links: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research: https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1 Simple probes can catch sleeper agents: https://www.anthropic.com/research/probes-catch-sleeper-agents Studying Large Language Model Generalization with Influence Functions: https://arxiv.org/abs/2308.03296 Stress-Testing Capability Elicitation With 
Password-Locked Models [aka model organisms of sandbagging]: https://arxiv.org/abs/2405.19550

18 min
Nov 27, 2024
38.2 - Jesse Hoogland on Singular Learning Theory

You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.   Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/11/27/38_2-jesse-hoogland-singular-learning-theory.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch  FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 00:34 - About Jesse 01:49 - The Alignment Workshop 02:31 - About Timaeus 05:25 - SLT that isn't developmental interpretability 10:41 - The refined local learning coefficient 14:06 - Finding the multigram circuit   Links: Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient: https://arxiv.org/abs/2410.02984 Investigating the learning coefficient of modular addition: hackathon project: https://www.lesswrong.com/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition   Episode art by Hamish Doodles: hamishdoodles.com

24 min
Nov 16, 2024
38.1 - Alan Chan on Agent Infrastructure

Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/11/16/episode-38_1-alan-chan-agent-infrastructure.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch  FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 01:02 - How the Alignment Workshop is 01:32 - Agent infrastructure 04:57 - Why agent infrastructure 07:54 - A trichotomy of agent infrastructure 13:59 - Agent IDs 18:17 - Agent channels 20:29 - Relation to AI control   Links: Alan on Google Scholar: https://scholar.google.com/citations?user=lmQmYPgAAAAJ&hl=en&oi=ao IDs for AI Systems: https://arxiv.org/abs/2406.12137 Visibility into AI Agents: https://arxiv.org/abs/2401.13138   Episode art by Hamish Doodles: hamishdoodles.com

22 min
Nov 14, 2024
38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems

Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/11/14/episode-38_0-zhijing-jin-llms-causality-multi-agent-systems.html FAR.AI: https://far.ai/ FAR.AI on X (aka Twitter): https://x.com/farairesearch FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch The Alignment Workshop: https://www.alignment-workshop.com/   Topics we discuss, and timestamps: 00:35 - How the Alignment Workshop is 00:47 - How Zhijing got interested in causality and natural language processing 03:14 - Causality and alignment 06:21 - Causality without randomness 10:07 - Causal abstraction 11:42 - Why LLM causal reasoning? 13:20 - Understanding LLM causal reasoning 16:33 - Multi-agent systems   Links: Zhijing's website: https://zhijing-jin.com/fantasy/ Zhijing on X (aka Twitter): https://x.com/zhijingjin Can Large Language Models Infer Causation from Correlation?: https://arxiv.org/abs/2306.05836 Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents: https://arxiv.org/abs/2404.16698   Episode art by Hamish Doodles: hamishdoodles.com

1 hr 44 min
Oct 4, 2024Episode 37
Jaime Sevilla on AI Forecasting

Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when the above trends might hit an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/10/04/episode-37-jaime-sevilla-forecasting-ai.html   Topics we discuss, and timestamps: 0:00:38 - The pace of AI progress 0:07:49 - How Epoch AI tracks AI compute 0:11:44 - Why does AI compute grow so smoothly? 0:21:46 - When will we run out of computers? 0:38:56 - Algorithmic improvement 0:44:21 - Algorithmic improvement and scaling laws 0:56:56 - Training data 1:04:56 - Can scaling produce AGI? 1:16:55 - When will AGI arrive? 1:21:20 - Epoch AI 1:27:06 - Open questions in AI forecasting 1:35:21 - Epoch AI and x-risk 1:41:34 - Following Epoch AI's research   Links for Jaime and Epoch AI: Epoch AI: https://epochai.org/ Machine Learning Trends dashboard: https://epochai.org/trends Epoch AI on X / Twitter: https://x.com/EpochAIResearch Jaime on X / Twitter: https://x.com/Jsevillamol   Research we discuss: Training Compute of Frontier AI Models Grows by 4-5x per Year: https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year Optimally Allocating Compute Between Inference and Training: https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training Algorithmic Progress in Language Models [blog post]: https://epochai.org/blog/algorithmic-progress-in-language-models Algorithmic progress in language models [paper]: https://arxiv.org/abs/2403.05812 Training Compute-Optimal Large Language Models [aka the Chinchilla scaling law paper]: https://arxiv.org/abs/2203.15556 Will We Run Out of Data? 
Limits of LLM Scaling Based on Human-Generated Data [blog po

1 hr 48 min
Sep 29, 2024Episode 36
Adam Shai and Paul Riechers on Computational Mechanics

Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/09/29/episode-36-adam-shai-paul-riechers-computational-mechanics.html   Topics we discuss, and timestamps: 0:00:42 - What computational mechanics is 0:29:49 - Computational mechanics vs other approaches 0:36:16 - What world models are 0:48:41 - Fractals 0:57:43 - How the fractals are formed 1:09:55 - Scaling computational mechanics for transformers 1:21:52 - How Adam and Paul found computational mechanics 1:36:16 - Computational mechanics for AI safety 1:46:05 - Following Adam and Paul's research   Simplex AI Safety: https://www.simplexaisafety.com/   Research we discuss: Transformers represent belief state geometry in their residual stream: https://arxiv.org/abs/2405.15943 Transformers represent belief state geometry in their residual stream [LessWrong post]: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why   Episode art by Hamish Doodles: hamishdoodles.com

5 min
Sep 28, 2024
New Patreon tiers + MATS applications

Patreon: https://www.patreon.com/axrpodcast MATS: https://www.matsprogram.org Note: I'm employed by MATS, but they're not paying me to make this video.

2 hr 17 min
Aug 24, 2024Episode 35
Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat with Peter Hase about his research into these questions. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/08/24/episode-35-peter-hase-llm-beliefs-easy-to-hard-generalization.html Topics we discuss, and timestamps: 0:00:36 - NLP and interpretability 0:10:20 - Interpretability lessons 0:32:22 - Belief interpretability 1:00:12 - Localizing and editing models' beliefs 1:19:18 - Beliefs beyond language models 1:27:21 - Easy-to-hard generalization 1:47:16 - What do easy-to-hard results tell us? 1:57:33 - Easy-to-hard vs weak-to-strong 2:03:50 - Different notions of hardness 2:13:01 - Easy-to-hard vs weak-to-strong, round 2 2:15:39 - Following Peter's work Peter on Twitter: https://x.com/peterbhase Peter's papers: Foundational Challenges in Assuring Alignment and Safety of Large Language Models: https://arxiv.org/abs/2404.09932 Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: https://arxiv.org/abs/2111.13654 Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: https://arxiv.org/abs/2301.04213 Are Language Models Rational? 
The Case of Coherence Norms and Belief Revision: https://arxiv.org/abs/2406.03442 The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: https://arxiv.org/abs/2401.06751 Other links: Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): https://arxiv.org/abs/1711.11279 Locating and Editing Factual Associations in GPT (aka the ROME paper): https://arxiv.org/abs/2202.05262 Of nonlinearity and commu

2 hr 14 min
Jul 28, 2024Episode 34
AI Evaluations with Beth Barnes

How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html Topics we discuss, and timestamps: 0:00:37 - What is METR? 0:02:44 - What is an "eval"? 0:14:42 - How good are evals? 0:37:25 - Are models showing their full capabilities? 0:53:25 - Evaluating alignment 1:01:38 - Existential safety methodology 1:12:13 - Threat models and capability buffers 1:38:25 - METR's policy work 1:48:19 - METR's relationships with labs 2:04:12 - Related research 2:10:02 - Roles at METR, and following METR's work Links for METR: METR: https://metr.org METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/ METR - Hiring: https://metr.org/hiring Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/ Other links: Update on ARC's recent eval efforts (contains GPT-4 taskrabbit captcha story) https://metr.org/blog/2023-03-18-update-on-recent-evals/ Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566 Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators Nobody Knows How to Safety-Test AI (Time):

1 hr 41 min
Jun 12, 2024Episode 33
RLHF Problems with Scott Emmons

Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html Topics we discuss, and timestamps: 0:00:33 - Deceptive inflation 0:17:56 - Overjustification 0:32:48 - Bounded human rationality 0:50:46 - Avoiding these problems 1:14:13 - Dimensional analysis 1:23:32 - RLHF problems, in theory and practice 1:31:29 - Scott's research program 1:39:42 - Following Scott's research   Scott's website: https://www.scottemmons.com Scott's X/twitter account: https://x.com/emmons_scott When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747   Other works we discuss: AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752 Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394 Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475 The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693   Episode art by Hamish Doodles: hamishdoodles.com

2 hr 22 min
May 30, 2024Episode 32
Understanding Agency with Jan Kulveit

What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: axrp.net/episode/2024/05/30/episode-32-understanding-agency-jan-kulveit.html Topics we discuss, and timestamps: 0:00:47 - What is active inference? 0:15:14 - Preferences in active inference 0:31:33 - Action vs perception in active inference 0:46:07 - Feedback loops 1:01:32 - Active inference vs LLMs 1:12:04 - Hierarchical agency 1:58:28 - The Alignment of Complex Systems group   Website of the Alignment of Complex Systems group (ACS): acsresearch.org ACS on X/Twitter: x.com/acsresearchorg Jan on LessWrong: lesswrong.com/users/jan-kulveit Predictive Minds: Large Language Models as Atypical Active Inference Agents: arxiv.org/abs/2311.10215   Other works we discuss: Active Inference: The Free Energy Principle in Mind, Brain, and Behavior: https://www.goodreads.com/en/book/show/58275959 Book Review: Surfing Uncertainty: https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/ The self-unalignment problem: https://www.lesswrong.com/posts/9GyniEBaN3YYTqZXn/the-self-unalignment-problem Mitigating generative agent social dilemmas (aka language models writing contracts for Minecraft): https://social-dilemmas.github.io/   Episode art by Hamish Doodles: hamishdoodles.com

2 hr 32 min
May 7, 2024Episode 31
Singular Learning Theory with Daniel Murfet

What's going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:26 - What is singular learning theory? 0:16:00 - Phase transitions 0:35:12 - Estimating the local learning coefficient 0:44:37 - Singular learning theory and generalization 1:00:39 - Singular learning theory vs other deep learning theory 1:17:06 - How singular learning theory hit AI alignment 1:33:12 - Payoffs of singular learning theory for AI alignment 1:59:36 - Does singular learning theory advance AI capabilities? 2:13:02 - Open problems in singular learning theory for AI alignment 2:20:53 - What is the singular fluctuation? 2:25:33 - How geometry relates to information 2:30:13 - Following Daniel Murfet's work The transcript: https://axrp.net/episode/2024/05/07/episode-31-singular-learning-theory-dan-murfet.html Daniel Murfet's twitter/X account: https://twitter.com/danielmurfet Developmental interpretability website: https://devinterp.com Developmental interpretability YouTube channel: https://www.youtube.com/@Devinterp Main research discussed in this episode: - Developmental Landscape of In-Context Learning: https://arxiv.org/abs/2402.02364 - Estimating the Local Learning Coefficient at Scale: https://arxiv.org/abs/2402.03698 - Simple versus Short: Higher-order degeneracy and error-correction: https://www.lesswrong.com/posts/nWRj6Ey8e5siAEXbK/simple-versus-short-higher-order-degeneracy-and-error-1 Other links: - Algebraic Geometry and Statistical Learning Theory (the grey book): https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A - Mathematical 
Theory of Bayesian Statistics (the green book): https://www.routledge.com/Mathematical-Theory-of-Bayesia

2 hr 15 min
Apr 30, 2024Episode 30
AI Security with Jeffrey Ladish

Top labs use various forms of "safety training" on models before their release to make sure they don't do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don't get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:38 - Fine-tuning away safety training 0:13:50 - Dangers of open LLMs vs internet search 0:19:52 - What we learn by undoing safety filters 0:27:34 - What can you do with jailbroken AI? 0:35:28 - Security of AI model weights 0:49:21 - Securing against attackers vs AI exfiltration 1:08:43 - The state of computer security 1:23:08 - How AI labs could be more secure 1:33:13 - What does Palisade do? 1:44:40 - AI phishing 1:53:32 - More on Palisade's work 1:59:56 - Red lines in AI development 2:09:56 - Making AI legible 2:14:08 - Following Jeffrey's research The transcript: axrp.net/episode/2024/04/30/episode-30-ai-security-jeffrey-ladish.html Palisade Research: palisaderesearch.org Jeffrey's Twitter/X account: twitter.com/JeffLadish Main papers we discussed: - LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B: arxiv.org/abs/2310.20624 - BadLLaMa: Cheaply Removing Safety Fine-tuning From LLaMa 2-Chat 13B: arxiv.org/abs/2311.00117 - Securing Artificial Intelligence Model Weights: rand.org/pubs/working_papers/WRA2849-1.html Other links: - Llama 2: Open Foundation and Fine-Tuned Chat Models: https://arxiv.org/abs/2307.09288 - Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693 - Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models: https://arxiv.org/abs/2310.02949 - On the Societal Impact of Open Foundation Models (Stanford paper on marginal harms from open-weight models): https://crfm.stanford.edu/open-fms/ - The Operational Risks of AI in 
Large-Scale Biological Attacks (RAND): https://www.rand

2 hr 13 min
Apr 25, 2024Episode 29
Science of Deep Learning with Vikrant Varma

In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS 0:00:36 - What is CCS? 0:09:54 - Consistent and contrastive features other than model beliefs 0:20:34 - Understanding the banana/shed mystery 0:41:59 - Future CCS-like approaches 0:53:29 - CCS as principal component analysis 0:56:21 - Explaining grokking through circuit efficiency 0:57:44 - Why research science of deep learning? 1:12:07 - Summary of the paper's hypothesis 1:14:05 - What are 'circuits'? 
1:20:48 - The role of complexity 1:24:07 - Many kinds of circuits 1:28:10 - How circuits are learned 1:38:24 - Semi-grokking and ungrokking 1:50:53 - Generalizing the results 1:58:51 - Vikrant's research approach 2:06:36 - The DeepMind alignment team 2:09:06 - Follow-up work The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html Vikrant's Twitter/X account: twitter.com/vikrantvarma_ Main papers: - Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029 - Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390 Other works discussed: - Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827 - Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit - Discussion: Challenges with unsupervised LLM knowledge dis

1 hr 57 min
Apr 17, 2024Episode 28
Suing Labs for AI Risk with Gabriel Weil

How should the law govern AI? Those concerned about existential risks often push either for bans or for regulations meant to ensure that AI is developed safely - but another approach is possible. In this episode, Gabriel Weil talks about his proposal to modify tort law to enable people to sue AI companies for disasters that are "nearly catastrophic". Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast   Topics we discuss, and timestamps: 0:00:35 - The basic idea 0:20:36 - Tort law vs regulation 0:29:10 - Weil's proposal vs Hanson's proposal 0:37:00 - Tort law vs Pigouvian taxation 0:41:16 - Does disagreement on AI risk make this proposal less effective? 0:49:53 - Warning shots - their prevalence and character 0:59:17 - Feasibility of big changes to liability law 1:29:17 - Interactions with other areas of law 1:38:59 - How Gabriel encountered the AI x-risk field 1:42:41 - AI x-risk and the legal field 1:47:44 - Technical research to help with this proposal 1:50:47 - Decisions this proposal could influence 1:55:34 - Following Gabriel's research   The transcript: axrp.net/episode/2024/04/17/episode-28-tort-law-for-ai-risk-gabriel-weil.html   Links for Gabriel:  - SSRN page: papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=1648032  - Twitter/X account: twitter.com/gabriel_weil   Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence: papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006   Other links:  - Foom liability: overcomingbias.com/p/foom-liability  - Punitive Damages: An Economic Analysis: law.harvard.edu/faculty/shavell/pdf/111_Harvard_Law_Rev_869.pdf  - Efficiency, Fairness, and the Externalization of Reasonable Risks: The Problem With the Learned Hand Formula: papers.ssrn.com/sol3/papers.cfm?abstract_id=4466197  - Tort Law Can Play an Important Role in Mitigating AI Risk: forum.effectivealtruism.org/posts/epKBmiyLpZWWFEYDb/tort-law-can-play-an-important-role-in-mitigating-ai-risk  - How Technical AI Safety Researchers Can Help 
Implement P

2 hr 56 min
Apr 11, 2024Episode 27
AI Control with Buck Shlegeris and Ryan Greenblatt

A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast   Topics we discuss, and timestamps: 0:00:31 - What is AI control? 0:16:16 - Protocols for AI control 0:22:43 - Which AIs are controllable? 0:29:56 - Preventing dangerous coded AI communication 0:40:42 - Unpredictably uncontrollable AI 0:58:01 - What control looks like 1:08:45 - Is AI control evil? 1:24:42 - Can red teams match misaligned AI? 1:36:51 - How expensive is AI monitoring? 1:52:32 - AI control experiments 2:03:50 - GPT-4's aptitude at inserting backdoors 2:14:50 - How AI control relates to the AI safety field 2:39:25 - How AI control relates to previous Redwood Research work 2:49:16 - How people can work on AI control 2:54:07 - Following Buck and Ryan's research   The transcript:  axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html Links for Buck and Ryan:  - Buck's twitter/X account: twitter.com/bshlgrs  - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt  - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com   Main research works we talk about:  - The case for ensuring that powerful AIs are controlled:  lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled  - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942   Other things we mention:  - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root  - Preventing language models from 
hiding their reasoning: arxiv.org/abs/2310.18512  - Improving the Welfare of AIs: A Nearcasted Proposal:

1 hr 57 min
Nov 26, 2023Episode 26
AI Governance with Elizabeth Seger

The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democratize AI? And how should we balance benefits and dangers of open-sourcing powerful AI systems such as large language models? In this episode, I speak with Elizabeth Seger about her research on these questions. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: - 0:00:40 - What kinds of AI? - 0:01:30 - Democratizing AI - 0:04:44 - How people talk about democratizing AI - 0:09:34 - Is democratizing AI important? - 0:13:31 - Links between types of democratization - 0:22:43 - Democratizing profits from AI - 0:27:06 - Democratizing AI governance - 0:29:45 - Normative underpinnings of democratization - 0:44:19 - Open-sourcing AI - 0:50:47 - Risks from open-sourcing - 0:56:07 - Should we make AI too dangerous to open source? - 1:00:33 - Offense-defense balance - 1:03:13 - KataGo as a case study - 1:09:03 - Openness for interpretability research - 1:15:47 - Effectiveness of substitutes for open sourcing - 1:20:49 - Offense-defense balance, part 2 - 1:29:49 - Making open-sourcing safer? 
- 1:40:37 - AI governance research - 1:41:05 - The state of the field - 1:43:33 - Open questions - 1:49:58 - Distinctive governance issues of x-risk - 1:53:04 - Technical research to help governance - 1:55:23 - Following Elizabeth's research The transcript: https://axrp.net/episode/2023/11/26/episode-26-ai-governance-elizabeth-seger.html Links for Elizabeth: - Personal website: elizabethseger.com - Centre for the Governance of AI (AKA GovAI): governance.ai Main papers: - Democratizing AI: Multiple Meanings, Goals, and Methods: arxiv.org/abs/2303.12642 - Open-sourcing highly capable foundation models: an evaluation of risks, benefits, and alternative methods for pursuing open source objectives: papers.ssrn.com/sol3/papers.cfm?abstract_id=4596436 Other research we discuss: - What Do We Mean When We Talk About "AI democratisation"? (blog post): governance.ai/post/what-do-we-mean-when-we-talk-about-ai-democratisation - Democratic Inputs to AI (OpenAI):

3 hr 2 min
Oct 3, 2023Episode 25
Cooperative AI with Caspar Oesterheld

Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage their militaries, or simply that many powerful AIs have their own wills. At any rate, it seems valuable for them to be able to cooperatively work together and minimize pointless conflict. How do we ensure that AIs behave this way - and what do we need to learn about how rational agents interact to make that more clear? In this episode, I'll be speaking with Caspar Oesterheld about some of his research on this very topic. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Episode art by Hamish Doodles: hamishdoodles.com Topics we discuss, and timestamps: - 0:00:34 - Cooperative AI - 0:06:21 - Cooperative AI vs standard game theory - 0:19:45 - Do we need cooperative AI if we get alignment? - 0:29:29 - Cooperative AI and agent foundations - 0:34:59 - A Theory of Bounded Inductive Rationality - 0:50:05 - Why it matters - 0:53:55 - How the theory works - 1:01:38 - Relationship to logical inductors - 1:15:56 - How fast does it converge? - 1:19:46 - Non-myopic bounded rational inductive agents? - 1:24:25 - Relationship to game theory - 1:30:39 - Safe Pareto Improvements - 1:30:39 - What they try to solve - 1:36:15 - Alternative solutions - 1:40:46 - How safe Pareto improvements work - 1:51:19 - Will players fight over which safe Pareto improvement to adopt? - 2:06:02 - Relationship to program equilibrium - 2:11:25 - Do safe Pareto improvements break themselves? - 2:15:52 - Similarity-based Cooperation - 2:23:07 - Are similarity-based cooperators overly cliqueish? 
- 2:27:12 - Sensitivity to noise - 2:29:41 - Training neural nets to do similarity-based cooperation - 2:50:25 - FOCAL, Caspar's research lab - 2:52:52 - How the papers all relate - 2:57:49 - Relationship to functional decision theory - 2:59:45 - Following Caspar's research The transcript: axrp.net/episode/2023/10/03/episode-25-cooperative-ai-caspar-oesterheld.html Links for Caspar: - FOCAL at CMU: www.cs.cmu.edu/~focal/ - Caspar on X, formerly known as Twitter: twitter.com/C_Oesterheld - Caspar's blog: casparoesterheld.com/ - Caspar on Google Scholar: scholar.google.com/citations?user=xe

2 hr 8 min
Jul 27, 2023Episode 24
Superalignment with Jan Leike

Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers, attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Episode art by Hamish Doodles: hamishdoodles.com/   Topics we discuss, and timestamps:  - 0:00:37 - The superalignment team  - 0:02:10 - What's a human-level automated alignment researcher?    - 0:06:59 - The gap between human-level automated alignment researchers and superintelligence    - 0:18:39 - What does it do?    - 0:24:13 - Recursive self-improvement  - 0:26:14 - How to make the AI AI alignment researcher    - 0:30:09 - Scalable oversight    - 0:44:38 - Searching for bad behaviors and internals    - 0:54:14 - Deliberately training misaligned models  - 1:02:34 - Four year deadline    - 1:07:06 - What if it takes longer?  - 1:11:38 - The superalignment team and...    - 1:11:38 - ... governance    - 1:14:37 - ... other OpenAI teams    - 1:18:17 - ... other labs  - 1:26:10 - Superalignment team logistics  - 1:29:17 - Generalization  - 1:43:44 - Complementary research  - 1:48:29 - Why is Jan optimistic?    - 1:58:32 - Long-term agency in LLMs?    - 2:02:44 - Do LLMs understand alignment?  
- 2:06:01 - Following Jan's research   The transcript: axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html   Links for Jan and OpenAI:  - OpenAI jobs: openai.com/careers  - Jan's substack: aligned.substack.com  - Jan's twitter: twitter.com/janleike   Links to research and other writings we discuss:  - Introducing Superalignment: openai.com/blog/introducing-superalignment  - Let's Verify Step by Step (process-based feedback on math): arxiv.org/abs/2305.20050  - Planning for AGI and beyond: openai.com/blog/planning-for-agi-and-beyond  - Self-critiquing models for assisting human evaluators

2 hr 5 min
Jul 27, 2023Episode 23
Mechanistic Anomaly Detection with Mark Xu

Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that that will alert us to potential treacherous turns. We both talk about the core problems of relating these mechanistic anomalies to bad behaviour, as well as the paper "Formalizing the presumption of independence", which formulates the problem of formalizing heuristic mathematical reasoning, in the hope that this will let us mathematically define "mechanistic anomalies". Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Episode art by Hamish Doodles: hamishdoodles.com/   Topics we discuss, and timestamps:  - 0:00:38 - Mechanistic anomaly detection    - 0:09:28 - Are all bad things mechanistic anomalies, and vice versa?    - 0:18:12 - Are responses to novel situations mechanistic anomalies?    - 0:39:19 - Formalizing "for the normal reason, for any reason"    - 1:05:22 - How useful is mechanistic anomaly detection?  - 1:12:38 - Formalizing the Presumption of Independence    - 1:20:05 - Heuristic arguments in physics    - 1:27:48 - Difficult domains for heuristic arguments    - 1:33:37 - Why not maximum entropy?    
- 1:44:39 - Adversarial robustness for heuristic arguments    - 1:54:05 - Other approaches to defining mechanisms  - 1:57:20 - The research plan: progress and next steps  - 2:04:13 - Following ARC's research   The transcript: axrp.net/episode/2023/07/24/episode-23-mechanistic-anomaly-detection-mark-xu.html   ARC links:  - Website: alignment.org  - Theory blog: alignment.org/blog  - Hiring page: alignment.org/hiring   Research we discuss:  - Formalizing the presumption of independence: arxiv.org/abs/2211.06738  - Eliciting Latent Knowledge (aka ELK): alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge  - Mechanistic Anomaly Detection and ELK: alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk

4 min
Jun 28, 2023
Survey, store closing, Patreon

Very brief survey: bit.ly/axrpsurvey2023 Store is closing in a week! Link: store.axrp.net/ Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast

3 hr 28 min
Jun 15, 2023Episode 22
Shard Theory with Quintin Pope

What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Episode art by Hamish Doodles: hamishdoodles.com Topics we discuss, and timestamps: - 0:00:42 - Why understand human value formation? - 0:19:59 - Why not design methods to align to arbitrary values? - 0:27:22 - Postulates about human brains - 0:36:20 - Sufficiency of the postulates - 0:44:55 - Reinforcement learning as conditional sampling - 0:48:05 - Compatibility with genetically-influenced behaviour - 1:03:06 - Why deep learning is basically what the brain does - 1:25:17 - Shard theory - 1:38:49 - Shard theory vs expected utility optimizers - 1:54:45 - What shard theory says about human values - 2:05:47 - Does shard theory mean we're doomed? - 2:18:54 - Will nice behaviour generalize? - 2:33:48 - Does alignment generalize farther than capabilities? - 2:42:03 - Are we at the end of machine learning history? - 2:53:09 - Shard theory predictions - 2:59:47 - The shard theory research community - 3:13:45 - Why do shard theorists not work on replicating human childhoods? 
- 3:25:53 - Following shardy research The transcript: axrp.net/episode/2023/06/15/episode-22-shard-theory-quintin-pope.html Shard theorist links: - Quintin's LessWrong profile: lesswrong.com/users/quintin-pope - Alex Turner's LessWrong profile: lesswrong.com/users/turntrout - Shard theory Discord: discord.gg/AqYkK7wqAG - EleutherAI Discord: discord.gg/eleutherai Research we discuss: - The Shard Theory Sequence: lesswrong.com/s/nyEFg3AuJpdAozmoX - Pretraining Language Models with Human Preferences: arxiv.org/abs/2302.08582 - Inner alignment in salt-starved rats: lesswrong.com/p

1 hr 56 min
May 2, 2023Episode 21
Interpretability for Engineers with Stephen Casper

Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural networks. But how do we know if meaningful progress is actually being made? What should we want out of these tools? In this episode, I speak to Stephen Casper about these questions, as well as about a benchmark he's co-developed to evaluate whether interpretability tools can find 'Trojan horses' hidden inside neural nets. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast   Topics we discuss, and timestamps:  - 00:00:42 - Interpretability for engineers    - 00:00:42 - Why interpretability?    - 00:12:55 - Adversaries and interpretability    - 00:24:30 - Scaling interpretability    - 00:42:29 - Critiques of the AI safety interpretability community    - 00:56:10 - Deceptive alignment and interpretability  - 01:09:48 - Benchmarking Interpretability Tools (for Deep Neural Networks) (Using Trojan Discovery)    - 01:10:40 - Why Trojans?    - 01:14:53 - Which interpretability tools?    
- 01:28:40 - Trojan generation    - 01:38:13 - Evaluation  - 01:46:07 - Interpretability for shaping policy  - 01:53:55 - Following Casper's work   The transcript: axrp.net/episode/2023/05/02/episode-21-interpretability-for-engineers-stephen-casper.html   Links for Casper:  - Personal website: stephencasper.com/  - Twitter: twitter.com/StephenLCasper  - Electronic mail: scasper [at] mit [dot] edu   Research we discuss:  - The Engineer's Interpretability Sequence: alignmentforum.org/s/a6ne2ve5uturEEQK7  - Benchmarking Interpretability Tools for Deep Neural Networks: arxiv.org/abs/2302.10894  - Adversarial Policies beat Superhuman Go AIs: goattack.far.ai/  - Adversarial Examples Are Not Bugs, They Are Features: arxiv.org/abs/1905.02175  - Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974  - Softmax Linear Units: transformer-circuits.pub/2022/solu/index.html  - Red-Teaming the Stable Diffusion Safety Filter: arxiv.org/abs/2210.04610   Episode art by Hamish Doodles: hamishdoo

2 hr 27 min
Apr 12, 2023Episode 20
'Reform' AI Alignment with Scott Aaronson

How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with Scott Aaronson about his views on how to make progress in AI alignment, as well as his work on watermarking the output of language models, and how he moved from a background in quantum complexity theory to working on AI.   Note: this episode was recorded before this story (vice.com/en/article/pkadgm/man-dies-by-suicide-after-talking-with-ai-chatbot-widow-says) emerged of a man committing suicide after discussions with a language-model-based chatbot, that included discussion of the possibility of him killing himself. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast   Topics we discuss, and timestamps:  - 0:00:36 - 'Reform' AI alignment    - 0:01:52 - Epistemology of AI risk    - 0:20:08 - Immediate problems and existential risk    - 0:24:35 - Aligning deceitful AI    - 0:30:59 - Stories of AI doom    - 0:34:27 - Language models    - 0:43:08 - Democratic governance of AI    - 0:59:35 - What would change Scott's mind  - 1:14:45 - Watermarking language model outputs    - 1:41:41 - Watermark key secrecy and backdoor insertion  - 1:58:05 - Scott's transition to AI research    - 2:03:48 - Theoretical computer science and AI alignment    - 2:14:03 - AI alignment and formalizing philosophy    - 2:22:04 - How Scott finds AI research  - 2:24:53 - Following Scott's research   The transcript: axrp.net/episode/2023/04/11/episode-20-reform-ai-alignment-scott-aaronson.html   Links to Scott's things:  - Personal website: scottaaronson.com  - Book, Quantum Computing Since Democritus: amazon.com/Quantum-Computing-since-Democritus-Aaronson/dp/0521199565/  - Blog, Shtetl-Optimized: scottaaronson.blog   Writings we discuss:  - Reform AI Alignment: scottaaronson.blog/?p=6821  - Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974

2 min
Feb 7, 2023
Store, Patreon, Video

Store: https://store.axrp.net/ Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Video: https://www.youtube.com/watch?v=kmPFjpEibu0

3 hr 52 min
Feb 4, 2023Episode 19
Mechanistic Interpretability with Neel Nanda

How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope at getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking. Topics we discuss, and timestamps: - 00:01:05 - What is mechanistic interpretability? - 00:24:16 - Types of AI cognition - 00:54:27 - Automating mechanistic interpretability - 01:11:57 - Summarizing the papers - 01:24:43 - 'A Mathematical Framework for Transformer Circuits' - 01:39:31 - How attention works - 01:49:26 - Composing attention heads - 01:59:42 - Induction heads - 02:11:05 - 'In-context Learning and Induction Heads' - 02:12:55 - The multiplicity of induction heads - 02:30:10 - Lines of evidence - 02:38:47 - Evolution in loss-space - 02:46:19 - Mysteries of in-context learning - 02:50:57 - 'Progress measures for grokking via mechanistic interpretability' - 02:50:57 - How neural nets learn modular addition - 03:11:37 - The suddenness of grokking - 03:34:16 - Relation to other research - 03:43:57 - Could mechanistic interpretability possibly work? 
- 03:49:28 - Following Neel's research The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html Links to Neel's things: - Neel on Twitter: twitter.com/NeelNanda5 - Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1 - Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability - TransformerLens: github.com/neelnanda-io/TransformerLens - Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic - Neel on YouTube: youtube.com/@neelnanda2469 - 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj - Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

1 min
Oct 13, 2022
New podcast - The Filan Cabinet

I have a new podcast, where I interview whoever I want about whatever I want. It's called "The Filan Cabinet", and you can find it wherever you listen to podcasts. The first three episodes are about pandemic preparedness, God, and cryptocurrency. For more details, check out the podcast website (thefilancabinet.com), or search "The Filan Cabinet" in your podcast app.

1 hr 46 min
Sep 3, 2022Episode 18
Concept Extrapolation with Stuart Armstrong

Concept extrapolation is the idea of taking concepts an AI has about the world - say, "mass" or "does this picture contain a hot dog" - and extending them sensibly to situations where things are different - like learning that the world works via special relativity, or seeing a picture of a novel sausage-bread combination. For a while, Stuart Armstrong has been thinking about concept extrapolation and how it relates to AI alignment. In this episode, we discuss where his thoughts are at on this topic, what the relationship to AI alignment is, and what the open questions are.   Topics we discuss, and timestamps:  - 00:00:44 - What is concept extrapolation  - 00:15:25 - When is concept extrapolation possible  - 00:30:44 - A toy formalism  - 00:37:25 - Uniqueness of extrapolations  - 00:48:34 - Unity of concept extrapolation methods  - 00:53:25 - Concept extrapolation and corrigibility  - 00:59:51 - Is concept extrapolation possible?  - 01:37:05 - Misunderstandings of Stuart's approach  - 01:44:13 - Following Stuart's work   The transcript: axrp.net/episode/2022/09/03/episode-18-concept-extrapolation-stuart-armstrong.html   Stuart's startup, Aligned AI: aligned-ai.com   Research we discuss:  - The Concept Extrapolation sequence: alignmentforum.org/s/u9uawicHx7Ng7vwxA  - The HappyFaces benchmark: github.com/alignedai/HappyFaces  - Goal Misgeneralization in Deep Reinforcement Learning: arxiv.org/abs/2105.14111

1 hr
Aug 21, 2022Episode 17
Training for Very High Reliability with Daniel Ziegler

Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually do this well, especially when we can't use a computer program to say what a 'failure' is? In this episode, I speak with Daniel Ziegler about his research group's efforts to try doing this with present-day language models, and what they learned. Listeners beware: this episode contains a spoiler for the Animorphs franchise around minute 41 (in the 'Fanfiction' section of the transcript).   Topics we discuss, and timestamps:  - 00:00:40 - Summary of the paper  - 00:02:23 - Alignment as scalable oversight and catastrophe minimization  - 00:08:06 - Novel contributions  - 00:14:20 - Evaluating adversarial robustness  - 00:20:26 - Adversary construction  - 00:35:14 - The task  - 00:38:23 - Fanfiction  - 00:42:15 - Estimators to reduce labelling burden  - 00:45:39 - Future work  - 00:50:12 - About Redwood Research   The transcript: axrp.net/episode/2022/08/21/episode-17-training-for-very-high-reliability-daniel-ziegler.html   Daniel Ziegler on Google Scholar: scholar.google.com/citations?user=YzfbfDgAAAAJ   Research we discuss:  - Daniel's paper, Adversarial Training for High-Stakes Reliability: arxiv.org/abs/2205.01663  - Low-stakes alignment: alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment  - Red Teaming Language Models with Language Models: arxiv.org/abs/2202.03286  - Uncertainty Estimation for Language Reward Models: arxiv.org/abs/2203.07472  - Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit

1 hr 4 min
Jul 1, 2022Episode 16
Preparing for Debate AI with Geoffrey Irving

Many people in the AI alignment space have heard of AI safety via debate - check out AXRP episode 6 (axrp.net/episode/2021/04/08/episode-6-debate-beth-barnes.html) if you need a primer. But how do we get language models to the stage where they can usefully implement debate? In this episode, I talk to Geoffrey Irving about the role of language models in AI safety, as well as three projects he's done that get us closer to making debate happen: using language models to find flaws in themselves, getting language models to back up claims they make with citations, and figuring out how uncertain language models should be about the quality of various answers.   Topics we discuss, and timestamps:  - 00:00:48 - Status update on AI safety via debate  - 00:10:24 - Language models and AI safety  - 00:19:34 - Red teaming language models with language models  - 00:35:31 - GopherCite  - 00:49:10 - Uncertainty Estimation for Language Reward Models  - 01:00:26 - Following Geoffrey's work, and working with him   The transcript: axrp.net/episode/2022/07/01/episode-16-preparing-for-debate-ai-geoffrey-irving.html   Geoffrey's twitter: twitter.com/geoffreyirving   Research we discuss:  - Red Teaming Language Models With Language Models: arxiv.org/abs/2202.03286  - Teaching Language Models to Support Answers with Verified Quotes, aka GopherCite: arxiv.org/abs/2203.11147  - Uncertainty Estimation for Language Reward Models: arxiv.org/abs/2203.07472  - AI Safety via Debate: arxiv.org/abs/1805.00899  - Writeup: progress on AI safety via debate: lesswrong.com/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1  - Eliciting Latent Knowledge: ai-alignment.com/eliciting-latent-knowledge-f977478608fc  - Training Compute-Optimal Large Language Models, aka Chinchilla: arxiv.org/abs/2203.15556

1 hr 36 min
May 23, 2022Episode 15
Natural Abstractions with John Wentworth

Why does anybody care about natural abstractions? Do they somehow relate to math, or value learning? How do E. coli bacteria find sources of sugar? All these questions and more will be answered in this interview with John Wentworth, where we talk about his research plan of understanding agency via natural abstractions. Topics we discuss, and timestamps:  - 00:00:31 - Agency in E. Coli  - 00:04:59 - Agency in financial markets  - 00:08:44 - Inferring agency in real-world systems  - 00:16:11 - Selection theorems  - 00:20:22 - Abstraction and natural abstractions  - 00:32:42 - Information at a distance  - 00:39:20 - Why the natural abstraction hypothesis matters  - 00:44:48 - Unnatural abstractions used by humans?  - 00:49:11 - Probability, determinism, and abstraction  - 00:52:58 - Whence probabilities in deterministic universes?  - 01:02:37 - Abstraction and maximum entropy distributions  - 01:07:39 - Natural abstractions and impact  - 01:08:50 - Learning human values  - 01:20:47 - The shape of the research landscape  - 01:34:59 - Following John's work   The transcript: axrp.net/episode/2022/05/23/episode-15-natural-abstractions-john-wentworth.html   John on LessWrong: lesswrong.com/users/johnswentworth   Research that we discuss:  - Alignment by default - contains the natural abstraction hypothesis: alignmentforum.org/posts/Nwgdq6kHke5LY692J/alignment-by-default#Unsupervised__Natural_Abstractions  - The telephone theorem: alignmentforum.org/posts/jJf4FrfiQdDGg7uco/information-at-a-distance-is-mediated-by-deterministic  - Generalizing Koopman-Pitman-Darmois: alignmentforum.org/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois  - The plan: alignmentforum.org/posts/3L46WGauGpr7nYubu/the-plan  - Understanding deep learning requires rethinking generalization - deep learning can fit random data: arxiv.org/abs/1611.03530  - A closer look at memorization in deep networks - deep learning learns before memorizing: arxiv.org/abs/1706.05394  - Zero-shot coordination: 
arxiv.org/abs/2003.02979  - A new formalism,

1 hr 47 min
Apr 5, 2022Episode 14
Infra-Bayesian Physicalism with Vanessa Kosoy

Late last year, Vanessa Kosoy and Alexander Appel published some research under the heading of "Infra-Bayesian physicalism". But wait - what was infra-Bayesianism again? Why should we care? And what does any of this have to do with physicalism? In this episode, I talk with Vanessa Kosoy about these questions, and get a technical overview of how infra-Bayesian physicalism works and what its implications are. Topics we discuss, and timestamps: - 00:00:48 - The basics of infra-Bayes - 00:08:32 - An invitation to infra-Bayes - 00:11:23 - What is naturalized induction? - 00:19:53 - How infra-Bayesian physicalism helps with naturalized induction - 00:19:53 - Bridge rules - 00:22:22 - Logical uncertainty - 00:23:36 - Open source game theory - 00:28:27 - Logical counterfactuals - 00:30:55 - Self-improvement - 00:32:40 - How infra-Bayesian physicalism works - 00:32:47 - World models - 00:39:20 - Priors - 00:42:53 - Counterfactuals - 00:50:34 - Anthropics - 00:54:40 - Loss functions - 00:56:44 - The monotonicity principle - 01:01:57 - How to care about various things - 01:08:47 - Decision theory - 01:19:53 - Follow-up research - 01:20:06 - Infra-Bayesian physicalist quantum mechanics - 01:26:42 - Infra-Bayesian physicalist agreement theorems - 01:29:00 - The production of infra-Bayesianism research - 01:35:14 - Bridge rules and malign priors - 01:45:27 - Following Vanessa's work The transcript: axrp.net/episode/2022/04/05/episode-14-infra-bayesian-physicalism-vanessa-kosoy.html Vanessa on the Alignment Forum: alignmentforum.org/users/vanessa-kosoy Research that we discuss: - Infra-Bayesian physicalism: a formal theory of naturalized induction: alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized - Updating ambiguous beliefs (contains the infra-Bayesian update rule): sciencedirect.com/science/article/abs/pii/S0022053183710033 - Functional Decision Theory: A New Theory of Instrumental Rationality: arxiv.org/abs/1710.05060 - 
Space-time embedded intelligence: cs.utexas.edu/~ring/Orseau,%20Ring%3B%20Space-Time%20Embedded%20Intelligence,%20AGI%202012.pdf - Attacking the

1 hr 33 min
Mar 31, 2022Episode 13
First Principles of AGI Safety with Richard Ngo

How should we think about artificial general intelligence (AGI), and the risks it might pose? What constraints exist on technical solutions to the problem of aligning superhuman AI systems with human intentions? In this episode, I talk to Richard Ngo about his report analyzing AGI safety from first principles, and recent conversations he had with Eliezer Yudkowsky about the difficulty of AI alignment. Topics we discuss, and timestamps: - 00:00:40 - The nature of intelligence and AGI - 00:01:18 - The nature of intelligence - 00:06:09 - AGI: what and how - 00:13:30 - Single vs collective AI minds - 00:18:57 - AGI in practice - 00:18:57 - Impact - 00:20:49 - Timing - 00:25:38 - Creation - 00:28:45 - Risks and benefits - 00:35:54 - Making AGI safe - 00:35:54 - Robustness of the agency abstraction - 00:43:15 - Pivotal acts - 00:50:05 - AGI safety concepts - 00:50:05 - Alignment - 00:56:14 - Transparency - 00:59:25 - Cooperation - 01:01:40 - Optima and selection processes - 01:13:33 - The AI alignment research community - 01:13:33 - Updates from the Yudkowsky conversation - 01:17:18 - Corrections to the community - 01:23:57 - Why others don't join - 01:26:38 - Richard Ngo as a researcher - 01:28:26 - The world approaching AGI - 01:30:41 - Following Richard's work The transcript: axrp.net/episode/2022/03/31/episode-13-first-principles-agi-safety-richard-ngo.html Richard on the Alignment Forum: alignmentforum.org/users/ricraz Richard on Twitter: twitter.com/RichardMCNgo The AGI Safety Fundamentals course: eacambridge.org/agi-safety-fundamentals Materials that we mention: - AGI Safety from First Principles: alignmentforum.org/s/mzgtmmTKKn5MuCzFJ - Conversations with Eliezer Yudkowsky: alignmentforum.org/s/n945eovrA3oDueqtq - The Bitter Lesson: incompleteideas.net/IncIdeas/BitterLesson.html - Metaphors We Live By: en.wikipedia.org/wiki/Metaphors_We_Live_By - The Enigma of Reason: hup.harvard.edu/catalog.php?isbn=9780674237827 - Draft report on AI timelines, by Ajeya Cotra: 
alignmentforum.org/posts/KrJfoZzp
