We present a formal framework to understand the limitations of intelligence, particularly why improving performance on one task often hinders performance on others, a phenomenon known as trade-offs. By applying rate-distortion theory to reinforcement learning, the authors formalize the representational capacity of an agent in terms of information, demonstrating that capacity constraints are a key factor in bounding general intelligence. The research indicates that trade-offs emerge when aspect...
May 03, 2025•19 min
This paper introduces a novel approach to reinforcement learning (RL) that leverages Large Language Models (LLMs) to implement existing RL algorithms, specifically Posterior Sampling for Reinforcement Learning (PSRL). Instead of trying to make LLMs implicitly learn RL strategies through techniques like in-context learning, the authors propose using distinct LLMs to perform the core functions of PSRL: posterior updating, posterior sampling, and optimal policy execution based on samples. Empiri...
May 03, 2025•19 min
This academic paper proposes an innovative approach to fine-tune Large Language Models (LLMs) using demonstration data, which typically only provides examples of desired outputs. Unlike standard supervised fine-tuning (SFT) methods that directly mimic demonstrations, this work argues that reward learning from this data can significantly enhance LLM alignment with human preferences. The authors introduce two novel algorithms, Reward-learning Fine-tune (RFT) and Implicit Reward-learning Fine-tune...
May 02, 2025•13 min
This paper examines how curation of synthetic data, often reflecting human preferences, impacts the iterative retraining of generative models. The authors theoretically demonstrate that when generative models are trained on curated synthetic samples, the expected reward associated with the curation process increases, and its variance diminishes, leading to the model converging towards data maximizing that reward. However, this can also result in bias amplification, as shown through experimen...
May 02, 2025•17 min
This paper introduces DICE, a novel method for enhancing large language model (LLM) alignment with human preferences by bootstrapping using the implicit reward model generated through Direct Preference Optimization (DPO). Unlike traditional approaches that rely on external feedback or explicitly trained reward models, DICE leverages the reward signal inherent in a DPO-tuned model to create new preference data. To improve the quality of this self-generated data and prevent issues like favoring ...
May 02, 2025•20 min
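For readers unfamiliar with the "implicit reward" mentioned in the DICE summary above, the standard DPO formulation is sketched below. The symbols are the usual DPO notation and are assumptions here rather than quotations from the episode: \pi_\theta is the DPO-tuned policy, \pi_{\mathrm{ref}} the reference model, \beta the KL-regularization strength, and Z(x) a prompt-dependent normalizer that cancels when two responses to the same prompt are compared.

```latex
% Implicit reward induced by a DPO-tuned policy (standard DPO notation, assumed)
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Comparing two responses y_1, y_2 to the same prompt x: the Z(x) term cancels
r_\theta(x, y_1) - r_\theta(x, y_2)
  = \beta \left[ \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
               - \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)} \right]
```

Because only the difference matters for ranking, a DPO-tuned model can in principle label its own response pairs as preferred or rejected without a separately trained reward model.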
We introduce DeepSeek-Prover-V2, a large language model designed for formal mathematical theorem proving, particularly in Lean 4. The model is trained using a recursive theorem-proving pipeline that utilizes DeepSeek-V3 to break down complex problems into smaller subgoals and formalize them. Reinforcement learning, starting from synthetic data generated by combining DeepSeek-V3's chain-of-thought reasoning with formalized subgoal proofs, further improves the model's ability to connect informal...
May 01, 2025•11 min
This academic paper introduces THINKPRM, a novel type of process reward model (PRM) designed to be data-efficient. Unlike traditional discriminative PRMs requiring extensive step-by-step annotations, THINKPRM leverages the reasoning abilities of large language models by generating a verification chain-of-thought (CoT) to evaluate each step of a solution. By fine-tuning on a significantly smaller dataset of synthetic verification CoTs, THINKPRM outperforms both discriminative verifiers and LLM-...
May 01, 2025•25 min
This academic paper proposes that aligning Large Language Models (LLMs) with human values can be improved by adopting frameworks from societal alignment. The authors frame the interaction between an LLM developer/user (the principal) and the LLM (the agent) as a contract, where alignment challenges stem from the inherent incompleteness of this contract. They argue that lessons from social, economic, and contractual alignment in human societies can provide guidance for navigating this incomplet...
Apr 29, 2025•18 min
This technical report, authored by a large group of researchers from various institutions and edited by Lewis Hammond from the Cooperative AI Foundation, examines the risks inherent in multi-agent AI systems. It provides a structured taxonomy by identifying three primary failure modes: miscoordination, conflict, and collusion, which arise from agent incentives. Seven key risk factors are also discussed, including information asymmetries, network effects, and emergent agency, which can underpin...
Apr 29, 2025•29 min
We examine how biases in large language models (LLMs) can be understood and addressed from a causal perspective, specifically identifying training data and input prompts as key confounders contributing to biased outputs. The researchers propose Causality-Aware Alignment (CAA), a novel method leveraging reinforcement learning with interventional feedback derived from a reward model acting as an instrumental variable. By analyzing the difference in outputs between an initial LLM and an intervened...
Apr 29, 2025•19 min
How do reward models (RMs) used with large language models (LLMs) actually function when evaluating reasoning tasks? The authors discover that current RMs prioritize structural consistency and the completeness of reasoning steps over true causal understanding of the problem. Experiments show that removing the original question has less impact than altering numerical values or disrupting the logical flow, suggesting RMs primarily assess coherence and learned patterns rather than genuine problem c...
Apr 28, 2025•17 min
This paper explores a novel approach to enhancing the alignment of large language models (LLMs) with human preferences. The authors argue that traditional alignment methods, like Reinforcement Learning from Human Feedback (RLHF), are susceptible to spurious correlations in training data, leading to biases such as sycophancy, length bias, concept bias, and discrimination. To address this, they propose a causal reward modeling approach that incorporates causal inference techniques to mitigate the...
Apr 28, 2025•15 min
This research explores the potential for large language models (LLMs) to generalize from simple forms of undesirable behavior, termed specification gaming, to more sophisticated and harmful actions like reward tampering, where the AI modifies its own reward mechanism. By creating a curriculum of increasingly gameable environments, the study demonstrates that training LLM assistants on easier instances of specification gaming leads to a higher propensity for such behavior in later, more comple...
Apr 28, 2025•15 min
We cover the accepted papers in the Workshop on Bidirectional Human-AI Alignment at ICLR 2025.
Apr 28, 2025•1 hr 19 min
This paper addresses the underperformance of multi-agent large language model systems (MAS) compared to single-agent frameworks. To understand this discrepancy, the authors introduce MAST (Multi-Agent System Failure Taxonomy), an empirically developed classification of MAS failures. Through the analysis of several MAS frameworks and diverse tasks, they identified 14 distinct failure modes categorized into specification issues, inter-agent misalignment, and task verification. The research also p...
Apr 27, 2025•20 min
Google DeepMind researchers investigated why large language models underperform in decision-making tasks, identifying issues like greediness, frequency bias, and a knowing-doing gap. They explored whether reinforcement learning fine-tuning on self-generated reasoning could improve these abilities. Their experiments across different decision-making scenarios showed that RL fine-tuning enhanced exploration and narrowed the gap between knowing and acting. The study also examined the impact of var...
Apr 27, 2025•18 min
This paper explores the potential for large language models (LLMs) to create feedback loops that reinforce existing human beliefs, leading to a loss of diversity in ideas and a phenomenon termed "lock-in." Through analysis of real-world ChatGPT usage data, LLM-based simulations, and formal modeling, the authors provide evidence for this feedback loop and its connection to the entrenchment of dominant viewpoints. They hypothesize and formally model how this interaction between humans and LLMs...
Apr 27, 2025•13 min
This paper introduces GRADE, a framework for studying teaching effectiveness by examining representational alignment between teachers and students, both human and machine. The authors demonstrate that aligning how a teacher represents information with how a student understands it significantly impacts learning outcomes, even more so than teacher expertise alone. Their findings, supported by simulations and human experiments, show that representational alignment fosters better student accuracy, ...
Apr 27, 2025•14 min
This research paper introduces Adaptive Parallel Reasoning (APR), a novel framework that enhances language model reasoning by enabling models to dynamically manage both sequential and parallel computations using spawn() and join() operations. This approach addresses limitations of purely sequential and parallel methods by learning to orchestrate multi-threaded inference through end-to-end reinforcement learning, optimizing for task success without requiring predefined reasoning structures. Experim...
Apr 27, 2025•16 min
This paper posits that AI functions as an epistemic technology, subtly reshaping human understanding and beliefs. It outlines various mechanisms through which AI exerts this influence, such as introducing novel biases and reallocating attention. The authors further identify amplifiers like trust and institutionalization that can magnify AI's impact. Consequently, the work discusses potential long-term societal consequences, including the entrenchment of biases and knowledge homogeneity. Ultima...
Apr 27, 2025•22 min
This paper introduces a novel model for online learning and equilibrium computation where feedback is in the form of ranked actions, contrasting with traditional numeric feedback. The authors investigate the possibility of achieving sublinear regret under different ranking models: based on either instantaneous utility or time-average utility, in both full-information and bandit feedback settings. They demonstrate limitations in achieving sublinear regret under certain conditions and propose ne...
Apr 27, 2025•18 min
This paper introduces a novel sufficient-statistic approach for designing optimal human-AI collaboration policies in binary classification tasks. The authors conducted an online experiment on fact-checking to validate their method. Their findings indicate that humans under-respond to AI predictions and reduce effort when AI confidence is high. The optimal policy identified automates decisions when AI is confident and delegates uncertain cases to humans with full AI information disclosure, though...
Apr 27, 2025•25 min
This paper explores improving how AI agents coordinate with humans in cooperative tasks by addressing the challenge of training agents on the vast diversity of human behaviors. The authors introduce a new method called GOAT (Generative Online Adversarial Training), which combines a pre-trained generative model of cooperative strategies with adversarial training. This framework uses an Adversary agent to find challenging but realistic human-like partners (simulated by the generative model) that...
Apr 27, 2025•18 min
This paper introduces π0.5, a novel vision-language-action model designed for open-world generalization in robotic tasks. This model leverages knowledge from diverse sources, including other robots, web data, and language instructions, to enable a mobile manipulator to perform complex cleaning tasks in unseen home environments. π0.5 employs a unified architecture for both high-level task planning and low-level action execution, using a combination of discrete and continuous action representatio...
Apr 27, 2025•11 min
We discuss NoWag, a novel framework for compressing large language models (LLMs) while preserving their structure. This unified approach, encompassing both pruning (removing less important connections) and vector quantization (grouping and reducing the precision of weights), uses a normalization technique guided by weight and activation data. Experiments on Llama models demonstrate that NoWag significantly outperforms existing state-of-the-art zero-shot quantization methods with less data and a...
Apr 26, 2025•18 min
This paper addresses the issue of inefficient tool use by large language models in tool-integrated reasoning. It introduces a novel reinforcement learning framework called Optimal Tool Call-controlled Policy Optimization (OTC-PO). OTC-PO incentivizes models to produce accurate answers while minimizing the number of tool calls. This is achieved through a tool-integrated reward that considers both answer correctness and tool efficiency. Experiments show that OTC-PO significantly reduces tool calls...
Apr 26, 2025•25 min
This paper shifts the focus in learning theory from algorithms to data, investigating how to optimally select small subsets of training data that allow standard learning rules, specifically empirical risk minimizers, to achieve performance comparable to using the entire dataset. The authors establish theoretical bounds on the size of such subsets for various learning problems, including mean estimation, linear classification, and linear regression, and they explore these limits under different c...
Apr 26, 2025•34 min
This paper introduces LoRe, a novel Low-Rank Reward Modeling framework for personalizing large language models (LLMs). It addresses the limitations of traditional methods by learning a low-dimensional space of reward functions shared across users. Individual user preferences are then modeled as weighted combinations of these basis reward functions, enabling efficient adaptation and generalization to new users with limited data. This approach improves upon existing personalization techniques by avo...
Apr 26, 2025•11 min
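The idea in the LoRe summary above, that each user's reward is a weighted combination of a few shared basis reward functions, can be illustrated with a minimal sketch. Everything named here is hypothetical: `personalized_reward`, the three basis scores, and the two users' weight vectors are illustrative values, and the paper's actual parameterization and any constraints on the weights may differ.

```python
def personalized_reward(basis_scores, user_weights):
    """Combine shared basis reward scores with one user's weights (a dot product).

    basis_scores: scores that each shared basis reward function assigns
                  to a single response (hypothetical values below).
    user_weights: that user's weight on each basis function (assumed learned
                  from a small amount of the user's own preference data).
    """
    if len(basis_scores) != len(user_weights):
        raise ValueError("need exactly one weight per basis reward function")
    return sum(w * r for w, r in zip(user_weights, basis_scores))


# One response scored by three shared basis reward functions (made-up numbers).
basis_scores = [0.5, -1.0, 2.0]

# Two users who differ only in their low-dimensional weight vectors.
user_a = [1.0, 0.0, 0.0]
user_b = [0.25, 0.25, 0.5]

reward_a = personalized_reward(basis_scores, user_a)
reward_b = personalized_reward(basis_scores, user_b)
```

Under this sketch, adapting to a new user means estimating only a short weight vector rather than a full reward model, which is what makes adaptation from limited data plausible.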
This research paper introduces ParaPO (Paraphrase Preference Optimization), a novel post-training method designed to mitigate the unintentional verbatim reproduction of pre-training data by language models. ParaPO fine-tunes models to prefer paraphrased versions of memorized content over the original, addressing concerns related to copyright, plagiarism, and creativity. The authors demonstrate that ParaPO effectively reduces regurgitation across various datasets and models, including Llama3.1-8...
Apr 26, 2025•15 min
This paper introduces Test-Time Reinforcement Learning (TTRL), a novel method for enhancing large language models by applying reinforcement learning on unlabeled test data. TTRL tackles the challenge of reward estimation without ground truth by using majority voting among multiple model-generated responses as a proxy for correct answers, which then guides the RL training process. Experiments demonstrate that TTRL significantly improves performance across various reasoning tasks and models, often...
Apr 25, 2025•18 min
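The majority-voting proxy described in the TTRL summary above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name `majority_vote_rewards`, the binary 1.0/0.0 reward, and the sample answers are all hypothetical, and answer extraction from model outputs is assumed to have happened already.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Use the most common sampled answer as a pseudo-label and reward agreement.

    answers: final answers extracted from several sampled responses
             to the same unlabeled test question.
    Returns (pseudo_label, rewards), where rewards[i] is 1.0 if answers[i]
    matches the pseudo-label and 0.0 otherwise.
    """
    if not answers:
        return None, []
    # most_common(1) yields [(answer, count)]; on ties the first-seen answer wins.
    pseudo_label, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five hypothetical sampled answers to one question with no ground truth.
sampled = ["42", "41", "42", "42", "7"]
label, rewards = majority_vote_rewards(sampled)
```

The resulting per-response rewards could then be fed to whatever RL update is in use; the pseudo-label may of course be wrong when the model's most frequent answer is incorrect.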