This paper investigates how various reward signals, even spurious and random ones, affect the performance of different language models fine-tuned for mathematical reasoning using Reinforcement Learning with Verifiable Rewards (RLVR). The research demonstrates that while Qwen models show significant improvement even with weak or incorrect rewards, this benefit is not universal, with Llama and OLMo models showing little to no gain. The study links this disparity to pre-existing reasoning patterns, particula...
May 27, 2025•15 min
This paper critically examines the best-of-n policy, a common method for aligning generative language models by selecting the highest-reward sample from $n$ options drawn from a reference policy. It shows that a widely used analytical formula for the KL divergence between the best-of-n policy and the reference is not exact, proving that the formula is only an upper bound. The authors analyze the conditions under which this bound is tight or loose and propose a new, more accurate estimator for the KL diverg...
May 27, 2025•15 min
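The relationship between best-of-n sampling and its KL divergence can be illustrated with a small sketch. The summary above does not state the formula; the expression `log(n) - (n - 1) / n` used here is the one commonly quoted in the literature and is an assumption on our part. The helper names and the toy two-outcome reference distribution are hypothetical and are not taken from the paper.

```python
import math

def best_of_n(candidates, reward):
    """Return the candidate with the highest reward (ties: first seen)."""
    return max(candidates, key=reward)

def kl_formula(n):
    """Commonly quoted expression log(n) - (n - 1) / n, in nats (assumed here)."""
    return math.log(n) - (n - 1) / n

def exact_kl_discrete(ref_probs, n):
    """Exact KL(best-of-n || reference) for a discrete reference whose
    outcomes are listed in strictly increasing reward order."""
    kl, cdf_prev = 0.0, 0.0
    for p in ref_probs:
        cdf = cdf_prev + p
        # Probability that the maximum of n i.i.d. draws is this outcome.
        p_bon = cdf ** n - cdf_prev ** n
        if p_bon > 0:
            kl += p_bon * math.log(p_bon / p)
        cdf_prev = cdf
    return kl

# Toy reference: two equally likely outcomes, best-of-2 sampling.
exact = exact_kl_discrete([0.5, 0.5], n=2)
bound = kl_formula(2)
```

In this toy case the exact divergence comes out strictly below the formula's value, which is consistent with the formula acting as an upper bound rather than an equality.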
This research paper explores using score matching (a technique for estimating the gradient of the log-density of a data distribution) to perform causal discovery in nonlinear additive noise models. The authors demonstrate that the causal graph's structure can be inferred from the score function, particularly its Jacobian. They propose a new algorithm, SCORE, which estimates the causal order by analyzing the variance of the score's Jacobian diagonal elements and then prunes edges using established me...
May 27, 2025•19 min
This research paper focuses on improving score-based generative models (SBGMs) to produce high-quality, high-resolution images. The authors identify limitations of existing SBGMs, specifically their inability to scale to higher resolutions and occasional training instability. They propose new theoretical analyses and techniques for selecting noise scales, an efficient method for incorporating noise information, and a process for configuring annealed Langevin dynamics, all of which are crucial...
May 27, 2025•20 min
This paper introduces Disagreement-Aware Confidence Alignment (DACA), an unsupervised method for calibrating the confidence of post-trained large language models (PoLMs). While pre-trained language models (PLMs) are typically well-calibrated, post-training can lead to over-confidence, especially with limited labeled data. DACA addresses this by leveraging the well-calibrated confidence of PLMs on unlabeled data, specifically by optimizing calibration parameters only on examples where PLM and ...
May 27, 2025•14 min
This paper introduces AlphaEvolve, a system designed to automate the discovery of advanced algorithms by leveraging large language models (LLMs) within an evolutionary framework. The system works by taking a user-defined problem and evaluation criteria, then iteratively generating and improving code solutions through an evolutionary process powered by LLM ensembles. AlphaEvolve has successfully applied this method to solve complex open problems in areas like matrix multiplication and various fie...
May 27, 2025•24 min
This academic paper presents vec2vec, a novel method for translating text embeddings between different models without requiring paired data or prior knowledge of the encoders. The authors demonstrate that this unsupervised technique successfully aligns embeddings from various models into a universal latent space, preserving the geometric structure and semantics of the original data. They show that these translated embeddings can then be used to extract sensitive information from documents, eve...
May 27, 2025•23 min
This academic paper investigates how humans represent and infer goals, proposing that goals can be formalized as reward-producing programs. The researchers developed a physics-based game environment where participants created and demonstrated novel goals. By collecting natural language descriptions and formal scoring criteria, they analyzed the relationship between goal complexity, reward structure, and perceived difficulty. A proof-of-concept computational method is presented, demonstrating...
May 27, 2025•19 min
This paper introduces Trial-Error-Explain In-Context Learning (TICL), a method for personalizing large language models (LLMs) to match individual user writing styles without requiring model fine-tuning. TICL expands the in-context learning prompt by adding model-generated negative examples and explanations that highlight discrepancies from the target style. Evaluations demonstrate that TICL significantly outperforms existing methods in generating text that stylistically aligns with authors, indi...
May 27, 2025•12 min
This research investigates how little training data is needed for Reinforcement Learning with Verifiable Rewards (RLVR) to significantly boost the mathematical reasoning abilities of large language models (LLMs). Surprisingly, the authors demonstrate that training on just one carefully chosen example can achieve performance comparable to using datasets containing thousands, resulting in substantial improvements on mathematical benchmarks. They explore the phenomena observed with such limi...
May 27, 2025•13 min
This paper introduces Test-Time Reinforcement Learning (TTRL), a novel method enabling Large Language Models (LLMs) to improve performance on unlabeled test data using Reinforcement Learning (RL). TTRL overcomes the lack of ground-truth labels by employing majority voting on multiple model outputs to estimate rewards, essentially allowing models to self-supervise their training. The research demonstrates that this approach leads to significant performance gains across various reasoning tasks a...
May 27, 2025•18 min
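The majority-voting idea behind TTRL can be sketched in a few lines. This is a minimal illustration, assuming that each sampled output has already been reduced to a final answer string and that the reward is binary (1.0 for agreeing with the majority, 0.0 otherwise); the summary does not specify the exact reward rule, and the function name is hypothetical.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards): reward 1.0 for answers that match
    the most frequent answer, 0.0 for all others."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # The most common answer stands in for the missing ground-truth label.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Five sampled answers to the same unlabeled question.
label, rewards = majority_vote_rewards(["12", "12", "7", "12", "9"])
```

The resulting rewards could then be passed to any RL update in place of rewards computed from ground-truth labels.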
This paper presents research exploring whether a model-free reinforcement learning agent, specifically a DRC agent playing the game Sokoban, learns to plan. Through a concept-based interpretability methodology involving probing for planning-relevant concepts like future agent and box movements, investigating how plans are formed internally, and verifying the causal link between internal representations and behavior through interventions, the authors provide mechanistic evidence of emergent plan...
May 26, 2025•15 min
This paper proposes a new reward system for large language models (LLMs) called agentic reward modeling, which aims to create more reliable rewards by integrating human preferences with verifiable correctness signals. An empirical implementation, named REWARDAGENT, is presented, which combines human preference rewards with signals related to factuality and instruction following. Extensive experiments show that REWARDAGENT outperforms traditional reward models on benchmarks and in practical ap...
May 26, 2025•13 min
This research introduces a novel method for aligning large language models (LLMs) with human preferences while avoiding common pitfalls like reward hacking and spurious correlations. The authors propose a causal reward modeling approach that integrates causal inference and counterfactual invariance to ensure that reward predictions are based on true relationships rather than irrelevant data patterns. Through experiments on various datasets, including those focused on sycophancy, length, concept...
May 26, 2025•13 min
This paper introduces an approach to optimize the computational resources used by language models (LMs) when responding to different queries. Instead of applying the same level of processing to every request, the method learns to predict how much a query would benefit from more intensive computation and then allocates resources adaptively. This is achieved by training a model to estimate the potential improvement in output quality (marginal reward) for a given input and computation budget. The ...
May 26, 2025•20 min
This paper introduces COCO-FACET, a new benchmark dataset designed to evaluate text-to-image retrieval models on attribute-focused queries, which differ from traditional general image caption queries. The researchers demonstrate that existing models, including CLIP-like and MLLM-based models, struggle with these specific attributes, especially those less prominent in images or less explored in training data, such as time and weather. To address this, they propose using promptable image embeddings...
May 26, 2025•16 min
This paper introduces Unified Fine-Tuning (UFT), a novel method for enhancing the reasoning capabilities of large language models (LLMs) by integrating supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). The authors argue that traditional SFT and RFT have limitations, with SFT potentially overfitting and RFT being constrained by the base model's initial capacity. UFT addresses these issues by blending memorization through supervised signals (hints) with exploration through reinfo...
May 26, 2025•14 min
This paper investigates challenges in high-dimensional Bayesian optimization (HDBO), particularly focusing on vanishing gradients that hinder the effectiveness of standard methods in high dimensions. It highlights how initialization schemes and acquisition function optimization contribute to these issues, and explains why recent methods succeed by promoting local search behaviors and mitigating vanishing gradients. The authors propose a simple maximum likelihood estimation (MLE) based approach ...
May 26, 2025•20 min
This academic paper introduces Simple Energy Adaptation (SEA), a novel method for aligning large language models (LLMs) with human preferences during the inference phase. Unlike traditional methods that rely on discrete searches within a limited set of responses from the base model, SEA formulates alignment as an iterative optimization process in a continuous latent space. By applying gradient-based Langevin Dynamics to the continuous output logits, guided by an energy function derived from the...
May 25, 2025•16 min
This academic paper explores methods to improve the efficiency and accuracy of Large Language Models (LLMs) when extra computation is spent at inference time, known as test-time scaling. The authors propose Self-Calibration, a technique to teach LLMs to reliably estimate their own confidence in an answer with a single pass. By incorporating these calibrated confidence scores, they develop efficient test-time scaling strategies, such as stopping repeated sampling early when a confident answer is fo...
May 25, 2025•24 min
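The early-stopping strategy mentioned in the Self-Calibration summary can be sketched as a simple loop. This is only an illustration under stated assumptions: `sample_fn` is a hypothetical callable that returns one `(answer, confidence)` pair per model call, and the threshold and sample budget are arbitrary values, not settings from the paper.

```python
def sample_until_confident(sample_fn, threshold=0.9, max_samples=8):
    """Draw (answer, confidence) pairs until one reaches the threshold
    or the budget runs out. Returns (answer, confidence, samples_used)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    best_answer, best_conf = None, float("-inf")
    for used in range(1, max_samples + 1):
        answer, conf = sample_fn()
        # Keep the most confident answer seen so far as a fallback.
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:
            break  # confident enough: skip the remaining samples
    return best_answer, best_conf, used

# Scripted stand-in for a model that reports its own confidence.
scripted = iter([("41", 0.35), ("42", 0.95), ("40", 0.20)])
answer, conf, used = sample_until_confident(lambda: next(scripted))
```

Here the loop stops after the second sample, so the third scripted output is never requested; the saving in a real system depends on how often a confident answer appears early.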
This paper explores a novel perspective on conformal prediction, a method for providing performance guarantees for machine learning models without assuming a specific data distribution. The authors propose viewing conformal prediction through a Bayesian lens, specifically utilizing Bayesian quadrature, a technique for estimating integrals with uncertainty. They argue that this approach addresses limitations of traditional frequentist-based conformal prediction, offering more interpretable gua...
May 25, 2025•23 min
This research paper from Google DeepMind introduces Embed-then-Regress, a novel approach to Bayesian Optimization. This method leverages the ability of language models to embed string representations of various types of inputs, including synthetic, combinatorial, and hyperparameter configurations, into fixed-length vectors. These vectors then serve as features for a Transformer-based regressor trained using in-context learning. The paper demonstrates that this approach achieves...
May 25, 2025•27 min
This document presents Self-Evolving Curriculum (SEC), a novel method for reinforcement learning (RL) fine-tuning of large language models (LLMs) to enhance their reasoning capabilities. SEC frames curriculum selection as a non-stationary Multi-Armed Bandit (MAB) problem, where problem categories represent individual "arms". It learns a curriculum policy concurrently with LLM training, utilizing the absolute advantage from policy gradient methods as a metric for learning gain to dynamically a...
May 25, 2025•15 min
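The bandit framing in the SEC summary can be sketched as follows. The summary only says that problem categories are arms and that the absolute advantage serves as the learning-gain signal; the exponential-moving-average update, the softmax sampling rule, and every name and hyperparameter below are assumptions made for illustration.

```python
import math
import random

class CurriculumBandit:
    """Non-stationary bandit over problem categories (illustrative sketch)."""

    def __init__(self, categories, lr=0.5, temperature=1.0, seed=0):
        self.q = {c: 0.0 for c in categories}  # running learning-gain estimates
        self.lr = lr
        self.temperature = temperature
        self.rng = random.Random(seed)

    def probabilities(self):
        """Softmax over the estimates (max subtracted for numerical stability)."""
        m = max(self.q.values())
        weights = {c: math.exp((v - m) / self.temperature) for c, v in self.q.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self):
        """Pick the next category to train on."""
        probs = self.probabilities()
        cats = list(probs)
        return self.rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

    def update(self, category, advantages):
        """Move the estimate toward the mean absolute advantage of a batch."""
        reward = sum(abs(a) for a in advantages) / len(advantages)
        self.q[category] += self.lr * (reward - self.q[category])

bandit = CurriculumBandit(["easy", "medium", "hard"])
bandit.update("medium", [0.8, -0.6, 0.4])  # advantages from one training batch
probs = bandit.probabilities()
next_category = bandit.sample()
```

A constant step size keeps the estimates responsive as the model improves, which is one simple way to handle the non-stationarity the summary mentions.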
This paper introduces online decision-focused learning (DFL), a framework for training predictive models used in dynamic, sequential decision-making tasks where data distributions and objectives change over time. The authors propose a new algorithm, Decision-Focused Online Gradient Descent (DF-OGD), which handles the non-differentiable and non-convex nature of the decision objective by regularizing the problem and using an optimistic approach with perturbations. Theoretical dynamic regret boun...
May 25, 2025•21 min
This paper introduces FisherSFT, a method for making supervised fine-tuning (SFT) of large language models (LLMs) more data-efficient by selecting the most informative training examples. The key concept is to choose examples that maximize information gain, which is approximated by evaluating the Hessian of the LLM's log-likelihood. This approach uses a computationally efficient approximation based on linearizing the LLM's last layer and employs a greedy algorithm to select sentences with the...
May 25, 2025•14 min
This academic paper explores a novel technique for automatic reward shaping in reinforcement learning, specifically addressing the challenge of learning from offline data that may contain unobserved confounding factors. The authors propose using causal state value upper bounds, derived from this confounded data, as potential functions for Potential-Based Reward Shaping (PBRS). They demonstrate theoretically and through simulations that their method, when applied to a model-free learner like Q...
May 25, 2025•21 min
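Potential-Based Reward Shaping itself has a standard form: the shaped reward adds `gamma * potential(next_state) - potential(state)` to the environment reward. The sketch below shows only that generic step. The potential values are made-up placeholders standing in for the causal upper bounds the paper derives, and treating the potential of a terminal state as zero is a common convention assumed here.

```python
def shaped_reward(reward, potential, state, next_state, gamma=0.99, terminal=False):
    """Standard PBRS: r + gamma * Phi(s') - Phi(s).
    The potential of a terminal next state is taken to be 0 (assumed convention)."""
    next_potential = 0.0 if terminal else potential(next_state)
    return reward + gamma * next_potential - potential(state)

# Hypothetical potentials, e.g. upper bounds on state values (placeholder numbers).
upper_bounds = {"s0": 1.0, "s1": 2.0}
potential = upper_bounds.get

step = shaped_reward(0.0, potential, "s0", "s1", gamma=0.5)
last = shaped_reward(0.0, potential, "s0", "s1", gamma=0.5, terminal=True)
```

A learner such as tabular Q-learning would simply use the shaped value in place of the raw reward in its update.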
This academic paper introduces Trajectory Bellman Residual Minimization (TBRM), a novel value-based reinforcement learning algorithm designed to enhance the reasoning capabilities of large language models (LLMs), particularly in mathematical problem-solving. Unlike prevailing policy-based methods like PPO and GRPO, TBRM streamlines the training process by eliminating the need for critics, importance sampling, or clipping mechanisms, requiring only a single rollout per prompt. The authors pres...
May 25, 2025•18 min
This paper critically examines the best-of-n policy, a common method for aligning generative language models by selecting the highest-reward sample from $n$ options drawn from a reference policy. It shows that a widely used analytical formula for the KL divergence between the best-of-n policy and the reference is not exact, proving that the formula is only an upper bound. The authors analyze the conditions under which this bound is tight or loose and propose a new, more accurate estimator for the KL diverg...
May 25, 2025•14 min
This academic paper explores methods to improve Bayesian optimization (BO), a process for finding optimal settings for complex, costly functions. The authors address the challenge of maximizing acquisition functions, which are heuristics guiding BO's search and are often difficult to optimize, especially when evaluating multiple points simultaneously. They demonstrate that Monte Carlo integration allows for gradient-based optimization of a wide range of acquisition functions. Furthermore, they...
May 24, 2025•19 min
This paper introduces Bayesian Prompt Ensembles (BayesPE), a novel method for quantifying uncertainty in black-box large language models (LLMs) without requiring access to their internal parameters or retraining. BayesPE achieves this by ensembling the outputs of an LLM prompted with various semantically equivalent instructions, learning the optimal weighting for each prompt through approximate Bayesian variational inference on a small validation dataset. The paper demonstrates that this approa...
May 24, 2025•17 min