Best AI papers explained

Enoch H. Kang•podcasters.spotify.com

Technology

Cut through the noise. We curate and break down the most important AI papers so you don’t have to.

Last refreshed: January 25th, 2026 at 3:13 PM

Episodes

Qwen 2.5, RL, and Random Rewards

We investigate how various reward signals, even spurious and random ones, impact the performance of different language models fine-tuned for mathematical reasoning using Reinforcement Learning with Verifiable Rewards (RLVR). The research demonstrates that while Qwen models show significant improvement even with weak or incorrect rewards, this benefit is not universal, with Llama and OLMo models showing little to no gain. The study links this disparity to pre-existing reasoning patterns, particula...

May 27, 2025•15 min

Theoretical guarantees on the best-of-n alignment policy

This paper critically examines the best-of-n policy, a common method for aligning generative language models by selecting the highest-reward sample from $n$ options drawn from a reference policy. It disproves a widely used analytical formula for the KL divergence between the best-of-n policy and the reference, proving that the formula is only an upper bound. The authors analyze the conditions under which this bound is tight or loose and propose a new, more accurate estimator for the KL diverg...

May 27, 2025•15 min
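
As a rough illustration of the summary above, here is a minimal sketch of best-of-n selection together with the analytical expression usually quoted for its KL divergence, log(n) - (n - 1)/n. That expression is stated from general knowledge rather than from the summary, and the uniform toy reference with distinct rewards, along with the function names, are assumptions made only for this example.

```python
import math


def best_of_n(samples, reward):
    """Return the highest-reward sample among the n candidates."""
    return max(samples, key=reward)


def kl_analytical_bound(n):
    """Commonly quoted expression log(n) - (n - 1)/n (an upper bound per the summary)."""
    return math.log(n) - (n - 1) / n


def exact_kl_uniform(m, n):
    """Exact KL(best-of-n || reference) for a toy reference that is uniform
    over m outcomes with distinct rewards (hypothetical setting)."""
    kl = 0.0
    for i in range(1, m + 1):
        # Probability that the i-th lowest-reward outcome is the maximum of n draws.
        p = (i / m) ** n - ((i - 1) / m) ** n
        # The reference assigns probability 1/m, so the ratio p / (1/m) equals p * m.
        kl += p * math.log(p * m)
    return kl
```

In this toy setting, `exact_kl_uniform(2, 2)` stays below `kl_analytical_bound(2)`, which is consistent with the expression being an upper bound rather than an equality when the reference is discrete.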

Score Matching Enables Causal Discovery of Nonlinear Additive Noise Models

This research paper explores using score matching, a technique for estimating the gradient of the log-density of a data distribution, to perform causal discovery in nonlinear additive noise models. The authors demonstrate that the causal graph's structure can be inferred from the score function, particularly its Jacobian. They propose a new algorithm, SCORE, which estimates the causal order by analyzing the variance of the score's Jacobian diagonal elements and then prunes edges using established me...

May 27, 2025•19 min

Improved Techniques for Training Score-Based Generative Models

This research paper focuses on improving score-based generative models (SBGMs) to produce high-quality, high-resolution images. The authors identify limitations of existing SBGMs, specifically their inability to scale to higher resolutions and occasional training instability. They propose new theoretical analyses and techniques for selecting noise scales, an efficient method for incorporating noise information, and a process for configuring annealed Langevin dynamics, all of which are crucial...

May 27, 2025•20 min

Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

This paper introduces Disagreement-Aware Confidence Alignment (DACA), an unsupervised method for calibrating the confidence of post-trained large language models (PoLMs). While pre-trained language models (PLMs) are typically well-calibrated, post-training can lead to over-confidence, especially with limited labeled data. DACA addresses this by leveraging the well-calibrated confidence of PLMs on unlabeled data, specifically by optimizing calibration parameters only on examples where PLM and ...

May 27, 2025•14 min

AlphaEvolve: A coding agent for scientific and algorithmic discovery

This paper introduces AlphaEvolve, a system designed to automate the discovery of advanced algorithms by leveraging large language models (LLMs) within an evolutionary framework. The system works by taking a user-defined problem and evaluation criteria, then iteratively generating and improving code solutions through an evolutionary process powered by LLM ensembles. AlphaEvolve has successfully applied this method to solve complex open problems in areas like matrix multiplication and various fie...

May 27, 2025•24 min

Harnessing the Universal Geometry of Embeddings

This academic paper presents vec2vec, a novel method for translating text embeddings between different models without requiring paired data or prior knowledge of the encoders. The authors demonstrate that this unsupervised technique successfully aligns embeddings from various models into a universal latent space, preserving the geometric structure and semantics of the original data. They show that these translated embeddings can then be used to extract sensitive information from documents, eve...

May 27, 2025•23 min

Goal Inference using Reward-Producing Programs in a Novel Physics Environment

This academic paper investigates how humans represent and infer goals, proposing that goals can be formalized as reward-producing programs. The researchers developed a physics-based game environment where participants created and demonstrated novel goals. By collecting natural language descriptions and formal scoring criteria, they analyzed the relationship between goal complexity, reward structure, and perceived difficulty. A proof-of-concept computational method is presented, demonstrating...

May 27, 2025•19 min

Trial-Error-Explain In-Context Learning for Personalized Text Generation

This paper introduces Trial-Error-Explain In-Context Learning (TICL), a method for personalizing large language models (LLMs) to match individual user writing styles without requiring model fine-tuning. TICL expands the in-context learning prompt by adding model-generated negative examples and explanations that highlight discrepancies from the target style. Evaluations demonstrate that TICL significantly outperforms existing methods in generating text that stylistically aligns with authors, indi...

May 27, 2025•12 min

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

This research investigates how little training data is needed for Reinforcement Learning with Verifiable Reward (RLVR) to significantly boost the mathematical reasoning abilities of large language models (LLMs). Surprisingly, the authors demonstrate that training on even just one carefully chosen example can achieve performance comparable to using datasets containing thousands, resulting in substantial improvements on mathematical benchmarks. They explore the phenomena observed with such limi...

May 27, 2025•13 min

Test-Time Reinforcement Learning (TTRL)

This paper introduces Test-Time Reinforcement Learning (TTRL), a novel method enabling Large Language Models (LLMs) to improve performance on unlabeled test data using Reinforcement Learning (RL). TTRL overcomes the lack of ground-truth labels by employing majority voting on multiple model outputs to estimate rewards, essentially allowing models to self-supervise their training. The research demonstrates that this approach leads to significant performance gains across various reasoning tasks a...

May 27, 2025•18 min
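
The majority-voting idea in the summary above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the 1.0/0.0 reward values, and the tie-breaking behavior (first answer seen wins) are all assumptions made for this example.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Pick the most frequent answer as a pseudo-label, then reward agreement.

    `answers` holds the final answers extracted from several sampled outputs
    for the same unlabeled prompt (the extraction step is assumed, not shown).
    """
    # most_common(1) returns [(answer, count)]; ties go to the first answer seen.
    label, _ = Counter(answers).most_common(1)[0]
    # Hypothetical binary reward: 1.0 if an output matches the pseudo-label.
    rewards = [1.0 if a == label else 0.0 for a in answers]
    return label, rewards
```

For example, with sampled answers `["4", "4", "5", "4", "7"]`, the pseudo-label is `"4"` and only the three matching outputs receive a positive reward, which could then feed an RL update without any ground-truth label.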

Interpreting Emergent Planning in Model-Free Reinforcement Learning

This paper presents research exploring whether a model-free reinforcement learning agent, specifically a DRC agent playing the game Sokoban, learns to plan. Through a concept-based interpretability methodology involving probing for planning-relevant concepts like future agent and box movements, investigating how plans are formed internally, and verifying the causal link between internal representations and behavior through interventions, the authors provide mechanistic evidence of emergent plan...

May 26, 2025•15 min

Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

This paper proposes a new reward system for large language models (LLMs) called agentic reward modeling, which aims to create more reliable rewards by integrating human preferences with verifiable correctness signals. An empirical implementation, named REWARDAGENT, is presented, which combines human preference rewards with signals related to factuality and instruction following. Extensive experiments show that REWARDAGENT outperforms traditional reward models on benchmarks and in practical ap...

May 26, 2025•13 min

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

This research introduces a novel method for aligning large language models (LLMs) with human preferences while avoiding common pitfalls like reward hacking and spurious correlations. The authors propose a causal reward modeling approach that integrates causal inference and counterfactual invariance to ensure that reward predictions are based on true relationships rather than irrelevant data patterns. Through experiments on various datasets, including those focused on sycophancy, length, concept...

May 26, 2025•13 min

Learning How Hard to Think: Input-Adaptive Allocation of LM Computation

This paper introduces an approach to optimize the computational resources used by language models (LMs) when responding to different queries. Instead of applying the same level of processing to every request, the method learns to predict how much a query would benefit from more intensive computation and then allocates resources adaptively. This is achieved by training a model to estimate the potential improvement in output quality (marginal reward) for a given input and computation budget. The ...

May 26, 2025•20 min

Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

This paper introduces COCO-FACET, a new benchmark dataset designed to evaluate text-to-image retrieval models on attribute-focused queries, which differ from traditional general image caption queries. The researchers demonstrate that existing models, including CLIP-like and MLLM-based models, struggle with these specific attributes, especially those less prominent in images or less explored in training data like time and weather. To address this, they propose using promptable image embeddings...

May 26, 2025•16 min

UFT: Unifying Supervised and Reinforcement Fine-Tuning

This paper introduces Unified Fine-Tuning (UFT), a novel method for enhancing the reasoning capabilities of large language models (LLMs) by integrating supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). The authors argue that traditional SFT and RFT have limitations, with SFT potentially overfitting and RFT being constrained by the base model's initial capacity. UFT addresses these issues by blending memorization through supervised signals (hints) with exploration through reinfo...

May 26, 2025•14 min

Understanding High-Dimensional Bayesian Optimization

This paper investigates challenges in high-dimensional Bayesian optimization (HDBO), particularly focusing on vanishing gradients that hinder the effectiveness of standard methods in high dimensions. It highlights how initialization schemes and acquisition function optimization contribute to these issues, and explains why recent methods succeed by promoting local search behaviors and mitigating vanishing gradients. The authors propose a simple maximum likelihood estimation (MLE) based approach ...

May 26, 2025•20 min

Inference time alignment in continuous space

This academic paper introduces Simple Energy Adaptation (SEA), a novel method for aligning large language models (LLMs) with human preferences during the inference phase. Unlike traditional methods that rely on discrete searches within a limited set of responses from the base model, SEA formulates alignment as an iterative optimization process in a continuous latent space. By applying gradient-based Langevin Dynamics to the continuous output logits, guided by an energy function derived from the...

May 25, 2025•16 min

Efficient Test-Time Scaling via Self-Calibration

This academic paper explores methods to improve the efficiency and accuracy of Large Language Models (LLMs) during the final step of generating responses, known as test-time scaling. The authors propose Self-Calibration, a technique to teach LLMs to reliably estimate their own confidence in an answer with a single pass. By incorporating these calibrated confidence scores, they develop efficient test-time scaling strategies, such as stopping repeated sampling early when a confident answer is fo...

May 25, 2025•24 min

Conformal Prediction via Bayesian Quadrature

This paper explores a novel perspective on conformal prediction, a method for providing performance guarantees for machine learning models without assuming a specific data distribution. The authors propose viewing conformal prediction through a Bayesian lens, specifically utilizing Bayesian quadrature, a technique for estimating integrals with uncertainty. They argue that this approach addresses limitations of traditional frequentist-based conformal prediction, offering more interpretable gua...

May 25, 2025•23 min

Predicting from Strings: Language Model Embeddings for Bayesian Optimization

This research paper from Google DeepMind introduces a novel approach called Embed-then-Regress for Bayesian Optimization. This method leverages the ability of language models to embed string representations of various types of inputs, including synthetic, combinatorial, and hyperparameter configurations, into fixed-length vectors. These vectors then serve as features for a Transformer-based regressor trained using in-context learning. The paper demonstrates that this approach achieves...

May 25, 2025•27 min

Self-Evolving Curriculum for LLM Reasoning

This document presents Self-Evolving Curriculum (SEC), a novel method for reinforcement learning (RL) fine-tuning of large language models (LLMs) to enhance their reasoning capabilities. SEC frames curriculum selection as a non-stationary Multi-Armed Bandit (MAB) problem, where problem categories represent individual "arms". It learns a curriculum policy concurrently with LLM training, utilizing the absolute advantage from policy gradient methods as a metric for learning gain to dynamically a...

May 25, 2025•15 min
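
To make the bandit framing in the summary above more concrete, here is a hedged sketch of a curriculum bandit over problem categories. The summary only says that categories are arms and that absolute advantage serves as the learning-gain signal; the exponential moving-average update, the softmax sampling rule, the class name, and the `step_size` and `temperature` parameters are all assumptions chosen for illustration.

```python
import math
import random


class CurriculumBandit:
    """Toy non-stationary bandit whose arms are problem categories."""

    def __init__(self, categories, step_size=0.5, temperature=1.0):
        self.values = {c: 0.0 for c in categories}  # estimated learning gain per arm
        self.step_size = step_size                  # assumed constant step size
        self.temperature = temperature              # assumed softmax temperature

    def probabilities(self):
        """Softmax over arm values (subtracting the max for numerical stability)."""
        top = max(self.values.values())
        weights = {c: math.exp((v - top) / self.temperature)
                   for c, v in self.values.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self, rng=random):
        """Draw the next category to train on."""
        probs = self.probabilities()
        cats = list(probs)
        return rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]

    def update(self, category, advantages):
        """Move the arm's value toward the mean absolute advantage of a batch."""
        gain = sum(abs(a) for a in advantages) / len(advantages)
        self.values[category] += self.step_size * (gain - self.values[category])
```

A constant step size is used here because it lets old observations fade, which is one common way to track a reward signal that drifts as the model improves.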

Online Decision-Focused Learning in Dynamic Environments

This paper introduces online decision-focused learning (DFL), a framework for training predictive models used in dynamic, sequential decision-making tasks where data distributions and objectives change over time. The authors propose a new algorithm, Decision-Focused Online Gradient Descent (DF-OGD), which handles the non-differentiable and non-convex nature of the decision objective by regularizing the problem and using an optimistic approach with perturbations. Theoretical dynamic regret boun...

May 25, 2025•21 min

FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain

This paper introduces FisherSFT, a method for making supervised fine-tuning (SFT) of large language models (LLMs) more data-efficient by selecting the most informative training examples. The key concept is to choose examples that maximize information gain, which is approximated by evaluating the Hessian of the LLM's log-likelihood. This approach uses a computationally efficient approximation based on linearizing the LLM's last layer and employs a greedy algorithm to select sentences with the...

May 25, 2025•14 min

Reward Shaping from Confounded Offline Data

This academic paper explores a novel technique for automatic reward shaping in reinforcement learning, specifically addressing the challenge of learning from offline data that may contain unobserved confounding factors. The authors propose using causal state value upper bounds, derived from this confounded data, as potential functions for Potential-Based Reward Shaping (PBRS). They demonstrate theoretically and through simulations that their method, when applied to a model-free learner like Q...

May 25, 2025•21 min
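
For readers unfamiliar with Potential-Based Reward Shaping, the standard shaping term is gamma * Phi(s') - Phi(s) added to the environment reward. The sketch below shows only that generic formula; how the paper derives its causal upper bounds is not covered, so `potential` is a hypothetical stand-in for them, and treating the potential of a terminal state as zero is a common convention assumed here.

```python
def shaped_reward(reward, potential, state, next_state, gamma=0.99, done=False):
    """Return reward + gamma * Phi(next_state) - Phi(state).

    `potential` is any callable mapping a state to a float; in the summary's
    setting it would be a state value upper bound (assumed, not implemented here).
    """
    # Assumed convention: the potential of a terminal state is zero.
    next_potential = 0.0 if done else potential(next_state)
    return reward + gamma * next_potential - potential(state)
```

With a toy potential of 1.0 for state `"s0"` and 2.0 for `"s1"`, and `gamma=0.5`, the transition from `"s0"` to `"s1"` with zero environment reward yields a shaped reward of 0.0, since 0.5 * 2.0 - 1.0 cancels out.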

Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

This academic paper introduces Trajectory Bellman Residual Minimization (TBRM), a novel value-based reinforcement learning algorithm designed to enhance the reasoning capabilities of large language models (LLMs), particularly in mathematical problem-solving. Unlike prevailing policy-based methods like PPO and GRPO, TBRM streamlines the training process by eliminating the need for critics, importance sampling, or clipping mechanisms, requiring only a single rollout per prompt. The authors pres...

May 25, 2025•18 min

Understanding Best-of-N Language Model Alignment

May 25, 2025•14 min

Maximizing Acquisition Functions for Bayesian Optimization - and its relation to Gradient Descent

This academic paper explores methods to improve Bayesian optimization (BO), a process for finding optimal settings for complex, costly functions. The authors address the challenge of maximizing acquisition functions, which are heuristics guiding BO's search and are often difficult to optimize, especially when evaluating multiple points simultaneously. They demonstrate that Monte Carlo integration allows for gradient-based optimization of a wide range of acquisition functions. Furthermore, they...

May 24, 2025•19 min

Bayesian Prompt Ensembles: Model Uncertainty Estimation for Black-Box Large Language Models

This paper introduces Bayesian Prompt Ensembles (BayesPE), a novel method for quantifying uncertainty in black-box large language models (LLMs) without requiring access to their internal parameters or retraining. BayesPE achieves this by ensembling the outputs of an LLM prompted with various semantically equivalent instructions, learning the optimal weighting for each prompt through approximate Bayesian variational inference on a small validation dataset. The paper demonstrates that this approa...

May 24, 2025•17 min

Hosted on Spotify for Creators (Anchor)
