This academic paper, arXiv:2503.05070, introduces PromptPex, a tool designed to automatically generate and evaluate unit tests for language model prompts. The authors highlight that prompts function similarly to traditional software but require new testing methods due to their dependency on the specific AI model interpreting them. PromptPex extracts specifications from a prompt to create varied and targeted tests, which are valuable for identifying regressions and understanding model behavior...
Jun 08, 2025•12 min
Jonathan Richens, David Abel, Alexis Bellot and Tom Everitt. This paper focuses on the necessity of world models for creating general and capable AI agents, specifically those that can generalize to multi-step goal-directed tasks. The authors formally demonstrate that any agent capable of this type of generalization must have learned a predictive model of its environment, and that the accuracy of this learned model is directly tied to the agent's performance and the complexity of the goals it c...
Jun 08, 2025•15 min
This paper examines the reasoning capabilities of Large Reasoning Models (LRMs) compared to standard Large Language Models (LLMs) by testing them on controlled puzzle environments. The researchers found that LRM performance collapses entirely beyond a certain complexity, and surprisingly, their reasoning effort decreases as problems become too difficult. The study reveals three complexity regimes: standard LLMs perform better on low complexity, LRMs are advantageous at medium complexity, and b...
Jun 07, 2025•13 min
This excerpt from a handbook chapter explores the evolving landscape of decision-making in the information age, highlighting the increasing collaboration between humans and algorithms. It outlines a three-stage model of human decision processes when unaided and discusses how bounded rationality leads to the use of heuristics and intuitive judgments when resources are limited. The text further categorizes algorithmic collaboration into informing, recommending, and deciding, providing examples of...
Jun 07, 2025•59 min
This paper presents a causal framework for supervised domain adaptation, addressing how models can effectively generalize from source domains with abundant data to a target domain with limited examples. The authors propose structure-informed procedures that utilize knowledge of the underlying causal structure and domain discrepancies to transport inferences, achieving faster adaptation rates than traditional methods. They also introduce structure-agnostic algorithms that perform nearly as well...
Jun 06, 2025•44 min
This academic paper proposes Conformal Arbitrage (CA), a post-deployment framework for balancing competing objectives in language models, such as helpfulness versus harmlessness or cost versus accuracy. CA uses a data-driven threshold calibrated with conformal risk control to decide when to use a potentially faster or cheaper "Primary" model optimized for a primary goal and when to defer to a more cautious "Guardian" model or human expert aligned with a safety objective. This approac...
Jun 06, 2025•22 min
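The thresholding idea in the Conformal Arbitrage entry above can be illustrated with a minimal sketch of a conformal-risk-control style calibration. Everything specific here is an assumption for illustration, not the paper's procedure: the per-example `scores` (higher means the Primary model is more trustworthy), the `primary_losses` in [0, 1], the simplification that deferring to the Guardian incurs zero loss, and the function names `calibrate_threshold` and `route`.

```python
def calibrate_threshold(scores, primary_losses, alpha, max_loss=1.0):
    """Pick the smallest threshold tau whose conformal risk bound is <= alpha.

    Assumed routing rule: use the Primary model when score >= tau,
    otherwise defer to the Guardian (assumed here to incur zero loss).
    """
    n = len(scores)
    if n == 0 or n != len(primary_losses):
        raise ValueError("need equally sized, non-empty calibration lists")
    # Ascending candidates: a smaller tau sends more traffic to the Primary model.
    candidates = sorted(set(scores)) + [float("inf")]
    for tau in candidates:
        # Empirical risk of the routing rule on the calibration set.
        emp_risk = sum(l for s, l in zip(scores, primary_losses) if s >= tau) / n
        # Finite-sample correction used in conformal risk control.
        if (n / (n + 1)) * emp_risk + max_loss / (n + 1) <= alpha:
            return tau
    return float("inf")  # no threshold meets the target: always defer


def route(score, tau):
    """Return which model handles a request under the assumed routing rule."""
    return "primary" if score >= tau else "guardian"


# Hypothetical calibration data, for illustration only.
cal_scores = [0.1, 0.4, 0.6, 0.9]
cal_losses = [1.0, 1.0, 0.0, 0.0]
tau = calibrate_threshold(cal_scores, cal_losses, alpha=0.3)
print(tau, route(0.7, tau), route(0.2, tau))
```

Because the empirical risk can only shrink as the threshold rises, scanning candidates in ascending order returns the least conservative threshold that still satisfies the bound.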
This paper introduces a simulation-based method for statistical inference in adaptive experiments, specifically addressing challenges that arise when analyzing data from multi-arm bandit designs. Unlike traditional randomized trials, adaptive designs modify treatment assignments during the experiment, which can complicate standard inference techniques. The proposed approach, called simulation with optimism, generates artificial experiment trajectories under a null hypothesis by adding a slight...
Jun 06, 2025•49 min
This position paper explores the evolution of Large Language Models into autonomous agents, proposing a unified theory that views both internal reasoning and external actions as equivalent tools for acquiring knowledge. The authors argue that for optimal behavior, an agent's decision boundary for using tools should align with its knowledge boundary, only resorting to external tools when internal knowledge is insufficient. They discuss how this alignment can be achieved through various training ...
Jun 06, 2025•22 min
This paper introduces quantitative LLM judges, a new approach for evaluating the output of large language models (LLMs) that aims to improve upon the "LLM-as-a-judge" framework. The core idea is to decouple the qualitative reasoning provided by an LLM judge (its textual evaluation) from the quantitative scoring. The framework utilizes a two-stage process where a frozen LLM provides a textual evaluation and initial score, and then a separate, lightweight model (like a generalized linear model) ...
Jun 06, 2025•18 min
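The two-stage idea in the quantitative LLM judges entry above can be sketched as follows: the frozen judge's raw scores are treated as a fixed input, and a lightweight model is fit on top of them to match human scores. The use of ordinary least squares (a one-feature linear model with identity link), the toy data, and the names `fit_linear_calibrator` and `calibrate` are illustrative assumptions, not the paper's actual model.

```python
from statistics import mean


def fit_linear_calibrator(judge_scores, human_scores):
    """Fit human ~ a * judge + b by ordinary least squares."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need at least two paired scores")
    j_bar, h_bar = mean(judge_scores), mean(human_scores)
    var = sum((j - j_bar) ** 2 for j in judge_scores)
    if var == 0:
        raise ValueError("judge scores must not all be identical")
    cov = sum((j - j_bar) * (h - h_bar)
              for j, h in zip(judge_scores, human_scores))
    a = cov / var
    b = h_bar - a * j_bar
    return a, b


def calibrate(judge_score, a, b):
    """Map a raw judge score onto the human scale with the fitted line."""
    return a * judge_score + b


# Hypothetical paired scores: the judge uses a 0-10 scale, humans a 0-5 scale.
judge = [2.0, 4.0, 6.0, 8.0]
human = [1.0, 2.0, 3.0, 4.0]
a, b = fit_linear_calibrator(judge, human)
print(a, b, calibrate(10.0, a, b))
```

Only the small calibrator is trained; the judge model itself stays frozen, which is what keeps the second stage cheap.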
This paper describes the Self-Challenging framework, a method for training large language model (LLM) agents to use tools by generating their own training tasks. The framework involves the agent acting as a "challenger" to create tasks and then as an "executor" to solve them using reinforcement learning. To ensure task quality, the paper introduces the "Code-as-Task" (CaT) formalism, where tasks are defined by an instruction, a verifiable code function, an example solution, and failure cases. ...
Jun 06, 2025•15 min
This paper introduces In-Context Pure Exploration (ICPE), a novel deep learning framework designed to learn exploration strategies for active sequential hypothesis testing. Unlike traditional methods that rely on explicit problem-specific algorithms, ICPE uses a Transformer architecture to infer the underlying problem structure directly from experience. The framework combines supervised and reinforcement learning, enabling agents to efficiently discover effective sampling techniques for ident...
Jun 06, 2025•30 min
This document investigates why bidirectional language models perform better than unidirectional models on natural language understanding tasks. The authors propose a new framework called Flow Neural Information Bottleneck (FlowNIB), which uses the Information Bottleneck principle to analyze the flow of information during training. FlowNIB dynamically balances maximizing information about the input and information relevant to the output. The study shows that bidirectional models preserve more mu...
Jun 06, 2025•19 min
This academic paper examines the faithfulness of chain-of-thought (CoT) reasoning in large language and vision-language models, specifically looking at how different types of biases affect model behavior and whether these biases are reflected in the models' CoTs. The research introduces a novel evaluation framework to analyze bias articulation and identifies a phenomenon of "inconsistent reasoning" where models show correct initial steps but ultimately change their answer based on a bias. A key ...
Jun 05, 2025•17 min
This academic paper introduces FIBO, a novel approach to Bayesian optimization (BO) that significantly streamlines the process. Traditional BO relies on sequentially building surrogate models and optimizing acquisition functions, which can be computationally expensive and time-consuming. FIBO bypasses these steps by employing a pretrained deep generative model that directly samples from the posterior distribution of the optimal point, achieving faster computation times without sacrificing op...
Jun 05, 2025•24 min
This paper from Google Research investigates the ability of large language models (LLMs) to perform probabilistic reasoning in interactive settings, specifically focusing on their capacity to infer user preferences over multiple interactions. The research finds that off-the-shelf LLMs struggle with this task compared to an optimal Bayesian model, demonstrating limited improvement as more information becomes available. To address this, the study introduces Bayesian teaching, a method where LLMs...
Jun 05, 2025•16 min
This paper details an innovative method for improving vision-language models (VLMs) by leveraging large language models (LLMs) to optimize the text prompts used in tasks like image classification. Current methods for prompt learning in VLMs can suffer from issues like lack of interpretability and overfitting. The proposed approach, termed Interpretable Prompt Optimization (IPO), uses an LLM as a parameter-free optimizer that iteratively refines prompts based on performance feedback and historic...
Jun 05, 2025•14 min
This paper introduces an innovative framework using an evolutionary algorithm to optimize prompts for vision-language models without requiring additional training. The method evolves prompts through iteration and selection to elicit complex multimodal reasoning abilities, such as breaking down tasks and employing external tools like Python interpreters for image manipulation. Experimental results demonstrate that this evolutionary prompt optimization, especially when coupled with tool usage, s...
Jun 05, 2025•19 min
This paper examines a fundamental limitation in evaluating large language models (LLMs): current methods primarily assess only their observable outputs, neglecting a potentially vast amount of unseen knowledge embedded within them. To address this, it introduces KnowSum, a statistical framework that estimates this hidden knowledge by extrapolating from the frequency of observed outputs, drawing parallels to ecological and linguistic methods for estimating unseen species or words. ...
Jun 04, 2025•14 min
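The unseen-species analogy in the KnowSum entry above can be made concrete with the classical Good-Turing estimate of missing mass: the share of observations that are singletons approximates the probability that the next observation is something not yet seen. This is a sketch of that classical estimator only; whether KnowSum uses this exact form is not stated in the summary, and the function name and toy data are hypothetical.

```python
from collections import Counter


def good_turing_missing_mass(observations):
    """Estimate the probability mass of unseen items as (#singletons / #observations)."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    # Items seen exactly once carry the information about what is still unseen.
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Hypothetical sample of distinct facts produced by a model across queries.
sample = ["a", "a", "b", "c", "c", "c", "d"]
print(good_turing_missing_mass(sample))  # 2 singletons out of 7 observations
```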
This document introduces CFGRL, a novel framework that bridges generative modeling, specifically diffusion guidance, and reinforcement learning. The core idea is to treat policy improvement as guiding a diffusion model, allowing for simple training akin to supervised learning while still enabling performance beyond the initial dataset. CFGRL can improve policies by combining a reference policy with an "optimality" distribution, and crucially, the degree of this improvement can be controlled du...
Jun 02, 2025•17 min
This academic paper presents Alita, a novel generalist agent designed to enhance scalable agentic reasoning with a focus on minimal predefinition and maximal self-evolution. Unlike conventional agents that rely heavily on pre-designed tools and workflows, Alita utilizes a radically simple design with a core web agent and the ability to autonomously generate, refine, and reuse capabilities via Model Context Protocols (MCPs). The paper highlights Alita's superior performance on benchmarks like ...
Jun 02, 2025•15 min
This academic paper proposes a local data attribution framework for online reinforcement learning (RL). The framework uses influence functions to identify which training data records negatively impact the RL agent's learning within each training round. By filtering out these harmful records, the proposed method, called Influence-guided Intervention and Filtering (IIF), demonstrates improved performance and sample efficiency in standard RL tasks and also shows promise in reducing toxicity in R...
Jun 02, 2025•26 min
This paper presents a theoretical analysis of how transformers can learn k-fold composition tasks, which involve combining multiple permutations. It proposes that transformers can achieve this through a hierarchical process, where each layer learns progressively more complex compositions, referred to as "hops." The document details a curriculum learning strategy (Algorithm 1) and a mixed training approach (Algorithm 2), demonstrating how transformers can learn these tasks efficiently. The an...
Jun 02, 2025•12 min
This academic paper introduces a new approach to preference learning by incorporating response time data alongside traditional binary choices. The authors highlight that while standard preference learning relies solely on which option a user prefers, the speed of the decision can provide valuable information about the strength of that preference. They propose novel methodologies, including a Neyman-orthogonal loss function, to leverage response time information based on the Evidence Accumulatio...
Jun 02, 2025•22 min
This research introduces A*-PO, a new reinforcement learning approach for refining large language models to enhance their reasoning capabilities. Unlike existing methods that are often computationally expensive and memory-intensive due to requiring multiple generations per prompt or explicit critic networks, A*-PO streamlines the process. It accomplishes this by initially estimating the optimal value function offline using samples from a reference policy, then performing on-policy updates with ...
May 31, 2025•23 min
The paper emphasizes the necessity of causal reasoning for reliable algorithmic decision-making (ADM). It explains that real-world decisions involve cause-and-effect relationships, making causal challenges inherent in ADM systems. To ensure reliability, ADM algorithms must incorporate explicit assumptions about the underlying causal structure. The text highlights the distinction between causal estimands (the true effects of decisions) and statistical estimands (what can be estimated from obser...
May 31, 2025•27 min
This academic paper explores how people attribute beliefs to others as a way of explaining their actions, focusing on the explanatory strength of a belief rather than just its probability. The authors developed a computational model that assesses this strength using three factors: accuracy, informativity, and causal relevance. Through an experiment where participants ranked belief statements describing a player's actions in a puzzle game, the research suggests that causal relevance is the stron...
May 31, 2025•11 min
This document presents a novel approach for estimating the similarity between Markov chains using only sampled data, without requiring full knowledge of their transition probabilities. The authors leverage the recent finding that bisimulation metrics, a tool for quantifying stochastic process similarity, are equivalent to optimal transport distances. They reformulate the problem as a linear program and propose a stochastic primal-dual optimization algorithm (SOMCOT) to solve it based on sample...
May 31, 2025•19 min
This research explores why Large Language Models (LLMs) struggle with tasks requiring global reasoning over long inputs. The authors propose that these limitations stem from constraints on information flow within LLMs, formalizing this with the Bounded Attention Prefix Oracle (BAPO) model. They classify problems as BAPO-easy or BAPO-hard, predicting that LLMs will fail on the latter. Empirical results with models like GPT-4o, Claude, and Gemini support this prediction, showing poor performance ...
May 31, 2025•18 min
This paper describes IDA-Bench, a new benchmark for evaluating Large Language Models (LLMs) as interactive data analysis agents. Unlike existing benchmarks that focus on single-turn interactions, IDA-Bench assesses LLMs in multi-round dialogues with a simulated user, mirroring the iterative and subjective nature of real-world data analysis. Tasks are derived from complex Kaggle notebooks and presented as sequential natural language instructions. Initial results indicate that even advanced LLMs...
May 31, 2025•23 min
This paper examines the performance of Prediction-Powered Inference (PPI++), a statistical method combining labeled and unlabeled data for estimation. While previous work suggested PPI++ always improved over using labeled data alone asymptotically, this analysis provides a finite-sample "no free lunch" result. It demonstrates that PPI++ only outperforms classical methods if the correlation between pseudo-labels and true labels is above a specific threshold dependent on the labeled sample size....
May 31, 2025•16 min
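For readers unfamiliar with the estimator discussed in the PPI++ entry above, here is a minimal sketch of the prediction-powered mean estimate with a tuning weight. It shows the standard form for estimating a mean: the labeled-data average plus a weighted correction from model predictions. The function name `ppi_mean`, the argument names, and the toy numbers are illustrative assumptions; the sketch does not reproduce the paper's finite-sample threshold.

```python
from statistics import mean


def ppi_mean(y_labeled, pred_labeled, pred_unlabeled, lam=1.0):
    """Prediction-powered estimate of a mean.

    lam = 0 recovers the classical labeled-only sample mean;
    lam = 1 gives the untuned prediction-powered estimate.
    """
    if len(y_labeled) != len(pred_labeled):
        raise ValueError("labels and labeled predictions must be paired")
    # Correction term: how predictions on the large unlabeled pool differ,
    # on average, from predictions on the small labeled set.
    correction = mean(pred_unlabeled) - mean(pred_labeled)
    return mean(y_labeled) + lam * correction


# Hypothetical data: three labeled points, four unlabeled predictions.
y = [1.0, 2.0, 3.0]
f_labeled = [1.5, 2.5, 3.5]
f_unlabeled = [2.0, 3.0, 4.0, 5.0]
print(ppi_mean(y, f_labeled, f_unlabeled, lam=1.0))
print(ppi_mean(y, f_labeled, f_unlabeled, lam=0.0))
```

The weight `lam` is the knob whose usefulness depends on how well predictions track true labels: with poorly correlated predictions, the correction term adds noise rather than information.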