We discuss Tina, a family of efficient reasoning models obtained by applying Low-Rank Adaptation (LoRA) during reinforcement learning to a small 1.5B-parameter language model. This approach demonstrates that strong reasoning performance, competitive with larger models, can be attained with significantly reduced computational costs. The authors explore the effectiveness of this minimalist strategy across various reasoning tasks and ablation studies, hypothesizing that LoRA facilitates rapid adap...
Apr 25, 2025•16 min
This research article explores the capacity of large language models (LLMs) to understand "theory of mind" (ToM), the human ability to attribute mental states to others. The author, Michal Kosinski, evaluated eleven LLMs using false-belief tasks, a standard method for assessing ToM in humans. The study's findings indicate a progression in LLM performance, with the most advanced model, ChatGPT-4, demonstrating a level of success comparable to that of a six-year-old child. The article discusses th...
Apr 25, 2025•15 min
QUEST: Quality-Aware Metropolis-Hastings Sampling for Machine Translation addresses the challenge of generating both high-quality and diverse translations in machine translation. The authors note that relying solely on the machine translation model's likelihood for quality assessment is often unreliable. To overcome this, they propose using quality evaluation metrics within a Gibbs distribution and employing the Metropolis-Hastings algorithm to sample multiple translations from high-quality regi...
Apr 24, 2025•10 min
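The sampling idea in the QUEST episode above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the candidate list, the `toy_scores` quality values, the temperature-like parameter `beta`, and the uniform proposal are all assumptions made for the example (QUEST itself proposes new translations from the translation model rather than from a fixed list).

```python
import math
import random

def metropolis_hastings(candidates, quality, beta=5.0, steps=5000, seed=0):
    """Sample from a Gibbs distribution pi(y) proportional to exp(beta * quality(y)).

    Uses a uniform (symmetric) proposal over a fixed candidate list, so the
    acceptance ratio reduces to pi(proposal) / pi(current).
    """
    rng = random.Random(seed)
    current = rng.choice(candidates)
    visit_counts = {c: 0 for c in candidates}
    for _ in range(steps):
        proposal = rng.choice(candidates)
        log_ratio = beta * (quality(proposal) - quality(current))
        # Accept better proposals always, worse ones with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        visit_counts[current] += 1
    return visit_counts

# Hypothetical quality scores standing in for a learned quality metric.
toy_scores = {"the cat sits": 0.9, "cat the sits": 0.2, "a cat is sitting": 0.8}
visit_counts = metropolis_hastings(list(toy_scores), toy_scores.get)
```

The chain spends most of its time on the two high-scoring candidates while still visiting both, which is the quality-plus-diversity behaviour the episode describes.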
This paper explores efficient ways to learn optimal decision-making policies from offline data by incorporating human preferences, addressing scenarios where direct interaction with the environment or a predefined reward function is impractical. It bridges the gap between offline reinforcement learning and preference-based reinforcement learning, focusing on minimizing the number of human queries needed. The authors propose a novel algorithm, Sim-OPRL, which leverages a learned environment mod...
Apr 24, 2025•17 min
This research paper investigates how to improve the reasoning capabilities of large language models (LLMs), specifically focusing on causal reasoning through counterfactual questions. The authors propose new metrics to better evaluate this reasoning ability and introduce fine-tuning methods that utilize counterfactual feedback to enhance it. Their work also categorizes different ways reasoning can generalize to new problems and evaluates the effectiveness of their fine-tuning approaches across...
Apr 24, 2025•21 min
This paper introduces Generative Active Task Elicitation (GATE), a new framework where language models interact with users through open-ended questions to understand their preferences for specific tasks. This method aims to overcome the challenges of specifying complex preferences using traditional prompts or examples. The authors demonstrate through experiments in content recommendation, moral reasoning, and email validation that GATE can elicit more informative preference specifications with ...
Apr 24, 2025•12 min
The paper introduces a novel approach called Sub-optimal Data Pre-training (SDP) to enhance the efficiency of human-in-the-loop reinforcement learning (RL). SDP utilizes readily available, low-quality data by assigning it minimal reward labels, allowing the reward model to learn basic distinctions before human feedback is even introduced. This pre-training aims to significantly reduce the amount of human interaction needed to train effective RL agents across various tasks. The authors present...
Apr 24, 2025•22 min
This paper introduces γ-Bench, a novel framework for evaluating the gaming ability of large language models (LLMs) in complex, multi-agent environments. It includes eight classical game theory scenarios with dynamic scoring and parameters to assess LLMs' robustness, generalizability, and strategic thinking. The study evaluates thirteen LLMs from six model families, revealing that Gemini-1.5-Pro currently achieves the top performance. The research also explores the impact of prompt engineering...
Apr 24, 2025•24 min
This paper introduces tool learning, where large language models utilize external tools to enhance their capabilities in complex tasks. A key challenge in this area is the quality of tool documentation, which often suffers from incompleteness, redundancy, or inaccuracies. To address this, the paper proposes DRAFT, a self-driven iterative framework that enables LLMs to automatically improve tool documentation through exploration and feedback. This framework includes experience gathering, lear...
Apr 24, 2025•13 min
This paper examines how AI can best assist human experts in decision-making by moving beyond single predictions to providing sets of likely possibilities. It highlights the limitations of traditional approaches focused on AI transparency and the potential of prediction sets to enhance human-AI accuracy. The research introduces the concept of optimal prediction sets tailored to human error patterns, demonstrating that statistically guaranteed sets (like those from conformal prediction) are not al...
Apr 24, 2025•15 min
This paper explores methods for enhancing the self-correction abilities of large language models (LLMs), which is currently a challenging area. The authors introduce SCoRe, a novel multi-turn reinforcement learning approach that trains a single LLM to identify and rectify its own errors using only self-generated data. This method addresses limitations of prior techniques, such as reliance on multiple models or external supervision, and tackles issues like distribution mismatch and behavioral col...
Apr 24, 2025•13 min
This research addresses the difficulty of computing stable outcomes in multi-agent reinforcement learning by incorporating principles from behavioral economics. The authors introduce risk aversion and bounded rationality into game theory, leading to a new solution concept called risk-averse quantal response equilibrium (RQE). They demonstrate that RQE can be computationally tractable in various game settings, unlike traditional Nash equilibria, and that this approach aligns with observed human b...
Apr 24, 2025•18 min
This research introduces a new method called Cascaded Selective Evaluation to improve the reliability of using large language models (LLMs) as judges for evaluating text generation. This approach uses a confidence estimation technique called Simulated Annotators to determine when an LLM's judgment is likely to align with human preferences. By selectively trusting LLMs based on their confidence and escalating to stronger models only when needed, the framework provides a provable guarantee of huma...
Apr 24, 2025•11 min
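The escalation logic described in the Cascaded Selective Evaluation episode above can be sketched as follows. This is a hedged illustration only: the judge functions, their confidence values, and the thresholds are hypothetical placeholders, and returning `None` when no judge is confident enough is an assumption of this sketch. In particular, the fixed thresholds here carry no statistical guarantee; the guarantee mentioned in the episode depends on how thresholds are chosen, which is not shown.

```python
from typing import Callable, List, Optional, Tuple

# A judge maps a pair of responses to (preferred_label, confidence in [0, 1]).
Judge = Callable[[str, str], Tuple[str, float]]

def cascaded_judgment(
    response_a: str,
    response_b: str,
    cascade: List[Tuple[Judge, float]],
) -> Optional[str]:
    """Try judges from weakest to strongest; trust the first confident one."""
    for judge, threshold in cascade:
        preference, confidence = judge(response_a, response_b)
        if confidence >= threshold:
            return preference
    return None  # no judge was confident enough (assumed fallback)

# Hypothetical stand-ins for a cheap judge and a stronger, costlier judge.
def weak_judge(a: str, b: str) -> Tuple[str, float]:
    return ("A", 0.55)

def strong_judge(a: str, b: str) -> Tuple[str, float]:
    return ("B", 0.95)

verdict = cascaded_judgment("resp 1", "resp 2", [(weak_judge, 0.8), (strong_judge, 0.9)])
no_verdict = cascaded_judgment("resp 1", "resp 2", [(weak_judge, 0.8)])
```

Ordering the cascade from cheapest to most expensive means the stronger judge is only called when the weaker one falls below its threshold.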
This ICLR 2025 (Oral) paper introduces Iterative Nash Policy Optimization (INPO), a novel online algorithm for aligning large language models with general human preferences, moving beyond the limitations of traditional reward-based Reinforcement Learning from Human Feedback (RLHF) methods that assume the Bradley-Terry model. INPO adopts a game-theoretic perspective, framing preference learning as a two-player game where the policy iteratively plays against itself using no-regret learning to appro...
Apr 24, 2025•20 min
"SycEval: Evaluating LLM Sycophancy" introduces a framework to assess the tendency of large language models to prioritize user agreement over factual accuracy, a behavior termed sycophancy. The study evaluated ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro using mathematics and medical advice datasets, finding that sycophantic responses were prevalent. The research further categorized this behavior into progressive sycophancy (leading to correct answers) and regressive sycophancy (leading to in...
Apr 23, 2025•16 min
In VC-pitch style, we investigate Stack AI, a company offering a no-code/low-code platform designed to simplify the development, deployment, and management of enterprise AI applications. The document details the problem of slow and complex AI adoption within businesses due to talent scarcity, cost, integration issues, and security concerns, and then presents Stack AI's solution as a user-friendly platform with robust security and flexible deployment options. It further includes a market analysis ...
Apr 22, 2025•23 min
We examine the crucial role of evaluation and benchmarks in today's genAI-based recommender systems.
Apr 22, 2025•30 min
We dive deep into the white paper "AI in the Enterprise" from OpenAI. This paper outlines seven key lessons for businesses adopting artificial intelligence. Drawing from the experiences of pioneering companies like Morgan Stanley, Indeed, Klarna, and Lowe's, the text highlights the importance of starting with evaluations, embedding AI into products, investing early, and customizing models. Furthermore, it emphasizes empowering domain experts with AI tools, streamlining developer workflows, an...
Apr 22, 2025•45 min
We discuss Nathan Lambert's recent post on the paper "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?". This paper critically examines the impact of Reinforcement Learning with Verifiable Rewards (RLVR) on the reasoning capabilities of Large Language Models (LLMs) in tasks like math and coding. Surprisingly, the authors found that while RLVR improves the efficiency of sampling correct answers, it does not actually introduce new reasoning abiliti...
Apr 21, 2025•21 min
We explore research ideas focused on understanding human preferences within AI agent ecosystems that utilize standardized protocols like MCP and A2A. It explores three interconnected approaches: dynamically eliciting user preferences during task execution leveraging these protocols, eliciting user preferences regarding the agents' interaction styles when using these protocols, and inferring users' latent preferences from the interaction logs generated by protocol use. The research intends to use...
Apr 21, 2025•15 min
We discuss Cross-Environment Cooperation (CEC), a novel reinforcement learning paradigm for training agents capable of zero-shot multi-agent coordination (ZSC). Unlike prior work focusing on single-task training or enhancing partner diversity, CEC trains a single agent through self-play across a vast distribution of procedurally generated environments. The authors demonstrate that this approach fosters the development of general cooperative skills and norms, enabling effective collaboration ...
Apr 20, 2025•18 min
Along with Sutton's recent conversation with Hannah Fry, we review Sutton and Silver's recent paper on the Era of Experience in AI, contrasting it with the current Era of Human Data where models learn primarily by imitating human-generated content. The authors argue that progress based solely on human data is plateauing, necessitating a shift towards agents learning from their own interactions with the environment. This new era will involve agents experiencing continuous streams of data, acting aut...
Apr 19, 2025•32 min
This research paper introduces QALIGN, a novel test-time method to enhance language model outputs by sampling from a more optimal distribution without requiring model retraining or even access to internal model details. Existing test-time compute methods that rely on reward models for selection can degrade with increased computation due to over-optimization of these imperfect proxies. QALIGN, leveraging Markov chain Monte Carlo techniques, refines outputs on a per-prompt basis as more computat...
Apr 19, 2025•16 min
We examine the claim that current AI agent advancements mirror pivotal historical technology shifts by analyzing Jeff Bezos's API mandate, Jeffrey Snover's Monad Manifesto, and Bill Gates's "Internet Tidal Wave" memo as benchmarks (Sergey Shchegrikovich's LinkedIn post). It then assesses the present state of AI agents, their limitations, and emerging standardization protocols like MCP and A2A. The analysis compares the maturity, ecosystem, strategic drivers, and risks of AI agents to these his...
Apr 19, 2025•15 min
This paper investigates reinforcement learning methods for fine-tuning large language models on complex reasoning tasks, particularly mathematical problems. The authors analyze GRPO, a successful but poorly understood algorithm, and surprisingly find that a simpler rejection sampling method, RAFT, achieves comparable results by training only on positively rewarded samples. Their analysis reveals that GRPO's effectiveness stems mainly from discarding prompts with entirely incorrect responses, lea...
Apr 19, 2025•13 min
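The two data-selection rules contrasted in the GRPO/RAFT episode above can be illustrated with a small sketch. It assumes binary rewards (1 for a correct response, 0 otherwise) and uses made-up records; the function names and record layout are hypothetical, and no policy update is shown, only which samples each rule would keep.

```python
def keep_positive_samples(samples):
    """RAFT-style selection: keep only responses with a positive reward."""
    return [s for s in samples if s["reward"] > 0]

def drop_all_wrong_prompts(samples):
    """Keep every response for a prompt unless all its responses were wrong."""
    prompts_with_a_hit = {s["prompt"] for s in samples if s["reward"] > 0}
    return [s for s in samples if s["prompt"] in prompts_with_a_hit]

# Toy batch: prompt p1 has one correct response, prompt p2 has none.
toy_batch = [
    {"prompt": "p1", "response": "r1", "reward": 1},
    {"prompt": "p1", "response": "r2", "reward": 0},
    {"prompt": "p2", "response": "r3", "reward": 0},
    {"prompt": "p2", "response": "r4", "reward": 0},
]

positives_only = keep_positive_samples(toy_batch)
without_hopeless_prompts = drop_all_wrong_prompts(toy_batch)
```

On this toy batch the first rule keeps only `r1`, while the second keeps both `p1` responses (including the incorrect `r2`) and discards everything from `p2`.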
We present a comprehensive security assessment of the Model Context Protocol (MCP), a proposed standard for connecting AI systems to external resources. It highlights the potential benefits of MCP in simplifying AI integration but primarily focuses on significant security vulnerabilities in its current design and typical implementations. The assessment details weaknesses such as tool manipulation, inadequate authentication and authorization, tool shadowing, and a lack of user visibility, emphasizin...
Apr 19, 2025•19 min
Bespoke Labs explored using reinforcement learning (RL) to enhance AI agents' ability to use multiple tools in sequence for complex tasks. They found that RL offered a more scalable approach compared to manual prompt engineering or supervised finetuning, which are limited by human-generated data. Their experiments using the GRPO algorithm significantly improved a language model's tool use performance on a benchmark requiring multi-step operations. Notably, their agent learned to orchestrate tool...
Apr 19, 2025•15 min
Large language models (LLMs) possess cultural knowledge that is not always apparent in multilingual interactions. This research reveals an "explicit–implicit localization gap," where LLMs perform better on culturally specific tasks when explicitly prompted with cultural context compared to when only the language of the prompt suggests the culture. The study demonstrates that providing explicit cultural cues enhances localization but can reduce response diversity and increase stereotypes. Convers...
Apr 19, 2025•12 min
This research investigates the impact of data filtering and repetition on large language model training. The authors found that repeating aggressively filtered datasets for multiple epochs, with adjustments to the training process like weight decay, can surpass the performance of training on much larger, less filtered datasets for a single epoch. They also explored the significance of individual documents within datasets, demonstrating that manipulating the counts of specific documents based on ...
Apr 18, 2025•19 min
This paper investigates discrepancies in scaling laws for compute-optimal language models, particularly between Kaplan et al. and Hoffmann et al. The authors reproduce the Kaplan et al. law and identify key factors causing the divergence: the computational cost of the last layer, the length of the learning rate warmup, and the importance of scale-dependent optimizer tuning. After correcting for these elements, the study achieves strong agreement with the Hoffmann et al. scaling law, notably demo...
Apr 18, 2025•17 min