We review the leaked Claude Sonnet 3.7 system instructions, which outline guidelines for citing information obtained from tools like web search and internal document searches, emphasizing the use of antml:cite tags for specific claims. The document also describes the criteria and formats for creating "artifacts," such as code, documents, and visualizations, for collaborative content creation, including how to read user-uploaded files and use specific libraries within the analysis tool. Furthermore, the document...
May 12, 2025•14 min
We describe a concept from a paper by Blackwell and Dubins concerning the merging of opinions or probability predictions between two individuals, Alex and Ben, as they observe increasing amounts of shared information. The central idea is that if their predictive models are updateable based on new evidence and they agree on what events are absolutely impossible, their predictions for future events will become increasingly similar over time, eventually converging. While their short-term predicti...
May 11, 2025•10 min
This paper presents HYRE, a method for quickly adapting large pretrained models to underspecified tasks like personalization or handling distribution shifts. It works by first training a single neural network that represents an ensemble of diverse models. At test time, using a small set of labeled examples from the target distribution, HYRE dynamically reweights the ensemble members based on their performance, selecting the combination of models best suited for the specific task without retrain...
May 11, 2025•21 min
This paper introduces Decomposed Reward Models (DRMs), a novel method for understanding and aligning large language models with the diverse nature of human preferences. Instead of relying on a single reward score, DRMs represent preferences as vectors and utilize Principal Component Analysis (PCA) to identify distinct directional preference components from readily available binary comparison data. This approach enables the extraction of interpretable preference dimensions, such as helpfulness,...
May 11, 2025•17 min
This paper introduces Active Statistical Inference, a novel approach for statistical inference that strategically utilizes a machine learning model to guide data collection under a labeling budget. By prioritizing the labeling of data points where the model is uncertain, this method aims to achieve more powerful inferences and smaller confidence intervals compared to traditional methods that collect data uniformly at random, even while using a black-box machine learning model and handling any...
May 10, 2025•16 min
This paper proposes a new method for optimizing the data mixtures used to train large language models (LLMs). Traditional approaches often rely on costly trial and error or deterministic extrapolations that don't account for uncertainty, limiting their effectiveness and transferability. The authors introduce a multi-fidelity multi-scale Bayesian optimization framework, treating data curation as a sequential decision-making process where decisions about data mixture, model scale, and training d...
May 10, 2025•13 min
This document introduces a novel Bayesian statistical inference method that leverages Generative Artificial Intelligence (GAI) predictions. Instead of relying solely on limited observed data or traditional statistical models, the authors propose using GAI to create synthetic data, which then informs a non-parametric prior distribution within a Bayesian framework. This approach, termed AI-Powered Bayesian Inference, allows for robust uncertainty quantification and improved predictive inference...
May 10, 2025•18 min
This document presents a new method called CONFIDENCE-DRIVEN INFERENCE designed to improve the efficiency and accuracy of data annotation for tasks commonly found in computational social science. The core idea is to strategically combine large language model (LLM) annotations with a limited number of human annotations, guided by the LLM's expressed confidence levels. By prioritizing human input on examples where the LLM is less certain, this approach aims to reduce the overall need for expens...
May 09, 2025•21 min
This paper explores a new method for statistical inference in the age of AI, focusing on how predictions from large pre-trained models can serve as efficient surrogates for costly or difficult-to-obtain outcomes. Drawing a connection to the established field of surrogate outcome models in biostatistics and economics, the authors propose recalibrated prediction-powered inference (RePPI). RePPI is presented as a more efficient approach than existing methods by learning an optimal "imputed loss" ...
May 09, 2025•20 min
This paper presents the Learn then Test (LTT) framework, a novel approach for calibrating machine learning models to provide explicit statistical guarantees on their predictions. The method works with any underlying model and data distribution without requiring retraining. LTT reframes the problem of controlling statistical errors, such as false discovery rate, intersection-over-union, and type-1 error, as a multiple hypothesis testing problem. By generating p-values for different model predict...
May 09, 2025•16 min
This paper introduces Preference Proxy Evaluations (PPE), a novel benchmark designed to evaluate reward models for Reinforcement Learning from Human Feedback (RLHF) in large language models (LLMs). Unlike expensive end-to-end RLHF training, PPE utilizes proxy tasks to predict downstream LLM performance. These tasks include analyzing human preferences from a large dataset and assessing verifiable correctness preferences. The authors correlate these proxy metrics with real-world post-RLHF outcom...
May 09, 2025•15 min
This survey explores the increasing use of Large Language Models (LLMs) as evaluators, termed "LLMs-as-judges," across various fields due to their effectiveness and adaptability. It examines this paradigm from multiple angles, including their functionality (why they are used), methodology (how to implement them, such as single or multi-LLM systems and human-AI collaboration), applications across diverse domains (from general tasks like translation to specialized areas like legal and medical), an...
May 09, 2025•27 min
This paper proposes the Alternative Annotator Test (alt-test), a novel statistical method for determining if a Large Language Model (LLM) can reliably substitute for human annotators in research tasks across various fields. The test involves comparing LLM annotations to those of a small group of human annotators on a subset of data to see if the LLM aligns better with the group than individual humans do. It also introduces the Average Advantage Probability, a measure for comparing the performa...
May 09, 2025•16 min
This paper examines the limitations of using large language models (LLMs) as judges for evaluating other models, particularly at the "evaluation frontier" where new models may be better than the judge. While using LLMs as judges is a promising approach for scalable evaluation due to the cost and bottleneck of human annotation, this method introduces biases that can distort model rankings. Researchers demonstrate that existing debiasing methods, even with a small set of high-quality labels, off...
May 09, 2025•12 min
This paper introduces Stratified Prediction-Powered Inference (StratPPI), a new method for improving the statistical evaluation of models, particularly Large Language Models (LLMs), which often face costly human annotation bottlenecks. Building on Prediction-Powered Inference (PPI), which combines small amounts of human-labeled data with larger, potentially biased automatic data, StratPPI utilizes data stratification strategies to significantly enhance the accuracy and confidence of model per...
May 09, 2025•13 min
This paper proposes Control Variates Evaluation, a method for efficiently evaluating large language models (LLMs) that reduces reliance on expensive human annotations. While synthetic feedback from other LLMs is cheaper, it introduces bias. This new approach combines human and synthetic feedback to achieve unbiased win-rate calculations with significantly fewer human annotations. Experiments demonstrate a considerable reduction in human annotations and show that fine-tuning synthetic evaluat...
May 09, 2025•21 min
This paper presents "Prediction-Powered Inference," a novel framework for conducting statistical inference while integrating predictions from machine learning systems with experimental data. The authors, including Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic, propose algorithms that calculate provably valid confidence intervals for various statistical measures, like means and regression coefficients, without making assumptions about the specific...
May 09, 2025•11 min
This paper presents Gradient Variance Minimization (GVM), a novel technique for optimizing Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs). The core idea is to dynamically allocate computational resources (sampling budget) across prompts based on their difficulty and gradient norms, aiming to minimize the variance of the stochastic gradient estimation. Unlike traditional methods that use uniform sampling, GVM-RAFT, an adaptation of the RAFT algorithm, employs a two-stage proces...
May 09, 2025•16 min
This academic paper proposes and evaluates Reasoning Reward Models (REASRMS), a novel approach to training large language models (LLMs) to align with human preferences. The core idea is to formulate reward modeling not just as assigning a score but as a reasoning task where the model generates explicit justifications and evaluation rubrics for its preference judgments. The authors introduce RM-R1, a family of REASRMS trained using a two-stage pipeline: distillation of high-quality reasoning ch...
May 09, 2025•20 min
This paper reexamines the traditional distinction between aleatoric and epistemic uncertainty in AI, arguing that this dichotomy is problematic and hinders practical application, especially with large language models. It presents conflicting definitions and empirical evidence suggesting these two types of uncertainty are intertwined rather than separate. The article advocates for a shift towards a more practical view of uncertainty based on identifying sources and defining uncertainty by the t...
May 08, 2025•17 min
This discussion from the Latent Space podcast with Cat Wu and Boris Cherny explores Claude Code, Anthropic's command-line interface tool for AI-assisted coding. They highlight Claude Code's Unix utility philosophy, prioritizing simplicity and composability for power users and automation workflows. The conversation touches on how Claude Code's design aligns with Anthropic's product principles, focusing on core model access rather than complex UI. They also address user experience consideration...
May 07, 2025•14 min
The sources propose a novel approach to artificial intelligence (AI) where agents achieve strategic competence, specifically reaching Nash equilibrium, without needing task-specific post-training or fine-tuning. This is hypothesized to occur through sophisticated reasoning abilities derived from extensive pre-training combined with Bayesian learning to adapt to new situations. This research aims to advance AI autonomy and generalization, offer greater efficiency in development, and potentially b...
May 07, 2025•28 min
We discuss a presentation on training language models (LMs) using distributed, or siloed, data, which is often proprietary and cannot be combined into a single dataset for joint training. The speaker highlights the importance of data for LM performance and the increasing trend of valuable data becoming proprietary, making traditional joint training approaches challenging. The presentation proposes a novel method, termed SILO Open LM, which adapts the Mixture-of-Experts (MoE) arc...
May 06, 2025•18 min
This paper introduces Advantage Alignment, a new family of algorithms designed to enhance the ability of artificial intelligence agents to navigate social dilemmas, situations where individual optimization leads to suboptimal collective outcomes. The research demonstrates that existing opponent shaping methods, like LOLA and LOQA, implicitly use Advantage Alignment. By aligning the "advantages" (benefits beyond the expected outcome) of competing agents and increasing the probability of mutuall...
May 06, 2025•16 min
We detail a presentation by Geoffrey Irving, Chief Scientist at the UK AI Safety Institute, discussing approaches to achieving asymptotic safety guarantees for AI. Irving critiques existing methods like scalable oversight (including techniques like debate), arguing that current theories and experiments suggest they will likely fail due to issues such as obfuscated arguments and exploration hacking. He proposes that while a full formal verification of neural networks is likely too difficul...
May 06, 2025•19 min
This paper challenges the traditional view that reward model accuracy is the sole determinant of success in Reinforcement Learning from Human Feedback (RLHF). It posits from an optimization perspective that while accuracy reflects alignment with ground truth, a critical factor often overlooked is reward variance, which influences the RLHF objective landscape. The authors demonstrate theoretically and empirically that low reward variance can lead to a flat optimization landscape, causing even h...
May 06, 2025•14 min
We summarize a presentation by Yoshua Bengio, a leading AI researcher, which addresses the urgent need for AI safety measures in light of rapid advancements, particularly the development of superintelligent agents with the capability and potential intent to cause catastrophic harm. Bengio argues that while capability will continue to grow, focusing on preventing undesirable intentions in AIs is crucial, proposing a non-agentic "scientist AI" that understands the world without having its own goals, w...
May 06, 2025•15 min
We introduce Sparse Shift Autoencoders (SSAEs), a novel method for learning to steer Large Language Models (LLMs) by manipulating their internal representations. Unlike traditional steering techniques that rely on expensive supervised data varying in single concepts, SSAEs are designed to learn from paired observations where multiple, unknown concepts change simultaneously. By mapping these embedding differences to sparse representations that correspond to individual concept shifts, SSAEs leve...
May 06, 2025•12 min
This academic paper, "You Are What You Eat - AI Alignment Requires Understanding How Data Shapes Structure and Generalisation," posits that achieving AI alignment necessitates understanding the relationship between the data distribution used for training and the resulting internal structure and generalization patterns of the model. The authors argue that traditional testing methods are insufficient because models with similar training performance can generalize differently based on their intern...
May 06, 2025•15 min
This paper, authored by researchers at Google DeepMind, investigates the impact of using large language models (LLMs) in various roles within information retrieval (IR) systems, specifically focusing on their use as rankers and judges for evaluating search results. The paper examines potential biases that can arise from LLMs interacting in these roles, including a bias observed in LLM judges favoring results from LLM rankers. Through experiments on standard IR datasets, the authors analyze the...
May 03, 2025•16 min