This research explores whether multi-agent systems (MAS) can be effectively replaced by a single agent simulating complex workflows through multi-turn conversations. The findings indicate that homogeneous workflows, where multiple agents use the same base model, can be replicated by one agent with significant computational efficiency gains via KV cache reuse. The authors introduce OneFlow, an automated algorithm that utilizes dual meta-LLMs and Monte Carlo Tree Search to design streamlined, hig...
Jan 24, 2026•17 min
This research explores Reinforcement Learning from Human Feedback (RLHF) under the KL-regularized contextual bandits framework. While traditional methods rely on complex optimistic or pessimistic estimates to manage uncertainty, the authors prove that greedy sampling—directly using empirical estimates—is surprisingly efficient. By leveraging the structural property that optimal policies remain within a bounded likelihood ratio of the reference policy, the study establishes logarithmic regret in ...
Jan 24, 2026•13 min
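To make the greedy-sampling idea in the entry above concrete, here is a minimal sketch. It assumes the standard closed form for a KL-regularized objective, where the policy is proportional to the reference policy times `exp(eta * reward)`. The function name `greedy_kl_policy`, the parameter `eta`, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List


def greedy_kl_policy(ref_probs: List[float], est_rewards: List[float], eta: float) -> List[float]:
    """Plug empirical reward estimates straight into the KL-regularized closed form.

    Assumed form: pi(a) is proportional to pi_ref(a) * exp(eta * r_hat(a)).
    No optimism or pessimism bonus is added, which is what "greedy" means here.
    """
    weights = [p * math.exp(eta * r) for p, r in zip(ref_probs, est_rewards)]
    z = sum(weights)  # normalizing constant
    return [w / z for w in weights]


# Toy example (hypothetical numbers): two actions, estimated rewards in [0, 1].
ref = [0.5, 0.5]
policy = greedy_kl_policy(ref, est_rewards=[0.0, 1.0], eta=1.0)

# With rewards in [0, 1], each likelihood ratio pi(a) / pi_ref(a) is at most exp(eta).
ratios = [p / q for p, q in zip(policy, ref)]
```

The ratio check at the end illustrates the structural property the summary mentions: with bounded rewards, the resulting policy cannot drift arbitrarily far from the reference policy.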
This research paper establishes a formal learning theoretic framework to analyze the performance of zero-shot prediction (ZSP) in multimodal models like CLIP. The authors decompose prediction error into three distinct components: prompt bias, which measures the suitability of a prompting strategy; residual dependence, which quantifies the information lost when using text as a proxy for image features; and estimation error from finite data. By avoiding common but unrealistic assumptions of condit...
Jan 24, 2026•15 min
This paper introduces TTT-Discover, an innovative system designed to solve complex science and engineering problems through test-time training. Unlike traditional static models, this approach enables an open-source AI to continuously learn and refine its policy while actively seeking solutions for a specific task. By utilizing an entropic objective and adaptive reinforcement learning, the system successfully established new state-of-the-art results in mathematics, GPU kernel engineering, and bio...
Jan 23, 2026•16 min
This paper explores how the statistical properties of pretraining data determine the success of in-context learning (ICL) in transformer models. By developing a theoretical framework that unifies task selection and generalization, the authors demonstrate that heavy-tailed pretraining distributions significantly enhance a model's robustness to distribution shifts. Conversely, while light-tailed distributions excel at familiar tasks, they require fewer examples to generalize effectively. The study...
Jan 23, 2026•19 min
This research paper addresses the challenge of anytime reasoning, where large language models (LLMs) must provide high-quality solutions under strict computational or token budgets. The authors introduce a novel evaluation metric called the Anytime Index, which measures how effectively a model’s solution quality improves as more reasoning tokens are generated. To enhance this efficiency, they propose Preference Data Prompting (PDP), an inference-time method where models learn from self-generated...
Jan 23, 2026•18 min
This paper proposes using promptable image embeddings guided by questions generated by an LLM, which help multimodal models focus on specific visual attributes. The authors also implement a linear approximation strategy to reduce the high computational costs associated with using multimodal large language models (MLLMs) for large-scale searches. Experimental results demonstrate that these techniques significantly improve retrieval precision on complex queries compared to traditional baseline methods. Ul...
Jan 20, 2026•14 min
This paper introduces Activation Reward Models (Activation RMs), a novel method for aligning Large Language Models (LLMs) and Multimodal Models with human preferences using minimal data. Unlike traditional reward models that require extensive fine-tuning, this approach utilizes activation steering to manipulate a model’s internal representations through just a few examples. By identifying and guiding specific attention heads, the system generates accurate reward signals and adapts rapidly to new...
Jan 20, 2026•16 min
Researchers have introduced In-Context Reinforcement Learning (ICRL), a novel prompting framework that enables large language models to self-improve during inference using only numerical scalar rewards. Unlike traditional methods that rely on verbal feedback or costly retraining, ICRL treats the model’s context window as a dynamic experience buffer, concatenating past attempts with their corresponding reward signals. As this context grows, the model demonstrates an emergent ability to optimize i...
Jan 19, 2026•11 min
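The experience-buffer idea in the entry above can be sketched as a simple loop that appends each attempt and its scalar reward to the prompt. This is only an illustration under stated assumptions: the prompt layout, the function names `build_icrl_prompt` and `icrl_loop`, and the `generate` and `reward_fn` callables are hypothetical stand-ins, not the paper's actual interface.

```python
from typing import Callable, List, Tuple


def build_icrl_prompt(task: str, history: List[Tuple[str, float]]) -> str:
    """Concatenate past attempts and their scalar rewards into one prompt."""
    lines = [f"Task: {task}"]
    for i, (attempt, reward) in enumerate(history, start=1):
        lines.append(f"Attempt {i}: {attempt}")
        lines.append(f"Reward {i}: {reward:.2f}")
    lines.append("Next attempt:")
    return "\n".join(lines)


def icrl_loop(
    task: str,
    generate: Callable[[str], str],     # hypothetical stand-in for an LLM call
    reward_fn: Callable[[str], float],  # hypothetical scalar reward function
    rounds: int,
) -> List[Tuple[str, float]]:
    """Run several rounds, growing the in-context buffer after every attempt."""
    history: List[Tuple[str, float]] = []
    for _ in range(rounds):
        prompt = build_icrl_prompt(task, history)
        attempt = generate(prompt)
        history.append((attempt, reward_fn(attempt)))
    return history
```

In practice `generate` would wrap a model call; here it is left as a parameter so the loop can be exercised with a stub.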
This research paper provides a theoretical and empirical comparison between Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). The authors identify a performance gap between the two methods caused by model mis-specification, where the intended reward or policy cannot be perfectly captured by the chosen model classes. Their analysis reveals that RLHF maintains a structural advantage when policy models are limited, whereas DPO performs better when reward mo...
Jan 19, 2026•14 min
This paper discusses a paradigm shift in multi-agent reinforcement learning, moving away from the labor-intensive process of manual reward engineering. Instead of hand-crafting complex numerical functions, researchers propose using large language models (LLMs) to translate natural language objectives into executable code. This approach addresses traditional bottlenecks like credit assignment and environmental non-stationarity by leveraging the semantic understanding and zero-shot generalization ...
Jan 18, 2026•18 min
This paper introduces Process Reward Learning (PRL), a novel reinforcement learning framework designed to enhance the reasoning capabilities of Large Language Models (LLMs). Unlike traditional methods that rely on sparse "outcome rewards" given only at the end of a task, PRL derives dense, step-by-step supervision signals from a mathematically rigorous decomposition of the global objective. This approach eliminates the need for computationally expensive tools like Monte Carlo Tree Search or sepa...
Jan 18, 2026•16 min
This paper provides a theoretical and empirical analysis of **on-policy preference learning**, a method used to align large language models with human values. The authors introduce the **coverage improvement principle**, demonstrating that updating a model using its own generated data—rather than static, offline datasets—creates a feedback loop that makes subsequent data increasingly informative. This process allows **on-policy Direct Preference Optimization (DPO)** to achieve **exponentially fa...
Jan 17, 2026•15 min
This research paper establishes a formal connection between singular learning theory (SLT) and deep reinforcement learning (RL) to explain how agents evolve during training. The authors introduce a generalized Bayesian framework and a complexity metric called the local learning coefficient (LLC) to analyze the geometry of an agent's policy. Their findings demonstrate that RL training is characterized by stagewise development, where models undergo sudden Bayesian phase transitions between differe...
Jan 16, 2026•13 min
The researchers introduce Engram, a novel conditional memory module that enhances Large Language Models by integrating a scalable lookup mechanism for static knowledge. While modern models rely on Mixture-of-Experts (MoE) for sparse computation, Engram uses N-gram embeddings to retrieve formulaic or factual information in constant time. This architectural shift creates a U-shaped scaling law that balances neural processing with static memory, allowing the model to offload simple retrieval tasks ...
Jan 16, 2026•14 min
This research explores how to model **"latent actions"** in unpredictable, real-world videos where specific movement commands are not pre-defined. The authors compare three primary methods for organizing these hidden actions: **sparsity-based constraints**, **noise addition**, and **discrete quantization**. By testing these techniques on diverse datasets like **YouTube** and **robotics footage**, the study examines how much information these models should capture to be effective. Results indicat...
Jan 16, 2026•14 min
This paper introduces a computational toolkit designed to correct errors in demand estimation when using unstructured data, such as images or text, to represent products. Because researchers often use machine learning embeddings as proxies for true product attributes, these approximations can introduce statistical bias that leads to inaccurate predictions of consumer behavior. The authors propose a bias-correction method and diagnostic tests to ensure that these proxies adequately capture the dime...
Jan 14, 2026•14 min
This paper introduces SPICE (Shaping Policies In-Context with Ensemble prior), a novel Bayesian In-Context Reinforcement Learning method for adapting quickly to new tasks without updating model parameters. Unlike existing models that rely on optimal data, SPICE utilizes a deep ensemble to learn a value prior from suboptimal trajectories and refines this prior at test-time through Bayesian updates. This approach effectively addresses the behavior-policy bias found in t...
Jan 14, 2026•12 min
This research explores Digital Red Queen (DRQ), a self-play algorithm that uses large language models to evolve assembly programs for the game Core War. In this competitive environment, digital "warriors" battle for control of a virtual machine by attempting to crash their opponents' processes. The DRQ framework moves beyond static optimization by forcing models to continually adapt against a growing history of previous champions, mimicking the evolutionary arms races found in biological systems...
Jan 14, 2026•14 min
The researchers introduce DroPE, a novel method for extending the context length of large language models by removing positional embeddings after pretraining. While explicit positional information like RoPE is essential for fast training convergence, it creates a "bottleneck" that prevents models from processing sequences longer than those seen during training. The authors demonstrate that these embeddings act as a temporary scaffold that can be discarded and replaced with a brief recalibration ...
Jan 13, 2026•12 min
This paper introduces representation-based exploration, a method designed to help language models discover novel behaviors rather than just refining existing ones through reinforcement learning. The researchers propose using elliptical bonuses derived from a model's internal hidden states to explicitly reward diversity and novelty during both inference and training. Their experiments demonstrate that this approach significantly improves verifier efficiency and pass@k rates across complex reasoni...
Jan 12, 2026•14 min
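As a rough illustration of the elliptical bonus mentioned in the entry above, the sketch below uses the common form `sqrt(h^T A^{-1} h)`, where `A` is a regularized sum of outer products of previously seen hidden-state vectors. The class name `EllipticalBonus`, the regularizer `lam`, and the choice of a Sherman-Morrison update are assumptions for illustration; the paper's exact formulation may differ.

```python
import math
from typing import List


class EllipticalBonus:
    """Novelty bonus sqrt(h^T A^{-1} h) with A = lam * I + sum of h h^T over seen vectors."""

    def __init__(self, dim: int, lam: float = 1.0) -> None:
        # Start from the inverse of lam * I.
        self.inv = [[(1.0 / lam) if i == j else 0.0 for j in range(dim)] for i in range(dim)]

    def _inv_times(self, h: List[float]) -> List[float]:
        return [sum(row[j] * h[j] for j in range(len(h))) for row in self.inv]

    def bonus(self, h: List[float]) -> float:
        """Large for directions rarely seen so far, small for familiar ones."""
        u = self._inv_times(h)
        return math.sqrt(sum(hi * ui for hi, ui in zip(h, u)))

    def update(self, h: List[float]) -> None:
        """Sherman-Morrison rank-one update of the stored inverse after observing h."""
        u = self._inv_times(h)
        denom = 1.0 + sum(hi * ui for hi, ui in zip(h, u))
        n = len(h)
        for i in range(n):
            for j in range(n):
                self.inv[i][j] -= u[i] * u[j] / denom


# Toy usage with 2-dimensional "hidden states" (hypothetical values).
eb = EllipticalBonus(dim=2, lam=1.0)
before = eb.bonus([1.0, 0.0])
eb.update([1.0, 0.0])
after_seen = eb.bonus([1.0, 0.0])    # shrinks once this direction has been observed
after_unseen = eb.bonus([0.0, 1.0])  # the unobserved direction keeps its full bonus
```

Tracking the inverse directly avoids re-inverting the matrix at every step, which matters when the hidden-state dimension is large.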
This paper introduces NextFlow, an advanced autoregressive model designed for high-quality image generation and editing. It utilizes a decoder-only Transformer architecture and a multi-scale training approach to enhance visual fidelity and reconstruction accuracy. To support this technology, the authors present EditCanvas, a comprehensive benchmark containing over 5,000 human-verified samples across 57 distinct tasks. This dataset evaluates diverse capabilities, ranging from traditional image mo...
Jan 10, 2026•15 min
This paper discusses **RelayLLM**, a framework designed to improve the efficiency of complex reasoning by enabling **token-level collaboration** between small and large language models. Unlike traditional routers that offload entire queries, the **Small Language Model (SLM)** serves as an active controller that generates a special command to "relay" specific, difficult reasoning steps to a **Large Language Model (LLM)**. The system is trained using a two-stage process involving a **supervised wa...
Jan 10, 2026•13 min
Researchers from institutions like Carnegie Mellon and Stanford propose a unified definition of hallucination in large language models by framing it as a failure of internal world modeling. Traditionally, the term referred to scattered issues like translation errors or unverified summaries, but this framework suggests that all hallucinations are simply mismatches between model outputs and a reference truth. By defining a Reference World Model, researchers can specify exactly what counts as "true...
Jan 08, 2026•12 min
This research introduces the concept of geometric memory to explain how deep sequence models store and reason over atomic facts. Unlike traditional associative memory, which functions as a simple lookup table for co-occurring entities, geometric memory synthesizes global relationships that enable models to solve complex multi-hop reasoning tasks. The authors demonstrate that models can learn to navigate large, unseen graphs by organizing node embeddings into a spatial geometry that reflects the ...
Jan 08, 2026•13 min
Modern AI theory often struggles to explain why certain datasets enable better out-of-distribution generalization than others, as classical information theory fails to account for computational constraints. This research introduces epiplexity, a new metric that quantifies the structural, learnable information an observer can extract within a limited time budget. Unlike standard entropy, which treats random noise and complex patterns similarly, epiplexity distinguishes reusable structure from inh...
Jan 08, 2026•14 min
This paper establishes a theoretical framework for diffusion language models (DLMs), positioning them as mathematically optimal parallel samplers compared to sequential autoregressive models. By using circuit complexity as a benchmark, the authors prove that DLMs can generate complex distributions in the minimum number of sequential steps when paired with chain-of-thought reasoning. The research highlights that advanced inference techniques like remasking and revision are essential for minimizin...
Jan 07, 2026•12 min
This paper introduces the Universal Reasoning Model (URM), a new architecture designed to solve highly complex logic puzzles like ARC-AGI and Sudoku. Researchers found that the success of Universal Transformers in reasoning tasks is driven by their recurrent inductive bias and non-linear depth, rather than overly complex designs. To build on this, the URM incorporates a ConvSwiGLU module to improve local token interactions and a truncated backpropagation method to stabilize training. These innov...
Jan 06, 2026•14 min
This paper introduces Recursive Language Models (RLMs), a novel inference strategy designed to overcome the limitations of context windows and the performance degradation of standard large language models. Unlike traditional approaches that feed long prompts directly into a neural network, an RLM treats the input as an external environment within a Python REPL. This allows the model to use code to programmatically examine, decompose, and filter massive datasets that would otherwise exceed its me...
Jan 06, 2026•16 min
This paper introduces a novel causal framework designed to improve machine learning generalization across different data domains. It specifically presents Circuit-TR and Circuit-AD, two algorithms that leverage causal transportability theory to enable zero-shot or few-shot learning by identifying shared "modules" or mechanisms between source and target environments. While traditional methods rely on statistical invariance, this research focuses on compositional structure, allowing the system to ...
Jan 04, 2026•15 min