Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge

Best AI papers explained

Dec 11, 2025 • 12 min

Episode description

This episode covers a paper that proposes Distribution-Calibrated Aggregation, a scheme for improving the reliability of "Thinking-LLM-as-a-Judge" systems, which are often used to evaluate generative AI outputs. The core problem is that simple aggregation of multiple noisy judgments (e.g., majority vote) is suboptimal, especially when the judge is allowed to declare a tie. The proposed method uses Inference-Time Compute (ITC) to draw multiple independent samples from the judge, then models the three-way preference outcomes (A preferred, B preferred, or Tie) with a Bradley–Terry–Davidson formulation that accounts for both the margin of preference and the decisiveness of the vote (the non-tie rate). Experiments on machine translation and reward model benchmarks show that this distribution-aware aggregation consistently reduces Mean Absolute Error (MAE) and increases accuracy, frequently matching or exceeding individual human rater performance. The authors emphasize that this calibration step is crucial for turning stochastic individual LLM judgments into robust, accurate final ratings.
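To make the aggregation idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the `"A"`/`"B"`/`"tie"` vote labels, and the additive `smoothing` constant are all assumptions chosen for illustration. It uses the standard Davidson tie extension of the Bradley-Terry model, in which a log-strength gap `delta` and a tie propensity `nu` give P(A) ∝ exp(delta/2), P(B) ∝ exp(-delta/2), and P(tie) ∝ nu. For a single pair of outputs this model has a closed-form fit from the vote counts, shown alongside the two summary statistics the description mentions (margin and non-tie rate).

```python
import math
from collections import Counter


def majority_vote(votes):
    """Baseline: return the most frequent label (first-seen label wins count ties)."""
    return Counter(votes).most_common(1)[0][0]


def vote_features(votes):
    """Margin of preference and decisiveness (non-tie rate) of a vote sample."""
    counts = Counter(votes)
    n = len(votes)
    margin = (counts["A"] - counts["B"]) / n
    decisiveness = (counts["A"] + counts["B"]) / n
    return margin, decisiveness


def davidson_fit(votes, smoothing=0.5):
    """Closed-form Davidson fit for one pair from three-way vote counts.

    `smoothing` is a hypothetical additive constant that keeps the
    logarithm and the division finite when a label receives no votes.
    """
    counts = Counter(votes)
    n_a = counts["A"] + smoothing
    n_b = counts["B"] + smoothing
    n_t = counts["tie"] + smoothing
    delta = math.log(n_a / n_b)        # log-strength gap between A and B
    nu = n_t / math.sqrt(n_a * n_b)    # tie propensity
    return delta, nu


def davidson_probs(delta, nu):
    """Three-way outcome probabilities under the Davidson model."""
    a = math.exp(delta / 2)
    b = math.exp(-delta / 2)
    z = a + b + nu
    return {"A": a / z, "B": b / z, "tie": nu / z}


if __name__ == "__main__":
    # Ten hypothetical judge samples for one pair of outputs.
    votes = ["A"] * 4 + ["tie"] * 3 + ["B"] * 3
    print("majority vote:", majority_vote(votes))
    print("margin, decisiveness:", vote_features(votes))
    delta, nu = davidson_fit(votes)
    print("delta, nu:", delta, nu)
    print("model probabilities:", davidson_probs(delta, nu))
```

In this example the majority vote returns a hard "A" even though the sample is nearly split, while the margin, non-tie rate, and fitted `delta` and `nu` keep that uncertainty visible. How the paper maps such quantities to a final calibrated rating is not described in the episode summary, so that step is deliberately left out here.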
