Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Daily Paper Cast

Feb 21, 2025•21 min•Ep. 590

Episode description

🤗 Upvotes: 22 | cs.CL

Authors:
William Jurayj, Jeffrey Cheng, Benjamin Van Durme

Title:
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

arXiv:
http://arxiv.org/abs/2502.13962v1

Abstract:
Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
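
The thresholding idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Prediction` class, the `selective_score` and `coverage` functions, and the scoring rule (+1 for a correct answer, 0 for abstaining, minus `wrong_penalty` for a wrong answer) are all assumptions made for this example. A penalty of zero corresponds to the zero-risk setting the abstract mentions; a positive penalty models non-zero response risk.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Prediction:
    """Hypothetical record for one model response (names are illustrative)."""
    answer: str
    confidence: float  # assumed to lie in [0, 1]
    correct: bool


def coverage(preds: Sequence[Prediction], threshold: float) -> float:
    """Fraction of questions the system chooses to answer at this threshold."""
    if not preds:
        raise ValueError("preds must not be empty")
    answered = sum(1 for p in preds if p.confidence >= threshold)
    return answered / len(preds)


def selective_score(
    preds: Sequence[Prediction], threshold: float, wrong_penalty: float
) -> float:
    """Average utility under an assumed scoring rule:
    +1 for a correct answer, 0 for abstaining, -wrong_penalty for a wrong answer.
    """
    if not preds:
        raise ValueError("preds must not be empty")
    total = 0.0
    for p in preds:
        if p.confidence < threshold:
            continue  # abstain: contributes nothing to the score
        total += 1.0 if p.correct else -wrong_penalty
    return total / len(preds)
```

Under this assumed rule, raising `wrong_penalty` makes abstaining on low-confidence responses more attractive, so the best threshold depends on how costly a wrong answer is.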
