In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Dec 26, 2024 • 24 min • Ep. 286

Episode description

🤗 Upvotes: 8 | cs.CL, cs.AI

Authors:
Łukasz Borchmann

Title:
In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Arxiv:
http://arxiv.org/abs/2412.17758v1

Abstract:
ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
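
The contrast the abstract draws can be made concrete with a small prompt-building sketch. It is an illustration under stated assumptions, not the paper's code: the function names, the prompt templates, and the example question are all hypothetical. The first function shows each answer to the model in isolation; the second lists every option in one prompt so they can be compared directly.

```python
from string import ascii_uppercase


def build_separate_prompts(question, options):
    """One prompt per option: each candidate answer is scored in isolation.

    Hypothetical template; the model never sees the competing options.
    """
    return [f"Question: {question}\nAnswer: {opt}" for opt in options]


def build_joint_prompt(question, options):
    """A single prompt listing every option, allowing direct comparison.

    Hypothetical template; the model is expected to reply with a letter.
    """
    lines = [f"Question: {question}"]
    for letter, opt in zip(ascii_uppercase, options):
        lines.append(f"{letter}. {opt}")
    lines.append("Answer:")
    return "\n".join(lines)


# Made-up example question, not taken from any benchmark.
question = "Which of these is the largest?"
options = ["a mouse", "a house", "a louse"]

separate = build_separate_prompts(question, options)
joint = build_joint_prompt(question, options)

for prompt in separate:
    print(prompt, end="\n\n")
print(joint)
```

In the separate setup, a question such as "which is the largest?" has to be answered without seeing the alternatives, which is the kind of handicap the abstract attributes to the evaluation scheme rather than to the task.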
