IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Best AI papers explained

May 31, 2025 • 23 min

Episode description

This paper describes IDA-Bench, a new benchmark for evaluating Large Language Models (LLMs) as interactive data analysis agents. Unlike existing benchmarks that focus on single-turn interactions, IDA-Bench assesses LLMs in multi-round dialogues with a simulated user, mirroring the iterative and subjective nature of real-world data analysis. Tasks are derived from complex Kaggle notebooks and presented as sequential natural language instructions. Initial results indicate that even advanced LLMs struggle with these multi-turn scenarios, highlighting the need to improve their instruction-following and reasoning capabilities for effective data analysis. The benchmark utilizes a sandbox environment for code execution and evaluates performance by comparing agent output to a human-derived baseline, with findings revealing different working styles and common failure modes among current LLM agents.
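The multi-round setup described above can be pictured as a simple loop: instructions arrive one at a time, the agent answers each with code, the code runs in a sandbox, and the final result is compared to a baseline. The sketch below is only an illustration under stated assumptions, not the benchmark's actual implementation. All names (`Task`, `run_episode`, `toy_agent`, `toy_execute`) are hypothetical, the "agent" is a fixed script standing in for an LLM, the "sandbox" is a plain `exec` call with no isolation, and the "higher score beats the baseline" comparison is assumed for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (role, message) pairs


@dataclass
class Task:
    instructions: List[str]   # sequential natural-language instructions
    baseline_score: float     # stand-in for the human-derived baseline


def run_episode(
    task: Task,
    agent: Callable[[History], str],
    execute: Callable[[str, Dict], str],
    score: Callable[[Dict], float],
) -> Dict:
    """Feed instructions turn by turn; the agent sees the whole dialogue so far."""
    history: History = []
    namespace: Dict = {}  # state shared across turns, like notebook cells
    for instruction in task.instructions:
        history.append(("user", instruction))
        code = agent(history)
        output = execute(code, namespace)
        history.append(("agent", code))
        history.append(("sandbox", output))
    final = score(namespace)
    return {
        "score": final,
        "beats_baseline": final >= task.baseline_score,  # assumed: higher is better
        "turns": len(task.instructions),
    }


# Hypothetical stand-ins so the sketch runs without an LLM or a real sandbox.
SCRIPT = ["data = [1, 2, 3, 4]", "result = sum(data) / len(data)"]


def toy_agent(history: History) -> str:
    # Pick the scripted reply that matches the number of user turns so far.
    turn = sum(1 for role, _ in history if role == "user")
    return SCRIPT[turn - 1]


def toy_execute(code: str, namespace: Dict) -> str:
    # NOT a real sandbox: plain exec, used here only for illustration.
    try:
        exec(code, namespace)
        return "ok"
    except Exception as exc:
        return f"error: {exc!r}"


task = Task(instructions=["Load the data.", "Compute the mean."], baseline_score=2.0)
report = run_episode(task, toy_agent, toy_execute, lambda ns: ns.get("result", 0.0))
print(report)
```

Keeping one shared namespace across turns is the design choice that makes later instructions depend on earlier ones, which is what distinguishes this loop from a single-turn evaluation.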
