Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

TextArena

🤗 Upvotes: 21 | cs.CL, cs.AI, cs.LG, cs.MA Authors: Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan Title: TextArena Arxiv: http://arxiv.org/abs/2504.11442v1 Abstract: TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via ...

Apr 17, 2025•22 min•Ep. 681

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

🤗 Upvotes: 172 | cs.CV Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng De...

Apr 16, 2025•23 min•Ep. 680

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

🤗 Upvotes: 95 | cs.DC, cs.AI, 68T50, I.2.7; I.2.11 Authors: Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu Title: PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters Arxiv: http://arxiv.org/abs/2504.08791v1 Abstract: The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existi...

Apr 16, 2025•24 min•Ep. 679

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

🤗 Upvotes: 36 | cs.LG, cs.AI Authors: Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen Title: VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning Arxiv: http://arxiv.org/abs/2504.08837v1 Abstract: Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on ...

Apr 16, 2025•21 min•Ep. 678

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

🤗 Upvotes: 35 | cs.CV Authors: Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang Title: FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding Arxiv: http://arxiv.org/abs/2504.09925v1 Abstract: We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM ...

Apr 16, 2025•20 min•Ep. 677

Iterative Self-Training for Code Generation via Reinforced Re-Ranking

🤗 Upvotes: 29 | cs.CL, cs.IR, cs.SE Authors: Nikita Sorokin, Ivan Sedykh, Valentin Malykh Title: Iterative Self-Training for Code Generation via Reinforced Re-Ranking Arxiv: http://arxiv.org/abs/2504.09643v1 Abstract: Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions...

Apr 16, 2025•20 min•Ep. 676

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

🤗 Upvotes: 83 | cs.CV, cs.AI Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, M...

Apr 15, 2025•23 min•Ep. 675

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

🤗 Upvotes: 32 | cs.CV Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu Title: GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation Arxiv: http://arxiv.org/abs/2504.08736v1 Abstract: In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tok...

Apr 15, 2025•19 min•Ep. 674

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

🤗 Upvotes: 25 | cs.CV, cs.AI Authors: Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian Title: MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft Arxiv: http://arxiv.org/abs/2504.08388v1 Abstract: World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox g...

Apr 15, 2025•19 min•Ep. 673

Kimi-VL Technical Report

🤗 Upvotes: 71 | cs.CV Authors: Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, ...

Apr 12, 2025•23 min•Ep. 672

C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing

🤗 Upvotes: 37 | cs.LG Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou Title: C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing Arxiv: http://arxiv.org/abs/2504.07964v1 Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways: our study reveals that naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop ...

Apr 12, 2025•22 min•Ep. 671

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

🤗 Upvotes: 34 | cs.CV, cs.AI, cs.CL Authors: Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao Title: VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning Arxiv: http://arxiv.org/abs/2504.07956v1 Abstract: The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluati...

Apr 12, 2025•22 min•Ep. 670

DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning

🤗 Upvotes: 33 | cs.CL Authors: Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Siva Reddy Title: DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning Arxiv: http://arxiv.org/abs/2504.07128v1 Abstract: Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs...

Apr 12, 2025•26 min•Ep. 669

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

🤗 Upvotes: 33 | cs.CV Authors: Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng Title: VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Arxiv: http://arxiv.org/abs/2504.07960v1 Abstract: Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide ...

Apr 12, 2025•20 min•Ep. 668

MM-IFEngine: Towards Multimodal Instruction Following

🤗 Upvotes: 26 | cs.CV Authors: Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang Title: MM-IFEngine: Towards Multimodal Instruction Following Arxiv: http://arxiv.org/abs/2504.07957v1 Abstract: The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following train...

Apr 12, 2025•22 min•Ep. 667

HoloPart: Generative 3D Part Amodal Segmentation

🤗 Upvotes: 23 | cs.CV Authors: Yunhan Yang, Yuan-Chen Guo, Yukun Huang, Zi-Xin Zou, Zhipeng Yu, Yangguang Li, Yan-Pei Cao, Xihui Liu Title: HoloPart: Generative 3D Part Amodal Segmentation Arxiv: http://arxiv.org/abs/2504.07943v1 Abstract: 3D part amodal segmentation--decomposing a 3D shape into complete, semantically meaningful parts, even when occluded--is a challenging but crucial task for 3D content creation and understanding. Existing 3D part segmentation methods only identify visible surf...

Apr 12, 2025•23 min•Ep. 666

DDT: Decoupled Diffusion Transformer

🤗 Upvotes: 51 | cs.CV, cs.AI Authors: Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang Title: DDT: Decoupled Diffusion Transformer Arxiv: http://arxiv.org/abs/2504.05741v2 Abstract: Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical ...

Apr 11, 2025•20 min•Ep. 665

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

🤗 Upvotes: 43 | cs.CL Authors: Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge Titl...

Apr 11, 2025•21 min•Ep. 664

A Unified Agentic Framework for Evaluating Conditional Image Generation

🤗 Upvotes: 25 | cs.CV, cs.CL Authors: Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang Title: A Unified Agentic Framework for Evaluating Conditional Image Generation Arxiv: http://arxiv.org/abs/2504.07046v1 Abstract: Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics...

Apr 11, 2025•21 min•Ep. 663

Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

🤗 Upvotes: 24 | cs.AI, cs.CL, cs.LG Authors: Chenrui Fan, Ming Li, Lichao Sun, Tianyi Zhou Title: Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? Arxiv: http://arxiv.org/abs/2504.06514v1 Abstract: We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly ...

Apr 11, 2025•23 min•Ep. 662

OmniSVG: A Unified Scalable Vector Graphics Generation Model

🤗 Upvotes: 91 | cs.CV Authors: Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang Title: OmniSVG: A Unified Scalable Vector Graphics Generation Model Arxiv: http://arxiv.org/abs/2504.06263v1 Abstract: Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both desi...

Apr 10, 2025•22 min•Ep. 661

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

🤗 Upvotes: 73 | cs.LG, cs.CL Authors: Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh Title: Hogwild! Inference: Parallel LLM Generation via Concurrent Attention Arxiv: http://arxiv.org/abs/2504.06261v2 Abstract: Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long ...

Apr 10, 2025•24 min•Ep. 660

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

🤗 Upvotes: 62 | cs.CV, cs.CL Authors: Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou Title: Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought Arxiv: http://arxiv.org/abs/2504.05599v1 Abstract: We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal tr...

Apr 10, 2025•23 min•Ep. 659

An Empirical Study of GPT-4o Image Generation Capabilities

🤗 Upvotes: 50 | cs.CV Authors: Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi Title: An Empirical Study of GPT-4o Image Generation Capabilities Arxiv: http://arxiv.org/abs/2504.05979v1 Abstract: The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to uni...

Apr 10, 2025•22 min•Ep. 658

COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

🤗 Upvotes: 36 | cs.CL Authors: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin Title: COIG-P: A High-Quality and Large-Scale Chinese Prefe...

Apr 10, 2025•22 min•Ep. 657

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

🤗 Upvotes: 27 | cs.CV, cs.LG Authors: Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He Title: Less-to-More Generalization: Unlocking More Controllability by In-Context Generation Arxiv: http://arxiv.org/abs/2504.02160v1 Abstract: Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject dat...

Apr 10, 2025•21 min•Ep. 656

SmolVLM: Redefining small and efficient multimodal models

🤗 Upvotes: 96 | cs.AI, cs.CV Authors: Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf Title: SmolVLM: Redefining small and efficient multimodal models Arxiv: http://arxiv.org/abs/2504.05299v1 Abstract: Large Vision-Language Models (VLMs) deliver exceptional performance but require signific...

Apr 09, 2025•26 min•Ep. 655

One-Minute Video Generation with Test-Time Training

🤗 Upvotes: 61 | cs.CV Authors: Karan Dalal, Daniel Koceja, Gashon Hussein, Jiarui Xu, Yue Zhao, Youjin Song, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang Title: One-Minute Video Generation with Test-Time Training Arxiv: http://arxiv.org/abs/2504.05298v1 Abstract: Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba ...

Apr 09, 2025•19 min•Ep. 654

Rethinking Reflection in Pre-Training

🤗 Upvotes: 52 | cs.CL, cs.AI Authors: Essential AI, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski Title: Rethinking Reflection ...

Apr 09, 2025•22 min•Ep. 653

URECA: Unique Region Caption Anything

🤗 Upvotes: 31 | cs.CV, cs.AI Authors: Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim Title: URECA: Unique Region Caption Anything Arxiv: http://arxiv.org/abs/2504.05305v1 Abstract: Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multi-granularity, limiting their real-world applicability. To address the need fo...

Apr 09, 2025•22 min•Ep. 652

Hosted on Transistor