Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome!

Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcasts: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover image by Kawen Kuang: https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

MoM: Linear Sequence Modeling with Mixture-of-Memories

🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG
Authors: Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng
Title: MoM: Linear Sequence Modeling with Mixture-of-Memories
Arxiv: http://arxiv.org/abs/2502.13685v1
Abstract: Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size m...

Feb 21, 2025•20 min•Ep. 591

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

🤗 Upvotes: 22 | cs.CL
Authors: William Jurayj, Jeffrey Cheng, Benjamin Van Durme
Title: Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
Arxiv: http://arxiv.org/abs/2502.13962v1
Abstract: Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. ...

Feb 21, 2025•21 min•Ep. 590

Craw4LLM: Efficient Web Crawling for LLM Pretraining

🤗 Upvotes: 21 | cs.CL
Authors: Shi Yu, Zhiyuan Liu, Chenyan Xiong
Title: Craw4LLM: Efficient Web Crawling for LLM Pretraining
Arxiv: http://arxiv.org/abs/2502.13347v1
Abstract: Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it...

Feb 21, 2025•23 min•Ep. 589

LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

🤗 Upvotes: 19 | cs.CL, cs.LG
Authors: Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
Title: LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
Arxiv: http://arxiv.org/abs/2502.13922v2
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alig...

Feb 21, 2025•25 min•Ep. 588

Small Models Struggle to Learn from Strong Reasoners

🤗 Upvotes: 17 | cs.AI
Authors: Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
Title: Small Models Struggle to Learn from Strong Reasoners
Arxiv: http://arxiv.org/abs/2502.12143v1
Abstract: Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability...

Feb 21, 2025•20 min•Ep. 587

Autellix: An Efficient Serving Engine for LLM Agents as General Programs

🤗 Upvotes: 15 | cs.LG, cs.AI, cs.DC
Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica
Title: Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Arxiv: http://arxiv.org/abs/2502.13965v1
Abstract: Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help ...

Feb 21, 2025•23 min•Ep. 586

SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?

🤗 Upvotes: 10 | cs.CL, cs.AI, cs.IR, cs.IT, math.IT
Authors: Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu
Title: SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?
Arxiv: http://arxiv.org/abs/2502.13233v1
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in general domains but often struggle with tasks requiring specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques ty...

Feb 21, 2025•22 min•Ep. 585

Soundwave: Less is More for Speech-Text Alignment in LLMs

🤗 Upvotes: 65 | cs.CL, cs.AI, cs.SD
Authors: Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
Title: Soundwave: Less is More for Speech-Text Alignment in LLMs
Arxiv: http://arxiv.org/abs/2502.12900v1
Abstract: Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap ...

Feb 20, 2025•22 min•Ep. 584

Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

🤗 Upvotes: 51 | cs.CL, cs.LG
Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
Title: Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
Arxiv: http://arxiv.org/abs/2502.13063v1
Abstract: A range of recent works addresses the problem of compression of sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches allow to reduce t...

Feb 20, 2025•23 min•Ep. 583

Continuous Diffusion Model for Language Modeling

🤗 Upvotes: 44 | cs.LG
Authors: Jaehyeong Jo, Sung Ju Hwang
Title: Continuous Diffusion Model for Language Modeling
Arxiv: http://arxiv.org/abs/2502.11564v1
Abstract: Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that directly work on discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous di...

Feb 20, 2025•19 min•Ep. 582

Phantom: Subject-consistent video generation via cross-modal alignment

🤗 Upvotes: 42 | cs.CV, cs.AI
Authors: Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu
Title: Phantom: Subject-consistent video generation via cross-modal alignment
Arxiv: http://arxiv.org/abs/2502.11079v1
Abstract: The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject el...

Feb 20, 2025•21 min•Ep. 581

Rethinking Diverse Human Preference Learning through Principal Component Analysis

🤗 Upvotes: 33 | cs.AI, cs.CL
Authors: Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
Title: Rethinking Diverse Human Preference Learning through Principal Component Analysis
Arxiv: http://arxiv.org/abs/2502.13131v1
Abstract: Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capt...

Feb 20, 2025•23 min•Ep. 580

Magma: A Foundation Model for Multimodal AI Agents

🤗 Upvotes: 30 | cs.CV, cs.AI, cs.HC, cs.LG, cs.RO
Authors: Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
Title: Magma: A Foundation Model for Multimodal AI Agents
Arxiv: http://arxiv.org/abs/2502.13130v1
Abstract: We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (...

Feb 20, 2025•23 min•Ep. 579

Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

🤗 Upvotes: 29 | cs.CV
Authors: Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang
Title: Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
Arxiv: http://arxiv.org/abs/2502.13145v1
Abstract: Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and ...

Feb 20, 2025•22 min•Ep. 578

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

🤗 Upvotes: 27 | cs.RO, cs.AI, cs.CV
Authors: Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
Title: SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Arxiv: http://arxiv.org/abs/2502.13143v1
Abstract: Spatial intelligence is a critical component of embodied AI, promoting robots to understand and int...

Feb 20, 2025•21 min•Ep. 577

SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

🤗 Upvotes: 26 | cs.CL
Authors: Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang
Title: SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Arxiv: http://arxiv.org/abs/2502.12464v1
Abstract: Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong perfo...

Feb 20, 2025•21 min•Ep. 576

You Do Not Fully Utilize Transformer's Representation Capacity

🤗 Upvotes: 25 | cs.LG, cs.CL
Authors: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
Title: You Do Not Fully Utilize Transformer's Representation Capacity
Arxiv: http://arxiv.org/abs/2502.09245v1
Abstract: In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show...

Feb 20, 2025•21 min•Ep. 575

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

🤗 Upvotes: 68 | cs.CL, cs.AI, cs.LG
Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Arxiv: http://arxiv.org/abs/2502.11089v1
Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention...

Feb 19, 2025•23 min•Ep. 574

Learning Getting-Up Policies for Real-World Humanoid Robots

🤗 Upvotes: 32 | cs.RO, cs.LG
Authors: Xialin He, Runpei Dong, Zixuan Chen, Saurabh Gupta
Title: Learning Getting-Up Policies for Real-World Humanoid Robots
Arxiv: http://arxiv.org/abs/2502.12152v1
Abstract: Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to op...

Feb 19, 2025•25 min•Ep. 573

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

🤗 Upvotes: 27 | cs.LG, cs.SE
Authors: Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke
Title: SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Arxiv: http://arxiv.org/abs/2502.12115v1
Abstract: We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from $50 bug fi...

Feb 19, 2025•22 min•Ep. 572

CRANE: Reasoning with constrained LLM generation

🤗 Upvotes: 17 | cs.PL, cs.LG
Authors: Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh
Title: CRANE: Reasoning with constrained LLM generation
Arxiv: http://arxiv.org/abs/2502.09061v1
Abstract: Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically obs...

Feb 19, 2025•21 min•Ep. 571

How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training

🤗 Upvotes: 16 | cs.LG, cs.AI, cs.CL, cs.CV, cs.HC
Authors: Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen
Title: How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Arxiv: http://arxiv.org/abs/2502.11196v1
Abstract: Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structu...

Feb 19, 2025•25 min•Ep. 570

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

🤗 Upvotes: 15 | cs.CV
Authors: Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
Title: HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
Arxiv: http://arxiv.org/abs/2502.12148v1
Abstract: The remarkable success of the autoregressive paradigm has made significant advancement in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understandin...

Feb 19, 2025•20 min•Ep. 569

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

🤗 Upvotes: 14 | cs.LG, cs.AI
Authors: Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu
Title: I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Arxiv: http://arxiv.org/abs/2502.10458v1
Abstract: This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vi...

Feb 19, 2025•22 min•Ep. 568

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

🤗 Upvotes: 11 | cs.LG, cs.CL
Authors: Bohan Lyu, Siqiao Huang, Zichen Liang
Title: SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Arxiv: http://arxiv.org/abs/2502.11167v1
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, to...

Feb 19, 2025•24 min•Ep. 567

Region-Adaptive Sampling for Diffusion Transformers

🤗 Upvotes: 46 | cs.CV, cs.AI
Authors: Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang
Title: Region-Adaptive Sampling for Diffusion Transformers
Arxiv: http://arxiv.org/abs/2502.10389v1
Abstract: Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on re...

Feb 18, 2025•23 min•Ep. 566

Large Language Diffusion Models

🤗 Upvotes: 44 | cs.CL, cs.LG
Authors: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Title: Large Language Diffusion Models
Arxiv: http://arxiv.org/abs/2502.09992v1
Abstract: Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LL...

Feb 18, 2025•19 min•Ep. 565

The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

🤗 Upvotes: 41 | cs.AI
Authors: Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez
Title: The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
Arxiv: http://arxiv.org/abs/2502.08235v1
Abstract: Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, bu...

Feb 18, 2025•27 min•Ep. 564

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

🤗 Upvotes: 38 | cs.CV, cs.CL
Authors: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Ni...

Feb 18, 2025•23 min•Ep. 563

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

🤗 Upvotes: 27 | cs.CV
Authors: Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, Kai Han, Samuel Albanie
Title: ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Arxiv: http://arxiv.org/abs/2502.09696v1...

Feb 18, 2025•23 min•Ep. 562

Hosted on Transistor