Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators: Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/; Gengyu Wang, NLP, http://wanggengyu.com

Listen on: Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL; Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

🤗 Upvotes: 22 | cs.CL, cs.CV Authors: Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan Title: MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Arxiv: http://arxiv.org/abs/2502.10391v1 Abstract: Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not unde...

Feb 18, 2025•24 min•Ep. 561

ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation

🤗 Upvotes: 12 | cs.CV, cs.GR Authors: Rotem Shalev-Arkushin, Rinon Gal, Amit H. Bermano, Ohad Fried Title: ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation Arxiv: http://arxiv.org/abs/2502.09411v1 Abstract: Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the usage of Retrieval-Augmented Generation (RAG) with image generation models. We propose Image...

Feb 18, 2025•23 min•Ep. 560

Diverse Inference and Verification for Advanced Reasoning

🤗 Upvotes: 11 | cs.AI Authors: Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell Title: Diverse Inference and Verification for Advanced Reasoning Arxiv: http://arxiv.org/abs/2502.09955v1 Abstract: Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet find challenging advanced tasks such as Internati...

Feb 18, 2025•23 min•Ep. 559

Precise Parameter Localization for Textual Generation in Diffusion Models

🤗 Upvotes: 10 | cs.CV Authors: Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic Title: Precise Parameter Localization for Textual Generation in Diffusion Models Arxiv: http://arxiv.org/abs/2502.09935v1 Abstract: Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that only less than 1% of diffusion models' parameters, all contained in attention layer...

Feb 18, 2025•22 min•Ep. 558

DarwinLM: Evolutionary Structured Pruning of Large Language Models

🤗 Upvotes: 9 | cs.LG, cs.CL Authors: Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh Title: DarwinLM: Evolutionary Structured Pruning of Large Language Models Arxiv: http://arxiv.org/abs/2502.07780v1 Abstract: Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressin...

Feb 18, 2025•17 min•Ep. 557

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

🤗 Upvotes: 62 | cs.CL, cs.LG Authors: Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang Title: InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU Arxiv: http://arxiv.org/abs/2502.08910v1 Abstract: In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training seq...

Feb 15, 2025•21 min•Ep. 556

The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

🤗 Upvotes: 35 | cs.CL, cs.AI, cs.CV, cs.LG Authors: Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou Title: The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding Arxiv: http://arxiv.org/abs/2502.08946v1 Abstract: In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say?, which relates to the more familiar term Stochastic Parrot. To this end, we propose a summat...

Feb 15, 2025•22 min•Ep. 555

Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CV Authors: Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun Title: Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation Arxiv: http://arxiv.org/abs/2502.08690v1 Abstract: Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a ...

Feb 15, 2025•20 min•Ep. 554

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG Authors: Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih Title: SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models Arxiv: http://arxiv.org/abs/2502.09604v1 Abstract: We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generate...

Feb 15, 2025•22 min•Ep. 553

Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights

🤗 Upvotes: 21 | cs.LG, cs.CV Authors: Jonathan Kahana, Or Nathan, Eliahu Horwitz, Yedid Hoshen Title: Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights Arxiv: http://arxiv.org/abs/2502.09619v1 Abstract: With the increasing numbers of publicly available models, there are probably pretrained, online models for most tasks users require. However, current model search methods are rudimentary, essentially a text-based search in the documentation, thus users cannot find the relev...

Feb 15, 2025•20 min•Ep. 552

An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging

🤗 Upvotes: 21 | cs.CL, cs.AI Authors: Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai Title: An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging Arxiv: http://arxiv.org/abs/2502.09056v1 Abstract: This paper investigates data selection and model merging methodologies aimed at incorporating advanced reasoning capabilities such as those of DeepSeek R1 into language-specific large language models (LLMs), with a part...

Feb 15, 2025•25 min•Ep. 551

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

🤗 Upvotes: 20 | cs.AI, cs.CL, cs.CV Authors: Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang Title: EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents Arxiv: http://arxiv.org/abs/2502.09560v1 Abstract: Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tack...

Feb 15, 2025•21 min•Ep. 550

Exploring the Potential of Encoder-free Architectures in 3D LMMs

🤗 Upvotes: 17 | cs.CV, cs.AI, cs.CL Authors: Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao Title: Exploring the Potential of Encoder-free Architectures in 3D LMMs Arxiv: http://arxiv.org/abs/2502.09620v1 Abstract: Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we ...

Feb 15, 2025•18 min•Ep. 549

CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

🤗 Upvotes: 16 | cs.CL, cs.AI Authors: Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou Title: CoSER: Coordinating LLM-Based Persona Simulation of Established Roles Arxiv: http://arxiv.org/abs/2502.09082v1 Abstract: Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for...

Feb 15, 2025•23 min•Ep. 548

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

🤗 Upvotes: 15 | cs.CV, cs.AI Authors: Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, Yan-Pei Cao Title: TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models Arxiv: http://arxiv.org/abs/2502.06608v1 Abstract: Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and applicatio...

Feb 15, 2025•23 min•Ep. 547

Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

🤗 Upvotes: 40 | cs.CL Authors: Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Jimin Huang, Qianqian Xie Title: Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance Arxiv: http://arxiv.org/abs/2502.08127v1 Abstract: Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three comp...

Feb 14, 2025•23 min•Ep. 546

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

🤗 Upvotes: 35 | cs.CV Authors: Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li Title: TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Arxiv: http://arxiv.org/abs/2502.07870v1 Abstract: Text-conditioned image generation has gained significant attention in recent years and are processing increasingly longer and comprehensive text prompt. In everyday life, dense and intricat...

Feb 14, 2025•19 min•Ep. 545

BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

🤗 Upvotes: 35 | cs.CL Authors: Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan Title: BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models Arxiv: http://arxiv.org/abs/2502.07346v1 Abstract: Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models(LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measurin...

Feb 14, 2025•20 min•Ep. 544

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

🤗 Upvotes: 29 | cs.CV Authors: Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation Arxiv: http://arxiv.org/abs/2502.08639v1 Abstract: In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with comparable controllability as professional film directors: prec...

Feb 14, 2025•24 min•Ep. 543

Distillation Scaling Laws

🤗 Upvotes: 26 | cs.LG, cs.AI, cs.CL, stat.ML Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb Title: Distillation Scaling Laws Arxiv: http://arxiv.org/abs/2502.08606v1 Abstract: We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher a...

Feb 14, 2025•22 min•Ep. 542

TransMLA: Multi-Head Latent Attention Is All You Need

🤗 Upvotes: 25 | cs.LG, cs.AI Authors: Fanxu Meng, Zengwei Yao, Muhan Zhang Title: TransMLA: Multi-Head Latent Attention Is All You Need Arxiv: http://arxiv.org/abs/2502.07864v2 Abstract: Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be ca...

Feb 14, 2025•21 min•Ep. 541

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

🤗 Upvotes: 21 | cs.AI, cs.MA Authors: Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou Title: WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation Arxiv: http://arxiv.org/abs/2502.08047v1 Abstract: Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state-such as the target software not be...

Feb 14, 2025•20 min•Ep. 540

LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL Authors: Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng Title: LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid Arxiv: http://arxiv.org/abs/2502.07563v1 Abstract: Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-firs...

Feb 14, 2025•23 min•Ep. 539

Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning

🤗 Upvotes: 11 | cs.CL, cs.LG Authors: Jean Vassoyan, Nathanaël Beau, Roman Plaud Title: Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning Arxiv: http://arxiv.org/abs/2502.06533v1 Abstract: The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploratio...

Feb 14, 2025•21 min•Ep. 538

Expect the Unexpected: FailSafe Long Context QA for Finance

🤗 Upvotes: 105 | cs.CL Authors: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh Title: Expect the Unexpected: FailSafe Long Context QA for Finance Arxiv: http://arxiv.org/abs/2502.06329v1 Abstract: We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case ...

Feb 13, 2025•21 min•Ep. 537

Competitive Programming with Large Reasoning Models

🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CL Authors: OpenAI, Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou Title: Competitive Programming with Large Reasoning Models Arxiv: http://arxiv.org/abs/2502.0680...

Feb 13, 2025•21 min•Ep. 536

Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models

🤗 Upvotes: 25 | cs.CL Authors: Mengxi Xiao, Zihao Jiang, Lingfei Qian, Zhengyu Chen, Yueru He, Yijing Xu, Yuecheng Jiang, Dong Li, Ruey-Ling Weng, Min Peng, Jimin Huang, Sophia Ananiadou, Qianqian Xie Title: Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models Arxiv: http://arxiv.org/abs/2502.05878v2 Abstract: Stock movement prediction, a critical task in financial time-series forecasting, relies on identifying and retrieving key influencing factors from va...

Feb 13, 2025•22 min•Ep. 535

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

🤗 Upvotes: 23 | cs.CL, cs.AI Authors: Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He Title: CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction Arxiv: http://arxiv.org/abs/2502.07316v2 Abstract: Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented tr...

Feb 13, 2025•22 min•Ep. 534

Magic 1-For-1: Generating One Minute Video Clips within One Minute

🤗 Upvotes: 20 | cs.CV Authors: Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou Title: Magic 1-For-1: Generating One Minute Video Clips within One Minute Arxiv: http://arxiv.org/abs/2502.07701v1 Abstract: In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generat...

Feb 13, 2025•21 min•Ep. 533

LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!

🤗 Upvotes: 20 | cs.AI Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica Title: LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! Arxiv: http://arxiv.org/abs/2502.07374v1 Abstract: Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the trainin...

Feb 13, 2025•26 min•Ep. 532
