Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and the audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcasts: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover image by Kawen Kuang, https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Teaching Language Models to Critique via Reinforcement Learning

🤗 Upvotes: 16 | cs.LG, cs.AI, cs.CL Authors: Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong Title: Teaching Language Models to Critique via Reinforcement Learning Arxiv: http://arxiv.org/abs/2502.03492v1 Abstract: Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we ...

Feb 13, 2025•22 min•Ep. 531

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

🤗 Upvotes: 15 | cs.CV Authors: Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai Title: Scaling Pre-training to One Hundred Billion Data for Vision Language Models Arxiv: http://arxiv.org/abs/2502.07617v1 Abstract: We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification an...

Feb 13, 2025•23 min•Ep. 530

Enhance-A-Video: Better Generated Video for Free

🤗 Upvotes: 14 | cs.CV Authors: Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You Title: Enhance-A-Video: Better Generated Video for Free Arxiv: http://arxiv.org/abs/2502.07508v1 Abstract: DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named...

Feb 13, 2025•21 min•Ep. 529

Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

🤗 Upvotes: 71 | cs.CL Authors: Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou Title: Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling Arxiv: http://arxiv.org/abs/2502.06703v1 Abstract: Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Pr...

Feb 12, 2025•23 min•Ep. 528

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

🤗 Upvotes: 71 | cs.CL Authors: Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko Title: SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators Arxiv: http://arxiv.org/abs/2502.06394v1 Abstract: Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce Sy...

Feb 12, 2025•22 min•Ep. 527

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

🤗 Upvotes: 36 | cs.CL, cs.LG Authors: Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen Title: Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning Arxiv: http://arxiv.org/abs/2502.06781v1 Abstract: Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligen...

Feb 12, 2025•23 min•Ep. 526

Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning

🤗 Upvotes: 22 | cs.AI, cs.CL, cs.LG, cs.MA Authors: Bidipta Sarkar, Warren Xia, C. Karen Liu, Dorsa Sadigh Title: Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning Arxiv: http://arxiv.org/abs/2502.06060v1 Abstract: Communicating in natural language is a powerful tool in multi-agent settings, as it enables independent agents to share information in partially observable settings and allows zero-shot coordination with humans. However, most prior works are limite...

Feb 12, 2025•22 min•Ep. 525

CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging

🤗 Upvotes: 17 | cs.CL, cs.AI Authors: Md. Ashraful Islam, Mohammed Eunus Ali, Md Rizwan Parvez Title: CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging Arxiv: http://arxiv.org/abs/2502.05664v1 Abstract: Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse pro...

Feb 12, 2025•21 min•Ep. 524

LM2: Large Memory Models

🤗 Upvotes: 16 | cs.CL, cs.AI Authors: Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis Title: LM2: Large Memory Models Arxiv: http://arxiv.org/abs/2502.06049v1 Abstract: This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesiz...

Feb 12, 2025•26 min•Ep. 523

Matryoshka Quantization

🤗 Upvotes: 13 | cs.LG, cs.AI Authors: Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati Title: Matryoshka Quantization Arxiv: http://arxiv.org/abs/2502.06786v1 Abstract: Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practi...

Feb 12, 2025•23 min•Ep. 522

Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation

🤗 Upvotes: 13 | cs.CV, cs.AI Authors: Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng Title: Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation Arxiv: http://arxiv.org/abs/2502.05415v1 Abstract: There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The ...

Feb 12, 2025•19 min•Ep. 521

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

🤗 Upvotes: 12 | cs.CL Authors: Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon Title: Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding Arxiv: http://arxiv.org/abs/2502.05609v1 Abstract: Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Specul...

Feb 12, 2025•21 min•Ep. 520

ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

🤗 Upvotes: 11 | cs.CL Authors: Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang Title: ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates Arxiv: http://arxiv.org/abs/2502.06772v1 Abstract: We present that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and intr...

Feb 12, 2025•21 min•Ep. 519

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

🤗 Upvotes: 52 | cs.CV Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin Title: VideoRoPE: What Makes for Good Video Rotary Position Embedding? Arxiv: http://arxiv.org/abs/2502.05173v1 Abstract: While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains...

Feb 11, 2025•20 min•Ep. 518

Fast Video Generation with Sliding Tile Attention

🤗 Upvotes: 39 | cs.CV Authors: Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang Title: Fast Video Generation with Sliding Tile Attention Arxiv: http://arxiv.org/abs/2502.04507v1 Abstract: Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper intr...

Feb 11, 2025•21 min•Ep. 517

Goku: Flow Based Video Generative Foundation Models

🤗 Upvotes: 39 | cs.CV Authors: Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu Title: Goku: Flow Based Video Generative Foundation Models Arxiv: http://arxiv.org/abs/2502.04896v2 Abstract: This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models ...

Feb 11, 2025•22 min•Ep. 516

QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

🤗 Upvotes: 32 | cs.LG Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh Title: QuEST: Stable Training of LLMs with 1-Bit Weights and Activations Arxiv: http://arxiv.org/abs/2502.05003v1 Abstract: One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accura...

Feb 11, 2025•23 min•Ep. 515

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

🤗 Upvotes: 30 | cs.LG, cs.CL Authors: Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein Title: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach Arxiv: http://arxiv.org/abs/2502.05171v1 Abstract: We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent bl...

Feb 11, 2025•23 min•Ep. 514

AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting

🤗 Upvotes: 23 | cs.CV Authors: Chung-Ho Wu, Yang-Jung Chen, Ying-Huan Chen, Jie-Ying Lee, Bo-Hsu Ke, Chun-Wei Tuan Mu, Yi-Chuan Huang, Chin-Yang Lin, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu Title: AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting Arxiv: http://arxiv.org/abs/2502.05176v1 Abstract: Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with v...

Feb 11, 2025•18 min•Ep. 513

DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

🤗 Upvotes: 18 | cs.CL, cs.LG Authors: Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li Title: DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails Arxiv: http://arxiv.org/abs/2502.05163v1 Abstract: The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplor...

Feb 11, 2025•19 min•Ep. 512

Agency Is Frame-Dependent

🤗 Upvotes: 15 | cs.AI Authors: David Abel, André Barreto, Michael Bowling, Will Dabney, Shi Dong, Steven Hansen, Anna Harutyunyan, Khimya Khetarpal, Clare Lyle, Razvan Pascanu, Georgios Piliouras, Doina Precup, Jonathan Richens, Mark Rowland, Tom Schaul, Satinder Singh Title: Agency Is Frame-Dependent Arxiv: http://arxiv.org/abs/2502.04403v1 Abstract: Agency is a system's capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science, and...

Feb 11, 2025•23 min•Ep. 511

FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

🤗 Upvotes: 14 | cs.CV Authors: Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo Title: FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation Arxiv: http://arxiv.org/abs/2502.05179v1 Abstract: DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, howe...

Feb 11, 2025•23 min•Ep. 510

Generating Symbolic World Models via Test-time Scaling of Large Language Models

🤗 Upvotes: 13 | cs.AI Authors: Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu Title: Generating Symbolic World Models via Test-time Scaling of Large Language Models Arxiv: http://arxiv.org/abs/2502.04728v1 Abstract: Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality -- a task hindered by the inherent ambiguity of na...

Feb 11, 2025•21 min•Ep. 509

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

🤗 Upvotes: 41 | cs.LG, cs.CL Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov Title: Analyze Feature Flow to Enhance Interpretation and Steering in Language Models Arxiv: http://arxiv.org/abs/2502.03032v2 Abstract: We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique...

Feb 08, 2025•24 min•Ep. 508

UltraIF: Advancing Instruction Following from the Wild

🤗 Upvotes: 15 | cs.CL, cs.AI Authors: Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang Title: UltraIF: Advancing Instruction Following from the Wild Arxiv: http://arxiv.org/abs/2502.04153v1 Abstract: Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, for that there are huge gaps between models trained by open-source community and those trained by leading comp...

Feb 08, 2025•20 min•Ep. 507

Great Models Think Alike and this Undermines AI Oversight

🤗 Upvotes: 14 | cs.LG, cs.AI, cs.CL Authors: Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping Title: Great Models Think Alike and this Undermines AI Oversight Arxiv: http://arxiv.org/abs/2502.04313v1 Abstract: As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks...

Feb 08, 2025•30 min•Ep. 506

Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2

🤗 Upvotes: 14 | cs.AI, cs.LG Authors: Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Vikas Verma, Quoc V. Le, Thang Luong Title: Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2 Arxiv: http://arxiv.org/abs/2502.03544v1 Abstract: We present AlphaGeometry2, a significantly improved version of AlphaGeometry introduced in Trinh et al. (2024), which has now surpassed an average gold medalist in solving Olympiad ...

Feb 08, 2025•23 min•Ep. 505

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

🤗 Upvotes: 14 | cs.CV, cs.CL, cs.MM, cs.SD, eess.AS, eess.IV Authors: Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao Title: Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment Arxiv: http://arxiv.org/abs/2502.04328v1 Abstract: Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-s...

Feb 08, 2025•21 min•Ep. 504

MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

🤗 Upvotes: 13 | cs.CV Authors: Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh Title: MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm Arxiv: http://arxiv.org/abs/2502.02358v3 Abstract: Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some ef...

Feb 08, 2025•21 min•Ep. 503

MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

🤗 Upvotes: 13 | cs.CL Authors: Xintong Hao, Ke Shen, Chenggang Li Title: MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion Arxiv: http://arxiv.org/abs/2502.04235v1 Abstract: Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, the natural language data struggles to scale up. To tackle this bottlene...

Feb 08, 2025•23 min•Ep. 502

Hosted on Transistor
