Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM


Episodes

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

🤗 Upvotes: 4 | cs.CV, cs.AI, cs.CL, cs.LG Authors: Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal Title: MMFactory: A Universal Solution Search Engine for Vision-Language Tasks Arxiv: http://arxiv.org/abs/2412.18072v1 Abstract: With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single ...

Dec 28, 2024•21 min•Ep. 291

Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation

🤗 Upvotes: 2 | cs.IR, cs.AI Authors: Yucong Luo, Qitao Qin, Hao Zhang, Mingyue Cheng, Ruiran Yan, Kefan Wang, Jie Ouyang Title: Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation Arxiv: http://arxiv.org/abs/2412.18176v1 Abstract: Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LL...

Dec 28, 2024•22 min•Ep. 290

DepthLab: From Partial to Complete

🤗 Upvotes: 21 | cs.CV Authors: Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo Title: DepthLab: From Partial to Complete Arxiv: http://arxiv.org/abs/2412.18153v1 Abstract: Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting ...

Dec 26, 2024•22 min•Ep. 289

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

🤗 Upvotes: 20 | cs.AI, cs.CL Authors: Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xue Kai Zhu, Bowen Zhou Title: Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization Arxiv: http://arxiv.org/abs/2412.17739v1 Abstract: Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within attentio...

Dec 26, 2024•22 min•Ep. 288

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

🤗 Upvotes: 10 | cs.CV, cs.AI, cs.MM Authors: Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue Title: DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation Arxiv: http://arxiv.org/abs/2412.18597v1 Abstract: Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, the current video generation ...

Dec 26, 2024•22 min•Ep. 287

In Case You Missed It: ARC 'Challenge' Is Not That Challenging

🤗 Upvotes: 8 | cs.CL, cs.AI Authors: Łukasz Borchmann Title: In Case You Missed It: ARC 'Challenge' Is Not That Challenging Arxiv: http://arxiv.org/abs/2412.17758v1 Abstract: ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet ...

Dec 26, 2024•24 min•Ep. 286

ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

🤗 Upvotes: 8 | cs.LG Authors: Ziteng Wang, Jianfei Chen, Jun Zhu Title: ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing Arxiv: http://arxiv.org/abs/2412.14711v1 Abstract: Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, ...

Dec 26, 2024•21 min•Ep. 285

SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval

🤗 Upvotes: 6 | cs.CL Authors: Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary Title: SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval Arxiv: http://arxiv.org/abs/2412.15443v1 Abstract: Retrieval-Augmented Generation (RAG) systems have become pivotal in leveraging vast corpora to generate informed and contextually relevant responses, notably reducing hallucinations in Large Language Models. Despite significant advancements, these sy...

Dec 26, 2024•22 min•Ep. 284

PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models

🤗 Upvotes: 5 | cs.CV Authors: Minghao Chen, Roman Shapovalov, Iro Laina, Tom Monnier, Jianyuan Wang, David Novotny, Andrea Vedaldi Title: PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models Arxiv: http://arxiv.org/abs/2412.18608v1 Abstract: Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mix...

Dec 26, 2024•26 min•Ep. 283

MotiF: Making Text Count in Image Animation with Motion Focal Loss

🤗 Upvotes: 3 | cs.CV, cs.AI Authors: Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin Title: MotiF: Making Text Count in Image Animation with Motion Focal Loss Arxiv: http://arxiv.org/abs/2412.16153v1 Abstract: Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, parti...

Dec 26, 2024•23 min•Ep. 282

Bridging the Data Provenance Gap Across Text, Speech and Video

🤗 Upvotes: 3 | cs.AI, cs.CL, cs.CY, cs.LG, cs.MM Authors: Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole,...

Dec 26, 2024•25 min•Ep. 281

RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

🤗 Upvotes: 64 | cs.CL, cs.AI Authors: Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, Ming Zhang Title: RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response Arxiv: http://arxiv.org/abs/2412.14922v1 Abstract: Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the collected data inevitably contains noise in practical applications,...

Dec 25, 2024•22 min•Ep. 280

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

🤗 Upvotes: 29 | cs.AI, cs.CL, cs.LG Authors: Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He Title: B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners Arxiv: http://arxiv.org/abs/2412.17256v1 Abstract: In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement -- where models are trained on their own outputs -- has emerged as a primary method for enhancing performance. However, the critical factors ...

Dec 25, 2024•21 min•Ep. 279

Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

🤗 Upvotes: 26 | cs.CV, cs.LG Authors: Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin Title: Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching Arxiv: http://arxiv.org/abs/2412.17153v2 Abstract: Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or ...

Dec 25, 2024•24 min•Ep. 278

Diving into Self-Evolving Training for Multimodal Reasoning

🤗 Upvotes: 23 | cs.CL, cs.AI, cs.CV, cs.LG Authors: Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He Title: Diving into Self-Evolving Training for Multimodal Reasoning Arxiv: http://arxiv.org/abs/2412.17451v1 Abstract: Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing re...

Dec 25, 2024•21 min•Ep. 277

Deliberation in Latent Space via Differentiable Cache Augmentation

🤗 Upvotes: 16 | cs.CL, cs.AI, cs.LG Authors: Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam Title: Deliberation in Latent Space via Differentiable Cache Augmentation Arxiv: http://arxiv.org/abs/2412.17747v1 Abstract: Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before resp...

Dec 25, 2024•22 min•Ep. 276

Large Motion Video Autoencoding with Cross-modal Video VAE

🤗 Upvotes: 15 | cs.CV Authors: Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen Title: Large Motion Video Autoencoding with Cross-modal Video VAE Arxiv: http://arxiv.org/abs/2412.17805v1 Abstract: Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compre...

Dec 25, 2024•25 min•Ep. 275

OpenAI o1 System Card

🤗 Upvotes: 12 | cs.AI Authors: OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Be...

Dec 25, 2024•25 min•Ep. 274

Revisiting In-Context Learning with Long Context Language Models

🤗 Upvotes: 12 | cs.CL, cs.AI, cs.LG Authors: Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob Oh, Siddharth Dalmia, Prateek Kolhar Title: Revisiting In-Context Learning with Long Context Language Models Arxiv: http://arxiv.org/abs/2412.16926v1 Abstract: In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making ex...

Dec 25, 2024•24 min•Ep. 273

Outcome-Refining Process Supervision for Code Generation

🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG, cs.SE Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang Title: Outcome-Refining Process Supervision for Code Generation Arxiv: http://arxiv.org/abs/2412.15118v1 Abstract: Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows pr...

Dec 25, 2024•21 min•Ep. 272

LearnLM: Improving Gemini for Learning

🤗 Upvotes: 9 | cs.CY, cs.AI, cs.LG Authors: LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strin...

Dec 25, 2024•27 min•Ep. 271

Parallelized Autoregressive Visual Generation

🤗 Upvotes: 34 | cs.CV Authors: Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu Title: Parallelized Autoregressive Visual Generation Arxiv: http://arxiv.org/abs/2412.15119v1 Abstract: Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized au...

Dec 24, 2024•23 min•Ep. 270

Offline Reinforcement Learning for LLM Multi-Step Reasoning

🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL Authors: Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu Title: Offline Reinforcement Learning for LLM Multi-Step Reasoning Arxiv: http://arxiv.org/abs/2412.16145v1 Abstract: Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with ...

Dec 24, 2024•21 min•Ep. 269

SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

🤗 Upvotes: 17 | cs.CL Authors: Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou Title: SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation Arxiv: http://arxiv.org/abs/2412.13649v1 Abstract: Key-Value (KV) cache has become a bottleneck of LLMs for long-context generation. Despite the numerous efforts in this area, the optimization for the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output ...

Dec 24, 2024•22 min•Ep. 268

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

🤗 Upvotes: 13 | cs.CV Authors: Songhua Liu, Zhenxiong Tan, Xinchao Wang Title: CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up Arxiv: http://arxiv.org/abs/2412.16112v1 Abstract: Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this iss...

Dec 24, 2024•25 min•Ep. 267

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

🤗 Upvotes: 12 | cs.CV, cs.LG, cs.SD, eess.AS Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji Title: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis Arxiv: http://arxiv.org/abs/2412.15322v1 Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on ...

Dec 24, 2024•23 min•Ep. 266

Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

🤗 Upvotes: 9 | cs.CV Authors: Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon Title: Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage Arxiv: http://arxiv.org/abs/2412.15484v1 Abstract: Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We...

Dec 24, 2024•28 min•Ep. 265

Sequence Matters: Harnessing Video Models in 3D Super-Resolution

🤗 Upvotes: 6 | cs.CV, 68U10, 68T10, I.4.5; I.2.10 Authors: Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, Eunbyung Park Title: Sequence Matters: Harnessing Video Models in 3D Super-Resolution Arxiv: http://arxiv.org/abs/2412.11525v3 Abstract: 3D super-resolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution ima...

Dec 24, 2024•22 min•Ep. 264

TRecViT: A Recurrent Video Transformer

🤗 Upvotes: 5 | cs.CV, cs.LG Authors: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu Title: TRecViT: A Recurrent Video Transformer Arxiv: http://arxiv.org/abs/2412.14294v1 Abstract: We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurre...

Dec 24, 2024•25 min•Ep. 263

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

🤗 Upvotes: 4 | cs.LG Authors: Zhen Zheng, Xiaonan Song, Chuanjie Liu Title: MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design Arxiv: http://arxiv.org/abs/2412.14590v1 Abstract: Quantization has become one of the most effective methodologies to compress LLMs into smaller size. However, the existing quantization solutions still show limitations of either non-negligible accuracy drop or system inefficiency. In this paper, we make a comp...

Dec 24, 2024•23 min•Ep. 262
