Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

One Diffusion to Generate Them All

🤗 Paper Upvotes: 13 | cs.CV, cs.AI Authors: Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu Title: One Diffusion to Generate Them All Arxiv: http://arxiv.org/abs/2411.16318v1 Abstract: We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, an...

Nov 27, 2024•23 min•Ep. 141

VisualLens: Personalization through Visual History

🤗 Paper Upvotes: 13 | cs.CV Authors: Wang Bill Zhu, Deqing Fu, Kai Sun, Yi Lu, Zhaojiang Lin, Seungwhan Moon, Kanika Narang, Mustafa Canim, Yue Liu, Anuj Kumar, Xin Luna Dong Title: VisualLens: Personalization through Visual History Arxiv: http://arxiv.org/abs/2411.16034v1 Abstract: We hypothesize that a user's visual history with images reflecting their daily life, offers valuable insights into their interests and preferences, and can be leveraged for personalization. Among the many challenges...

Nov 27, 2024•25 min•Ep. 140

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

🤗 Paper Upvotes: 38 | cs.CL Authors: Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi Title: TÜLU 3: Pushing Frontiers in Open Language Model Post-Training Arxiv: http://arxiv.org/abs/2411.15124...

Nov 26, 2024•26 min•Ep. 139

Style-Friendly SNR Sampler for Style-Driven Generation

🤗 Paper Upvotes: 28 | cs.CV Authors: Jooyoung Choi, Chaehun Shin, Yeongtak Oh, Heeseung Kim, Sungroh Yoon Title: Style-Friendly SNR Sampler for Style-Driven Generation Arxiv: http://arxiv.org/abs/2411.14793v1 Abstract: Recent large-scale diffusion models generate high-quality images but struggle to learn new, personalized artistic styles, which limits the creation of unique style templates. Fine-tuning with reference images is the most promising approach, but it often blindly utilizes objective...

Nov 26, 2024•20 min•Ep. 138

OminiControl: Minimal and Universal Control for Diffusion Transformer

🤗 Paper Upvotes: 22 | cs.CV, cs.AI, cs.LG Authors: Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang Title: OminiControl: Minimal and Universal Control for Diffusion Transformer Arxiv: http://arxiv.org/abs/2411.15098v1 Abstract: In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enablin...

Nov 26, 2024•26 min•Ep. 137

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

🤗 Paper Upvotes: 15 | cs.CL, cs.LG, 68T50, I.2.7 Authors: Gabriel Chua, Shing Yee Chan, Shaun Khoo Title: A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection Arxiv: http://arxiv.org/abs/2411.12946v1 Abstract: Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false...

Nov 26, 2024•24 min•Ep. 136

BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

🤗 Paper Upvotes: 14 | cs.AI Authors: Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel Title: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games Arxiv: http://arxiv.org/abs/2411.13543v1 Abstract: Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning ...

Nov 26, 2024•27 min•Ep. 135

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

🤗 Paper Upvotes: 12 | cs.CV, cs.CL Authors: Kaichen Zhang, Yifei Shen, Bo Li, Ziwei Liu Title: Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Arxiv: http://arxiv.org/abs/2411.14982v1 Abstract: Recent advances in Large Multimodal Models (LMMs) lead to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this ques...

Nov 26, 2024•23 min•Ep. 134

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

🤗 Paper Upvotes: 9 | cs.CV, cs.AI, cs.CL Authors: Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu Title: VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Arxiv: http://arxiv.org/abs/2411.14794v1 Abstract: The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scar...

Nov 26, 2024•21 min•Ep. 133

Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

🤗 Paper Upvotes: 9 | cs.CV, cs.AI, cs.LG Authors: Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo Title: Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction Arxiv: http://arxiv.org/abs/2411.14762v1 Abstract: Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the tempora...

Nov 26, 2024•26 min•Ep. 132

MyTimeMachine: Personalized Facial Age Transformation

🤗 Paper Upvotes: 8 | cs.CV Authors: Luchao Qi, Jiaye Wu, Bang Gong, Annie N. Wang, David W. Jacobs, Roni Sengupta Title: MyTimeMachine: Personalized Facial Age Transformation Arxiv: http://arxiv.org/abs/2411.14521v1 Abstract: Facial aging is a complex process, highly dependent on multiple factors like gender, ethnicity, lifestyle, etc., making it extremely challenging to learn a global aging prior to predict aging for any individual accurately. Existing techniques often produce realistic and pl...

Nov 26, 2024•22 min•Ep. 131

Novel View Extrapolation with Video Diffusion Priors

🤗 Paper Upvotes: 7 | cs.CV Authors: Kunhao Liu, Ling Shao, Shijian Lu Title: Novel View Extrapolation with Video Diffusion Priors Arxiv: http://arxiv.org/abs/2411.14208v1 Abstract: The field of novel view synthesis has made significant strides thanks to the development of radiance field methods. However, most radiance field techniques are far better at novel view interpolation than novel view extrapolation where the synthesized novel views are far beyond the observed training views. We design Vie...

Nov 26, 2024•21 min•Ep. 130

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

🤗 Paper Upvotes: 42 | cs.CL, cs.CV Authors: Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai Title: Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Arxiv: http://arxiv.org/abs/2411.10442v1 Abstract: Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these mode...

Nov 23, 2024•19 min•Ep. 129

Multimodal Autoregressive Pre-training of Large Vision Encoders

🤗 Paper Upvotes: 23 | cs.CV, cs.LG Authors: Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby Title: Multimodal Autoregressive Pre-training of Large Vision Encoders Arxiv: http://arxiv.org/abs/2411.14402v1 Abstract: We introduce a novel method for pre-training of large-scale vision encode...

Nov 23, 2024•24 min•Ep. 128

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

🤗 Paper Upvotes: 23 | cs.CL Authors: Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang Title: Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions Arxiv: http://arxiv.org/abs/2411.14405v1 Abstract: Currently OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- w...

Nov 23, 2024•19 min•Ep. 127

Hymba: A Hybrid-head Architecture for Small Language Models

🤗 Paper Upvotes: 20 | cs.CL, cs.AI, cs.LG Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov Title: Hymba: A Hybrid-head Architecture for Small Language Models Arxiv: http://arxiv.org/abs/2411.13676v1 Abstract: We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer atte...

Nov 23, 2024•24 min•Ep. 126

Natural Language Reinforcement Learning

🤗 Paper Upvotes: 15 | cs.LG, cs.AI, cs.CL Authors: Xidong Feng, Ziyu Wan, Haotian Fu, Bo Liu, Mengyue Yang, Girish A. Koushik, Zhiyuan Hu, Ying Wen, Jun Wang Title: Natural Language Reinforcement Learning Arxiv: http://arxiv.org/abs/2411.14251v1 Abstract: Reinforcement Learning (RL) mathematically formulates decision-making with Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. Thi...

Nov 23, 2024•24 min•Ep. 125

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

🤗 Paper Upvotes: 15 | cs.CL, cs.AI, cs.DL, cs.IR, cs.LG Authors: Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi Title: OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs Arxiv: http...

Nov 23, 2024•23 min•Ep. 124

Ultra-Sparse Memory Network

🤗 Paper Upvotes: 14 | cs.LG Authors: Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou Title: Ultra-Sparse Memory Network Arxiv: http://arxiv.org/abs/2411.12364v1 Abstract: It is widely acknowledged that the performance of Transformer models is exponentially related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference...

Nov 23, 2024•20 min•Ep. 123

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

🤗 Paper Upvotes: 10 | cs.CV Authors: Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu Title: Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Arxiv: http://arxiv.org/abs/2411.14432v1 Abstract: Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasonin...

Nov 23, 2024•24 min•Ep. 122

Stable Flow: Vital Layers for Training-Free Image Editing

🤗 Paper Upvotes: 7 | cs.CV, cs.GR, cs.LG Authors: Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, Daniel Cohen-Or Title: Stable Flow: Vital Layers for Training-Free Image Editing Arxiv: http://arxiv.org/abs/2411.14430v1 Abstract: Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training...

Nov 23, 2024•23 min•Ep. 121

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

🤗 Paper Upvotes: 6 | cs.CL, cs.AI, cs.LG Authors: Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda Title: Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Arxiv: http://arxiv.org/abs/2411.14257v1 Abstract: Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability too...

Nov 23, 2024•22 min•Ep. 120

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

🤗 Paper Upvotes: 35 | cs.LG, cs.AI, cs.CV, cs.NE, cs.PF Authors: Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen Title: SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration Arxiv: http://arxiv.org/abs/2411.10958v1 Abstract: Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. SageAttention utilizes 8-bit matrix multiplication, 16-bit matrix multip...

Nov 22, 2024•23 min•Ep. 119

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

🤗 Paper Upvotes: 23 | cs.CV Authors: Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu Title: VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models Arxiv: http://arxiv.org/abs/2411.13503v1 Abstract: Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A c...

Nov 22, 2024•25 min•Ep. 118

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

🤗 Paper Upvotes: 14 | cs.CV, cs.AI, cs.CL, cs.MM Authors: Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li Title: VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Arxiv: http://arxiv.org/abs/2411.13281v1 Abstract: Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice ...

Nov 22, 2024•24 min•Ep. 117

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

🤗 Paper Upvotes: 12 | cs.CV Authors: Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang Title: SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Arxiv: http://arxiv.org/abs/2411.11922v1 Abstract: The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding obj...

Nov 22, 2024•22 min•Ep. 116

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

🤗 Paper Upvotes: 9 | cs.AI Authors: Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su Title: Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents Arxiv: http://arxiv.org/abs/2411.06559v1 Abstract: Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still underperform largely compared to humans. While incorporating advanced p...

Nov 22, 2024•22 min•Ep. 115

When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

🤗 Paper Upvotes: 7 | cs.CL Authors: Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang Title: When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training Arxiv: http://arxiv.org/abs/2411.13476v1 Abstract: Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding pro...

Nov 22, 2024•25 min•Ep. 114

Stylecodes: Encoding Stylistic Information For Image Generation

🤗 Paper Upvotes: 6 | cs.CV Authors: Ciara Rowles Title: Stylecodes: Encoding Stylistic Information For Image Generation Arxiv: http://arxiv.org/abs/2411.12811v1 Abstract: Diffusion models excel in image generation, but controlling them remains a challenge. We focus on the problem of style-conditioned image generation. Although example images work, they are cumbersome: srefs (style-reference codes) from MidJourney solve this issue by expressing a specific image style in a short numeric code. The...

Nov 22, 2024•21 min•Ep. 113

ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

🤗 Paper Upvotes: 3 | cs.CV, cs.AI Authors: Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das Title: ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models Arxiv: http://arxiv.org/abs/2411.10867v1 Abstract: Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video underst...

Nov 22, 2024•24 min•Ep. 112

Hosted on Transistor