Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

🤗 Upvotes: 12 | cs.CV
Authors: Jixuan Fan, Wanhua Li, Yifei Han, Yansong Tang
Title: Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
Arxiv: http://arxiv.org/abs/2412.04887v1
Abstract: 3D Gaussian Splatting has demonstrated notable success in large-scale scene reconstruction, but challenges persist due to high training memory consumption and storage overhead. Hybrid representations that integrate implicit and explicit features offer a way to mitigate ...

Dec 10, 2024•21 min•Ep. 171

CompCap: Improving Multimodal Large Language Models with Composite Captions

🤗 Upvotes: 11 | cs.CV, cs.AI, cs.LG
Authors: Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He
Title: CompCap: Improving Multimodal Large Language Models with Composite Captions
Arxiv: http://arxiv.org/abs/2412.05243v1
Abstract: How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual element...

Dec 10, 2024•22 min•Ep. 170

VisionZip: Longer is Better but Not Necessary in Vision Language Models

🤗 Upvotes: 83 | cs.CV, cs.AI, cs.CL, cs.LG
Authors: Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
Title: VisionZip: Longer is Better but Not Necessary in Vision Language Models
Arxiv: http://arxiv.org/abs/2412.04467v1
Abstract: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the...

Dec 08, 2024•22 min•Ep. 169

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

🤗 Upvotes: 46 | cs.CV, cs.AI
Authors: Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao
Title: Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Arxiv: http://arxiv.org/abs/2412.04424v1
Abstract: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style v...

Dec 08, 2024•19 min•Ep. 168

NVILA: Efficient Frontier Visual Language Models

🤗 Upvotes: 36 | cs.CV
Authors: Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu
Title: NVILA: Efficient Frontier Visual Language Models
Arxiv: http://arxiv.org/abs/2412.04468v1
Abstract: Visual language mod...

Dec 08, 2024•20 min•Ep. 167

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

🤗 Upvotes: 32 | cs.CL
Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
Title: Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Arxiv: http://arxiv.org/abs/2412.04454v1
Abstract: Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representat...

Dec 08, 2024•21 min•Ep. 166

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV, cs.LG
Authors: Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang
Title: Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Arxiv: http://arxiv.org/abs/2412.04455v1
Abstract: Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively ...

Dec 08, 2024•23 min•Ep. 165

Evaluating Language Models as Synthetic Data Generators

🤗 Upvotes: 30 | cs.CL
Authors: Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
Title: Evaluating Language Models as Synthetic Data Generators
Arxiv: http://arxiv.org/abs/2412.03679v1
Abstract: Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While...

Dec 08, 2024•21 min•Ep. 164

A Noise is Worth Diffusion Guidance

🤗 Upvotes: 25 | cs.CV, cs.AI, cs.LG
Authors: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, Seungryong Kim
Title: A Noise is Worth Diffusion Guidance
Arxiv: http://arxiv.org/abs/2412.03895v1
Abstract: Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are gu...

Dec 08, 2024•21 min•Ep. 163

Structured 3D Latents for Scalable and Versatile 3D Generation

🤗 Upvotes: 22 | cs.CV
Authors: Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang
Title: Structured 3D Latents for Scalable and Versatile 3D Generation
Arxiv: http://arxiv.org/abs/2412.01506v1
Abstract: We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields,...

Dec 08, 2024•24 min•Ep. 162

Negative Token Merging: Image-based Adversarial Feature Guidance

🤗 Upvotes: 21 | cs.CV, cs.AI, cs.GR, cs.LG, stat.ML
Authors: Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
Title: Negative Token Merging: Image-based Adversarial Feature Guidance
Arxiv: http://arxiv.org/abs/2412.01339v2
Abstract: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to steer diffusion models away from producing undesired concepts. While u...

Dec 08, 2024•20 min•Ep. 161

MV-Adapter: Multi-view Consistent Image Generation Made Easy

🤗 Upvotes: 17 | cs.CV
Authors: Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng
Title: MV-Adapter: Multi-view Consistent Image Generation Made Easy
Arxiv: http://arxiv.org/abs/2412.03632v1
Abstract: Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) de...

Dec 08, 2024•21 min•Ep. 160

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

🤗 Paper Upvotes: 48 | cs.CV, cs.AI, cs.CL, cs.HC
Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
Title: ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Arxiv: http://arxiv.org/abs/2411.17465v1
Abstract: Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-r...

Nov 28, 2024•25 min•Ep. 159

Star Attention: Efficient LLM Inference over Long Sequences

🤗 Paper Upvotes: 32 | cs.CL, cs.AI, cs.LG
Authors: Shantanu Acharya, Fei Jia, Boris Ginsburg
Title: Star Attention: Efficient LLM Inference over Long Sequences
Arxiv: http://arxiv.org/abs/2411.17116v1
Abstract: Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding ...

Nov 28, 2024•21 min•Ep. 158

Pathways on the Image Manifold: Image Editing via Video Generation

🤗 Paper Upvotes: 23 | cs.CV, cs.AI, cs.LG
Authors: Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, Ron Kimmel
Title: Pathways on the Image Manifold: Image Editing via Video Generation
Arxiv: http://arxiv.org/abs/2411.16819v1
Abstract: Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise f...

Nov 28, 2024•25 min•Ep. 157

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

🤗 Paper Upvotes: 15 | cs.CV, cs.AI, cs.CL
Authors: Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He
Title: MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Arxiv: http://arxiv.org/abs/2411.15296v1
Abstract: As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building ...

Nov 28, 2024•26 min•Ep. 156

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

🤗 Paper Upvotes: 14 | cs.CV
Authors: Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
Title: Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
Arxiv: http://arxiv.org/abs/2411.17686v1
Abstract: To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We regret to find that the critical components of ...

Nov 28, 2024•22 min•Ep. 155

SketchAgent: Language-Driven Sequential Sketch Generation

🤗 Paper Upvotes: 13 | cs.CV
Authors: Yael Vinker, Tamar Rott Shaham, Kristine Zheng, Alex Zhao, Judith E Fan, Antonio Torralba
Title: SketchAgent: Language-Driven Sequential Sketch Generation
Arxiv: http://arxiv.org/abs/2411.17673v1
Abstract: Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, ...

Nov 28, 2024•25 min•Ep. 154

TEXGen: a Generative Diffusion Model for Mesh Textures

🤗 Paper Upvotes: 12 | cs.CV, cs.AI, cs.GR
Authors: Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, JianHui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, Xiaojuan Qi
Title: TEXGen: a Generative Diffusion Model for Mesh Textures
Arxiv: http://arxiv.org/abs/2411.14740v1
Abstract: While high-quality texture maps are essential for realistic 3D asset rendering, few studies have explored learning directly in the texture space, especially on large-scale datasets. In this work, we depart from the conventio...

Nov 28, 2024•24 min•Ep. 153

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

🤗 Paper Upvotes: 8 | cs.CV, cs.CL
Authors: Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
Title: VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Arxiv: http://arxiv.org/abs/2411.17451v1
Abstract: Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explor...

Nov 28, 2024•22 min•Ep. 152

Learning 3D Representations from Procedural 3D Programs

🤗 Paper Upvotes: 8 | cs.CV
Authors: Xuweiyi Chen, Zezhou Cheng
Title: Learning 3D Representations from Procedural 3D Programs
Arxiv: http://arxiv.org/abs/2411.17467v1
Abstract: Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyr...

Nov 28, 2024•24 min•Ep. 151

SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

🤗 Paper Upvotes: 7 | cs.CV
Authors: Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, XIngang Pan
Title: SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Arxiv: http://arxiv.org/abs/2411.16856v1
Abstract: Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advan...

Nov 28, 2024•26 min•Ep. 150

Material Anything: Generating Materials for Any 3D Object via Diffusion

🤗 Paper Upvotes: 33 | cs.CV, cs.GR
Authors: Xin Huang, Tengfei Wang, Ziwei Liu, Qing Wang
Title: Material Anything: Generating Materials for Any 3D Object via Diffusion
Arxiv: http://arxiv.org/abs/2411.15138v1
Abstract: We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solut...

Nov 27, 2024•22 min•Ep. 149

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

🤗 Paper Upvotes: 28 | cs.CV
Authors: Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon
Title: Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Arxiv: http://arxiv.org/abs/2411.15466v1
Abstract: Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- a...

Nov 27, 2024•27 min•Ep. 148

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

🤗 Paper Upvotes: 19 | cs.AI, cs.CL
Authors: Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu
Title: From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Arxiv: http://arxiv.org/abs/2411.16594v1
Abstract: Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditi...

Nov 27, 2024•22 min•Ep. 147

O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?

🤗 Paper Upvotes: 18 | cs.CL, cs.AI
Authors: Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, Pengfei Liu
Title: O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Arxiv: http://arxiv.org/abs/2411.16489v1
Abstract: This paper presents a critical examination of current approaches to replicating OpenAI's O1 model capabilities, with particular focus on the widespread but o...

Nov 27, 2024•21 min•Ep. 146

MH-MoE: Multi-Head Mixture-of-Experts

🤗 Paper Upvotes: 17 | cs.CL
Authors: Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei
Title: MH-MoE: Multi-Head Mixture-of-Experts
Arxiv: http://arxiv.org/abs/2411.16205v2
Abstract: Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with spars...

Nov 27, 2024•21 min•Ep. 145

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

🤗 Paper Upvotes: 15 | cs.CV
Authors: Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He
Title: GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Arxiv: http://arxiv.org/abs/2411.14522v1
Abstract: Despite significant advancements in general artificial intelligence, such as...

Nov 27, 2024•21 min•Ep. 144

DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation

🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL
Authors: Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
Title: DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Arxiv: http://arxiv.org/abs/2411.16657v1
Abstract: Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content...

Nov 27, 2024•23 min•Ep. 143

Knowledge Transfer Across Modalities with Natural Language Supervision

🤗 Paper Upvotes: 13 | cs.CV, 68T45 (Primary) 68T50 (Secondary), I.2.6
Authors: Carlo Alberto Barbano, Luca Molinaro, Emanuele Aiello, Marco Grangetto
Title: Knowledge Transfer Across Modalities with Natural Language Supervision
Arxiv: http://arxiv.org/abs/2411.15611v1
Abstract: We present a way to learn novel concepts by only using their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We h...

Nov 27, 2024•21 min•Ep. 142


Hosted on Transistor
