Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers).

Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Learning Flow Fields in Attention for Controllable Person Image Generation

🤗 Upvotes: 16 | cs.CV Authors: Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He Title: Learning Flow Fields in Attention for Controllable Person Image Generation Arxiv: http://arxiv.org/abs/2412.08486v2 Abstract: Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose...

Dec 13, 2024•21 min•Ep. 201

StyleMaster: Stylize Your Video with Artistic Generation and Translation

🤗 Upvotes: 14 | cs.CV Authors: Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, Wenhan Luo Title: StyleMaster: Stylize Your Video with Artistic Generation and Translation Arxiv: http://arxiv.org/abs/2412.07744v1 Abstract: Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage mat...

Dec 13, 2024•23 min•Ep. 200

StreamChat: Chatting with Streaming Video

🤗 Upvotes: 12 | cs.CV Authors: Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez Title: StreamChat: Chatting with Streaming Video Arxiv: http://arxiv.org/abs/2412.08646v1 Abstract: This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment ...

Dec 13, 2024•20 min•Ep. 199

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

🤗 Upvotes: 11 | cs.CV Authors: Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, Jieneng Chen Title: 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark Arxiv: http://arxiv.org/abs/2412.07825v1 Abstract: 3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range...

Dec 13, 2024•25 min•Ep. 198

Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction

🤗 Upvotes: 11 | cs.CV, cs.GR Authors: Seungtae Nam, Xiangyu Sun, Gyeongjin Kang, Younggeun Lee, Seungjun Oh, Eunbyung Park Title: Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction Arxiv: http://arxiv.org/abs/2412.06234v2 Abstract: Generalized feed-forward Gaussian models have achieved significant progress in sparse-view 3D reconstruction by leveraging prior knowledge from large multi-view datasets. However, these models often struggle to r...

Dec 13, 2024•23 min•Ep. 197

The BrowserGym Ecosystem for Web Agent Research

🤗 Upvotes: 11 | cs.LG, cs.AI, cs.SE Authors: Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste Title: The BrowserGym Ecosystem for Web Agent Research Arxiv: http://arxiv.org/abs/2412.05467v3 Abstract: The BrowserGym ecos...

Dec 13, 2024•25 min•Ep. 196

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

🤗 Upvotes: 31 | cs.CV Authors: Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xiangtai Li, Yunhai Tong Title: DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation Arxiv: http://arxiv.org/abs/2412.07589v1 Abstract: Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, part...

Dec 12, 2024•22 min•Ep. 195

Hidden in the Noise: Two-Stage Robust Watermarking for Images

🤗 Upvotes: 20 | cs.CV, cs.AI, cs.LG Authors: Kasra Arabi, Benjamin Feuer, R. Teal Witter, Chinmay Hegde, Niv Cohen Title: Hidden in the Noise: Two-Stage Robust Watermarking for Images Arxiv: http://arxiv.org/abs/2412.04653v2 Abstract: As the quality of image generators continues to improve, deepfakes become a topic of considerable societal debate. Image watermarking allows responsible model owners to detect and label their AI-generated content, which can mitigate the harm. Yet, current state-of...

Dec 12, 2024•21 min•Ep. 194

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

🤗 Upvotes: 19 | cs.CV Authors: Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, Gordon Wetzstein Title: FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models Arxiv: http://arxiv.org/abs/2412.07674v1 Abstract: Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-expe...

Dec 12, 2024•20 min•Ep. 193

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

🤗 Upvotes: 18 | cs.CV Authors: Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, Hengshuang Zhao Title: UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics Arxiv: http://arxiv.org/abs/2412.07774v1 Abstract: We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by tasks, yet share fundamental princi...

Dec 12, 2024•24 min•Ep. 192

3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

🤗 Upvotes: 17 | cs.CV Authors: Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, Dahua Lin Title: 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation Arxiv: http://arxiv.org/abs/2412.07759v1 Abstract: This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved rema...

Dec 12, 2024•24 min•Ep. 191

Mobile Video Diffusion

🤗 Upvotes: 16 | cs.CV, cs.AI Authors: Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian Title: Mobile Video Diffusion Arxiv: http://arxiv.org/abs/2412.07583v1 Abstract: Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video ...

Dec 12, 2024•25 min•Ep. 190

Granite Guardian

🤗 Upvotes: 16 | cs.CL Authors: Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri Title: Granite Guardian Arxiv: http://arxiv.org/abs/2412.07724v1 Abstract: We introduce ...

Dec 12, 2024•21 min•Ep. 189

Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

🤗 Upvotes: 54 | cs.LG, cs.AI Authors: Egor Cherepanov, Nikita Kachaev, Artem Zholus, Alexey K. Kovalev, Aleksandr I. Panov Title: Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation Arxiv: http://arxiv.org/abs/2412.06531v1 Abstract: The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the utilization of past information, adapt...

Dec 11, 2024•19 min•Ep. 188

ProcessBench: Identifying Process Errors in Mathematical Reasoning

🤗 Upvotes: 38 | cs.AI, cs.CL, cs.LG Authors: Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin Title: ProcessBench: Identifying Process Errors in Mathematical Reasoning Arxiv: http://arxiv.org/abs/2412.06559v2 Abstract: As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we in...

Dec 11, 2024•21 min•Ep. 187

Training Large Language Models to Reason in a Continuous Latent Space

🤗 Upvotes: 25 | cs.CL Authors: Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian Title: Training Large Language Models to Reason in a Continuous Latent Space Arxiv: http://arxiv.org/abs/2412.06769v1 Abstract: Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not alwa...

Dec 11, 2024•22 min•Ep. 186

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

🤗 Upvotes: 10 | cs.CV Authors: Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan Title: Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation Arxiv: http://arxiv.org/abs/2412.04432v1 Abstract: In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tok...

Dec 11, 2024•24 min•Ep. 185

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

🤗 Upvotes: 9 | cs.CV, cs.LG Authors: Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu Title: Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation Arxiv: http://arxiv.org/abs/2412.06781v1 Abstract: Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook t...

Dec 11, 2024•22 min•Ep. 184

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG Authors: Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan Title: Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models Arxiv: http://arxiv.org/abs/2412.05939v1 Abstract: Multimodal Large Language Models (MLLMs) excel in vision--language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and ...

Dec 11, 2024•23 min•Ep. 183

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

🤗 Upvotes: 7 | cs.CV Authors: Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang Title: You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale Arxiv: http://arxiv.org/abs/2412.06699v1 Abstract: Recent 3D generation models typically rely on limited-scale 3D `gold-labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In th...

Dec 11, 2024•20 min•Ep. 182

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

🤗 Upvotes: 7 | cs.CV, cs.AI, cs.IR Authors: Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He Title: OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations Arxiv: http://arxiv.org/abs/2412.07626v1 Abstract: Document content extraction is crucial in computer vision, especially for meeti...

Dec 11, 2024•20 min•Ep. 181

Robust Multi-bit Text Watermark with LLM-based Paraphrasers

🤗 Upvotes: 5 | cs.AI Authors: Xiaojun Xu, Jinghan Jia, Yuanshun Yao, Yang Liu, Hang Li Title: Robust Multi-bit Text Watermark with LLM-based Paraphrasers Arxiv: http://arxiv.org/abs/2412.03123v1 Abstract: We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently so that their paraphrasing difference reflected in the text semantics can be identified by a trained decoder. To embed our mu...

Dec 11, 2024•18 min•Ep. 180

MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views

🤗 Upvotes: 4 | cs.CV, cs.GR Authors: Antoine Guédon, Tomoki Ichikawa, Kohei Yamashita, Ko Nishino Title: MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views Arxiv: http://arxiv.org/abs/2412.06767v1 Abstract: We present a novel appearance model that simultaneously realizes explicit high-quality 3D surface mesh recovery and photorealistic novel view synthesis from sparse view samples. Our key idea is to model the underlying scene geometry Mesh as an Atla...

Dec 11, 2024•22 min•Ep. 179

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

🤗 Upvotes: 33 | cs.CV Authors: Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li Title: LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment Arxiv: http://arxiv.org/abs/2412.04814v1 Abstract: Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult ...

Dec 10, 2024•20 min•Ep. 178

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases

🤗 Upvotes: 31 | cs.CL Authors: LG AI Research, Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun Title:...

Dec 10, 2024•22 min•Ep. 177

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

🤗 Upvotes: 30 | cs.CL, cs.CV Authors: Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue Title: MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Arxiv: http://arxiv.org/abs/2412.05237v1 Abstract: Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning dat...

Dec 10, 2024•22 min•Ep. 176

APOLLO: SGD-like Memory, AdamW-level Performance

🤗 Upvotes: 27 | cs.LG, cs.AI, cs.PF Authors: Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee Title: APOLLO: SGD-like Memory, AdamW-level Performance Arxiv: http://arxiv.org/abs/2412.05270v2 Abstract: Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting tr...

Dec 10, 2024•20 min•Ep. 175

SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

🤗 Upvotes: 19 | cs.CV Authors: Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, Cuong Pham Title: SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion Arxiv: http://arxiv.org/abs/2412.04301v2 Abstract: Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for r...

Dec 10, 2024•20 min•Ep. 174

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

🤗 Upvotes: 18 | cs.RO, cs.AI, cs.CL, cs.CV, cs.LG Authors: Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu Title: Moto: Latent Motion Token as the Bridging Language for Robot Manipulation Arxiv: http://arxiv.org/abs/2412.04445v1 Abstract: Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which ...

Dec 10, 2024•20 min•Ep. 173

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

🤗 Upvotes: 13 | cs.CV Authors: Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu Title: GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration Arxiv: http://arxiv.org/abs/2412.04440v1 Abstract: Text-to-video generation models have shown significant progress in the recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associa...

Dec 10, 2024•23 min•Ep. 172
