Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers).

Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM


Episodes

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

🤗 Upvotes: 25 | cs.CL, cs.AI, cs.IR
Authors: Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
Title: RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
Arxiv: http://arxiv.org/abs/2412.11919v1
Abstract: Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but exi...

Dec 18, 2024•22 min•Ep. 231

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL
Authors: Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, Ziwei Liu
Title: Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
Arxiv: http://arxiv.org/abs/2412.09645v2
Abstract: Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the proces...

Dec 18, 2024•21 min•Ep. 230

BrushEdit: All-In-One Image Inpainting and Editing

🤗 Upvotes: 24 | cs.CV, cs.AI
Authors: Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, Yuexian Zou, Qiang Xu
Title: BrushEdit: All-In-One Image Inpainting and Editing
Arxiv: http://arxiv.org/abs/2412.10316v2
Abstract: Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structur...

Dec 18, 2024•28 min•Ep. 229

ColorFlow: Retrieval-Augmented Image Sequence Colorization

🤗 Upvotes: 20 | cs.CV
Authors: Junhao Zhuang, Xuan Ju, Zhaoyang Zhang, Yong Liu, Shiyi Zhang, Chun Yuan, Ying Shan
Title: ColorFlow: Retrieval-Augmented Image Sequence Colorization
Arxiv: http://arxiv.org/abs/2412.11815v1
Abstract: Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale ...

Dec 18, 2024•23 min•Ep. 228

Smaller Language Models Are Better Instruction Evolvers

🤗 Upvotes: 16 | cs.CL
Authors: Tingfeng Hui, Lulu Zhao, Guanting Dong, Yaqi Zhang, Hua Zhou, Sen Su
Title: Smaller Language Models Are Better Instruction Evolvers
Arxiv: http://arxiv.org/abs/2412.11231v1
Abstract: Instruction tuning has been widely used to unleash the complete potential of large language models. Notably, complex and diverse instructions are of significant importance as they can effectively align models with various downstream tasks. However, current approaches to constructing l...

Dec 18, 2024•23 min•Ep. 227

Causal Diffusion Transformers for Generative Modeling

🤗 Upvotes: 16 | cs.CV
Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, Haoqi Fan
Title: Causal Diffusion Transformers for Generative Modeling
Arxiv: http://arxiv.org/abs/2412.12095v2
Abstract: We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attemp...

Dec 18, 2024•24 min•Ep. 226

SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG
Authors: Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Title: SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
Arxiv: http://arxiv.org/abs/2412.11605v1
Abstract: Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately r...

Dec 18, 2024•23 min•Ep. 225

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

🤗 Upvotes: 11 | cs.CV
Authors: Zhibing Li, Tong Wu, Jing Tan, Mengchen Zhang, Jiaqi Wang, Dahua Lin
Title: IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
Arxiv: http://arxiv.org/abs/2412.12083v1
Abstract: Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and env...

Dec 18, 2024•20 min•Ep. 224

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

🤗 Upvotes: 10 | cs.RO, cs.AI, cs.CV
Authors: Xinli Xu, Wenhang Ge, Dicong Qiu, ZhiFei Chen, Dongyu Yan, Zhuoyun Liu, Haoyu Zhao, Hanfeng Zhao, Shunsi Zhang, Junwei Liang, Ying-Cong Chen
Title: GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Arxiv: http://arxiv.org/abs/2412.11258v1
Abstract: Estimating physical properties for visual data is a crucial task in computer vision, graphics, and robotics, underpinning applications such as augmented reality, physical simulati...

Dec 18, 2024•21 min•Ep. 223

Apollo: An Exploration of Video Understanding in Large Multimodal Models

🤗 Upvotes: 91 | cs.CV, cs.AI
Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
Title: Apollo: An Exploration of Video Understanding in Large Multimodal Models
Arxiv: http://arxiv.org/abs/2412.10360v1
Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understandin...

Dec 17, 2024•25 min•Ep. 222

GenEx: Generating an Explorable World

🤗 Upvotes: 65 | cs.CV, cs.RO
Authors: Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, Jieneng Chen
Title: GenEx: Generating an Explorable World
Arxiv: http://arxiv.org/abs/2412.09624v1
Abstract: Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a sys...

Dec 17, 2024•21 min•Ep. 221

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

🤗 Upvotes: 29 | cs.CV
Authors: Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai
Title: SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Arxiv: http://arxiv.org/abs/2412.09604v1
Abstract: The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Rece...

Dec 17, 2024•25 min•Ep. 220

BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

🤗 Upvotes: 24 | cs.CV
Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal
Title: BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Arxiv: http://arxiv.org/abs/2412.07769v1
Abstract: This paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model (LMM) with a unified architecture that integrates te...

Dec 17, 2024•18 min•Ep. 219

Large Action Models: From Inception to Implementation

🤗 Upvotes: 23 | cs.AI
Authors: Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Title: Large Action Models: From Inception to Implementation
Arxiv: http://arxiv.org/abs/2412.10047v1
Abstract: As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents...

Dec 17, 2024•22 min•Ep. 218

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

🤗 Upvotes: 17 | cs.CV, cs.AI
Authors: Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai
Title: InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Arxiv: http://arxiv.org/abs/2412.09283v1
Abstract: Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. H...

Dec 17, 2024•21 min•Ep. 217

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

🤗 Upvotes: 13 | cs.CV
Authors: Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu
Title: FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Arxiv: http://arxiv.org/abs/2412.09626v1
Abstract: Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate hi...

Dec 17, 2024•22 min•Ep. 216

ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

🤗 Upvotes: 10 | cs.CV
Authors: Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, Yedid Hoshen
Title: ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
Arxiv: http://arxiv.org/abs/2412.08645v1
Abstract: This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods strugg...

Dec 17, 2024•22 min•Ep. 215

FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing

🤗 Upvotes: 8 | cs.CV
Authors: Yingying Deng, Xiangyu He, Changwang Mei, Peisong Wang, Fan Tang
Title: FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Arxiv: http://arxiv.org/abs/2412.07517v1
Abstract: Though Rectified Flows (ReFlows) with distillation offers a promising way for fast sampling, its fast inversion transforms images back to structured noise for recovery and following editing remains unsolved. This paper introduces FireFlow, a simple yet effective zero-shot app...

Dec 17, 2024•22 min•Ep. 214

FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

🤗 Upvotes: 7 | cs.CV
Authors: Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag
Title: FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
Arxiv: http://arxiv.org/abs/2412.09611v1
Abstract: Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitati...

Dec 17, 2024•19 min•Ep. 213

Phi-4 Technical Report

🤗 Upvotes: 40 | cs.CL, cs.AI
Authors: Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang
Title: Phi-4 Technical Report
Arxiv: http://arxiv.org/abs/2412.08905v1
Abstract: ...

Dec 14, 2024•22 min•Ep. 212

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL
Authors: Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger
Title: Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Arxiv: http://arxiv.org/abs/2412.08737v1
Abstract: Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) -- particularly the ability to accurately describe the geometric details of an image. This capabi...

Dec 14, 2024•24 min•Ep. 211

Multimodal Latent Language Modeling with Next-Token Diffusion

🤗 Upvotes: 21 | cs.CL, cs.CV, cs.LG
Authors: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei
Title: Multimodal Latent Language Modeling with Next-Token Diffusion
Arxiv: http://arxiv.org/abs/2412.08635v1
Abstract: Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly ...

Dec 14, 2024•23 min•Ep. 210

EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

🤗 Upvotes: 17 | cs.CV
Authors: Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li
Title: EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
Arxiv: http://arxiv.org/abs/2412.09618v1
Abstract: Significant achievements in personalization of diffusion models have been witnessed. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, b...

Dec 14, 2024•22 min•Ep. 209

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

🤗 Upvotes: 16 | cs.CL
Authors: Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu
Title: AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Arxiv: http://arxiv.org/abs/2412.09605v1
Abstract: Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-...

Dec 14, 2024•19 min•Ep. 208

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

🤗 Upvotes: 14 | cs.CV
Authors: Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S. -H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren
Title: SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Arxiv: http://arxiv.org/abs/2412.09619v1
Abstract: Existing text-to-image (T2I) dif...

Dec 14, 2024•19 min•Ep. 207

Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion

🤗 Upvotes: 13 | cs.CV
Authors: Zexin He, Tengfei Wang, Xin Huang, Xingang Pan, Ziwei Liu
Title: Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
Arxiv: http://arxiv.org/abs/2412.09593v1
Abstract: Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighti...

Dec 14, 2024•23 min•Ep. 206

JuStRank: Benchmarking LLM Judges for System Ranking

🤗 Upvotes: 9 | cs.CL, cs.AI, cs.LG
Authors: Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai
Title: JuStRank: Benchmarking LLM Judges for System Ranking
Arxiv: http://arxiv.org/abs/2412.09569v1
Abstract: Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution...

Dec 14, 2024•21 min•Ep. 205

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

🤗 Upvotes: 36 | cs.CV
Authors: Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang
Title: SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
Arxiv: http://arxiv.org/abs/2412.07760v1
Abstract: Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ens...

Dec 13, 2024•21 min•Ep. 204

LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

🤗 Upvotes: 28 | cs.CV
Authors: Zejian Li, Chenye Meng, Yize Li, Ling Yang, Shengyuan Zhang, Jiarui Ma, Jiayi Li, Guang Yang, Changyuan Yang, Zhiyuan Yang, Jinxiong Chang, Lingyun Sun
Title: LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
Arxiv: http://arxiv.org/abs/2412.08580v1
Abstract: Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I mod...

Dec 13, 2024•21 min•Ep. 203

POINTS1.5: Building a Vision-Language Model towards Real World Applications

🤗 Upvotes: 25 | cs.CV, cs.MM
Authors: Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou
Title: POINTS1.5: Building a Vision-Language Model towards Real World Applications
Arxiv: http://arxiv.org/abs/2412.08443v1
Abstract: Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g. optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POIN...

Dec 13, 2024•24 min•Ep. 202


Hosted on Transistor
