Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Multi-LLM Text Summarization

🤗 Upvotes: 3 | cs.CL Authors: Jiangnan Fang, Cheng-Tse Liu, Jieun Kim, Yash Bhedaru, Ethan Liu, Nikhil Singh, Nedim Lipka, Puneet Mathur, Nesreen K. Ahmed, Franck Dernoncourt, Ryan A. Rossi, Hanieh Deilamsalehy Title: Multi-LLM Text Summarization Arxiv: http://arxiv.org/abs/2412.15487v1 Abstract: In this work, we propose a Multi-LLM summarization framework, and investigate two different multi-LLM strategies including centralized and decentralized. Our multi-LLM summarization framework has two f...

Dec 24, 2024•23 min•Ep. 261

Qwen2.5 Technical Report

🤗 Upvotes: 236 | cs.CL Authors: Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zh...

Dec 21, 2024•26 min•Ep. 260

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

🤗 Upvotes: 44 | cs.CV, cs.CL Authors: Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong Title: MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval Arxiv: http://arxiv.org/abs/2412.14475v1 Abstract: Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages...

Dec 21, 2024•23 min•Ep. 259

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

🤗 Upvotes: 23 | cs.CL, cs.AI Authors: Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li Title: LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks Arxiv: http://arxiv.org/abs/2412.15204v1 Abstract: This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across re...

Dec 21, 2024•23 min•Ep. 258

How to Synthesize Text Data without Model Collapse?

🤗 Upvotes: 19 | cs.CL, cs.AI, cs.LG Authors: Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou Title: How to Synthesize Text Data without Model Collapse? Arxiv: http://arxiv.org/abs/2412.14689v1 Abstract: Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web d...

Dec 21, 2024•24 min•Ep. 257

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

🤗 Upvotes: 17 | cs.CV Authors: Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh Title: Flowing from Words to Pixels: A Framework for Cross-Modality Evolution Arxiv: http://arxiv.org/abs/2412.15213v1 Abstract: Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cros...

Dec 21, 2024•20 min•Ep. 256

Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion

🤗 Upvotes: 13 | cs.CV Authors: Jixuan He, Wanhua Li, Ye Liu, Junsik Kim, Donglai Wei, Hanspeter Pfister Title: Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion Arxiv: http://arxiv.org/abs/2412.14462v1 Abstract: As a common image editing operation, image composition involves integrating foreground objects into background scenes. In this paper, we expand the application of the concept of Affordance from human-centered image composition tasks to a more general object-scene compositi...

Dec 21, 2024•21 min•Ep. 255

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

🤗 Upvotes: 12 | cs.CV Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang Title: LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis Arxiv: http://arxiv.org/abs/2412.15214v1 Abstract: The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plan...

Dec 21, 2024•21 min•Ep. 254

DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

🤗 Upvotes: 8 | cs.CV, cs.AI, cs.GR Authors: Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan Title: DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation Arxiv: http://arxiv.org/abs/2412.15200v1 Abstract: Procedural Content Generation (PCG) is powerful in creating high-quality 3D contents, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation ai...

Dec 21, 2024•23 min•Ep. 253

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

🤗 Upvotes: 7 | cs.CL, cs.AI, cs.LG Authors: Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping Title: AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling Arxiv: http://arxiv.org/abs/2412.15084v1 Abstract: In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To devel...

Dec 21, 2024•24 min•Ep. 252

No More Adam: Learning Rate Scaling at Initialization is All You Need

🤗 Upvotes: 177 | cs.LG, cs.AI Authors: Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen Title: No More Adam: Learning Rate Scaling at Initialization is All You Need Arxiv: http://arxiv.org/abs/2412.11768v2 Abstract: In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) to distinct paramet...

Dec 20, 2024•22 min•Ep. 251

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

🤗 Upvotes: 36 | cs.CL, cs.AI Authors: Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli Title: Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Arxiv: http://arxiv.org/abs/2412.13663v2 Abstract: Encoder-only transformer models such as BERT offer a g...

Dec 20, 2024•22 min•Ep. 250

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

🤗 Upvotes: 30 | cs.CL Authors: Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig Title: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks Arxiv: http://arxiv.org/abs/2412.14161v1 Abstract: We interact with computers on an everyday basis, be it in everyd...

Dec 20, 2024•25 min•Ep. 249

AniDoc: Animation Creation Made Easier

🤗 Upvotes: 29 | cs.CV Authors: Yihao Meng, Hao Ouyang, Hanlin Wang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Zhiheng Liu, Yujun Shen, Huamin Qu Title: AniDoc: Animation Creation Made Easier Arxiv: http://arxiv.org/abs/2412.14173v1 Abstract: The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs in the above process by harnessing the ...

Dec 20, 2024•22 min•Ep. 248

FashionComposer: Compositional Fashion Image Generation

🤗 Upvotes: 13 | cs.CV Authors: Sihui Ji, Yiyang Wang, Xi Chen, Xiaogang Xu, Hao Luo, Hengshuang Zhao Title: FashionComposer: Compositional Fashion Image Generation Arxiv: http://arxiv.org/abs/2412.14168v2 Abstract: We present FashionComposer for compositional fashion image generation. Unlike previous methods, FashionComposer is highly flexible. It takes multi-modal input (i.e., text prompt, parametric human model, garment image, and face image) and supports personalizing the appearance, pose, a...

Dec 20, 2024•20 min•Ep. 247

GUI Agents: A Survey

🤗 Upvotes: 11 | cs.AI, cs.HC Authors: Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt Title: GUI Agents: A Survey Arxiv: http://arxiv.org/abs/2412.13501v1 Abstract: G...

Dec 20, 2024•21 min•Ep. 246

Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

🤗 Upvotes: 10 | cs.LG, cs.RO Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov Title: Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning Arxiv: http://arxiv.org/abs/2412.12953v1 Abstract: Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models are becoming larger to capture more complex capabilities, their computa...

Dec 20, 2024•23 min•Ep. 245

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

🤗 Upvotes: 10 | cs.CV Authors: Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang Title: Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation Arxiv: http://arxiv.org/abs/2412.14015v1 Abstract: Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new pa...

Dec 20, 2024•21 min•Ep. 244

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

🤗 Upvotes: 9 | cs.CV Authors: Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie Title: Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Arxiv: http://arxiv.org/abs/2412.14171v1 Abstract: Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We presen...

Dec 20, 2024•21 min•Ep. 243

Are Your LLMs Capable of Stable Reasoning?

🤗 Upvotes: 61 | cs.AI, cs.CL Authors: Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen Title: Are Your LLMs Capable of Stable Reasoning? Arxiv: http://arxiv.org/abs/2412.13147v2 Abstract: The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap a...

Dec 19, 2024•24 min•Ep. 242

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

🤗 Upvotes: 29 | cs.AI, cs.CL, cs.CV Authors: YiFan Zhang, Shanglin Lei, Runqi Qiao, Zhuoma GongQue, Xiaoshuai Song, Guanting Dong, Qiuna Tan, Zhe Wei, Peiqing Yang, Ye Tian, Yadong Xue, Xiaofei Wang, Honggang Zhang Title: Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models Arxiv: http://arxiv.org/abs/2412.12606v1 Abstract: The rapidly developing field of large multimodal models (LMMs) has led to the emergence of diverse models with remarkable capabilit...

Dec 19, 2024•23 min•Ep. 241

OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain

🤗 Upvotes: 29 | cs.CL Authors: Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen Title: OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain Arxiv: http://arxiv.org/abs/2412.13018v1 Abstract: As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidi...

Dec 19, 2024•23 min•Ep. 240

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

🤗 Upvotes: 21 | cs.CL Authors: Jeffrey Cheng, Benjamin Van Durme Title: Compressed Chain of Thought: Efficient Reasoning Through Dense Representations Arxiv: http://arxiv.org/abs/2412.13171v1 Abstract: Chain-of-thought (CoT) decoding enables language models to improve reasoning performance at the cost of high generation latency in decoding. Recent proposals have explored variants of contemplation tokens, a term we introduce that refers to special tokens used during inference to allow for extra ...

Dec 19, 2024•23 min•Ep. 239

Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers

🤗 Upvotes: 9 | cs.CL, cs.AI, cs.LG Authors: Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal Title: Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers Arxiv: http://arxiv.org/abs/2412.12276v2 Abstract: Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which begs the question of ...

Dec 19, 2024•23 min•Ep. 238

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

🤗 Upvotes: 7 | cs.CV Authors: Mark Endo, Xiaohan Wang, Serena Yeung-Levy Title: Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration Arxiv: http://arxiv.org/abs/2412.13180v1 Abstract: Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual t...

Dec 19, 2024•21 min•Ep. 237

Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CV Authors: Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li Title: Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents Arxiv: http://arxiv.org/abs/2412.13194v1 Abstract: The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the gener...

Dec 19, 2024•24 min•Ep. 236

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

🤗 Upvotes: 4 | cs.CL Authors: Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha Title: VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation Arxiv: http://arxiv.org/abs/2412.10704v1 Abstract: Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first c...

Dec 19, 2024•23 min•Ep. 235

SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

🤗 Upvotes: 2 | cs.CV Authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun Title: SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner Arxiv: http://arxiv.org/abs/2412.10533v1 Abstract: We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes such as style and motion specified by u...

Dec 19, 2024•20 min•Ep. 234

Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

🤗 Upvotes: 2 | cs.CV, cs.LG Authors: Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov Title: Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion Arxiv: http://arxiv.org/abs/2412.13389v1 Abstract: Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings and tend to struggle when appli...

Dec 19, 2024•21 min•Ep. 233

Byte Latent Transformer: Patches Scale Better Than Tokens

🤗 Upvotes: 39 | cs.CL Authors: Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer Title: Byte Latent Transformer: Patches Scale Better Than Tokens Arxiv: http://arxiv.org/abs/2412.09871v1 Abstract: We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performanc...

Dec 18, 2024•25 min•Ep. 232
