Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

🤗 Paper Upvotes: 25 | cs.CL Authors: Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin Title: Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models Arxiv: http://arxiv.org/abs/2411.04996v1 Abstract: The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Tr...

Nov 09, 2024•25 min•Ep. 51

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

🤗 Paper Upvotes: 20 | cs.CV Authors: Wenhao Wang, Yi Yang Title: TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation Arxiv: http://arxiv.org/abs/2411.04709v1 Abstract: Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image pro...

Nov 09, 2024•25 min•Ep. 50

Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model

🤗 Paper Upvotes: 15 | cs.CL Authors: Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Ho-Jin Choi Title: Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model Arxiv: http://arxiv.org/abs/2411.04496v1 Abstract: To increase social bonding with interlocutors, humans naturally acquire the ability to respond appropriately in a given situation by considering which conversational skill is most suitable for the response - a process we call skill-of-mind. For la...

Nov 09, 2024•23 min•Ep. 49

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

🤗 Paper Upvotes: 14 | cs.CL Authors: Jonathan Roberts, Kai Han, Samuel Albanie Title: Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? Arxiv: http://arxiv.org/abs/2411.05000v1 Abstract: As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant informa...

Nov 09, 2024•22 min•Ep. 48

DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

🤗 Paper Upvotes: 12 | cs.RO, cs.LG Authors: Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muhammad Mahi Shafiullah, Lerrel Pinto Title: DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation Arxiv: http://arxiv.org/abs/2411.04999v1 Abstract: Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current sy...

Nov 09, 2024•21 min•Ep. 47

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

🤗 Paper Upvotes: 12 | cs.CV Authors: Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan Title: VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos Arxiv: http://arxiv.org/abs/2411.04923v1 Abstract: Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel...

Nov 09, 2024•28 min•Ep. 46

Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

🤗 Paper Upvotes: 33 | cs.CV, cs.AI, cs.CL, cs.MM Authors: Dingjie Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang Title: Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination Arxiv: http://arxiv.org/abs/2411.03823v1 Abstract: The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance e...

Nov 08, 2024•24 min•Ep. 45

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

🤗 Paper Upvotes: 26 | cs.LG, cs.AI Authors: Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang Title: Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level Arxiv: http://arxiv.org/abs/2411.03562v1 Abstract...

Nov 08, 2024•20 min•Ep. 44

Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models

🤗 Paper Upvotes: 10 | cs.CL, cs.AI, cs.LG Authors: Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma Title: Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models Arxiv: http://arxiv.org/abs/2411.03884v1 Abstract: Transformers have found extensive applications across various domains due to the powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in t...

Nov 08, 2024•23 min•Ep. 43

Self-Consistency Preference Optimization

🤗 Paper Upvotes: 5 | cs.CL, cs.AI, cs.LG Authors: Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu Title: Self-Consistency Preference Optimization Arxiv: http://arxiv.org/abs/2411.04109v1 Abstract: Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the dif...

Nov 08, 2024•21 min•Ep. 42

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

🤗 Paper Upvotes: 3 | cs.CL Authors: Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz Title: From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond Arxiv: http://arxiv.org/abs/2411.03590v1 Abstract: Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused t...

Nov 08, 2024•17 min•Ep. 41

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

🤗 Paper Upvotes: 34 | cs.IR Authors: Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen Title: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems Arxiv: http://arxiv.org/abs/2411.02959v1 Abstract: Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial systems such as Cha...

Nov 07, 2024•21 min•Ep. 40

LLaMo: Large Language Model-based Molecular Graph Assistant

🤗 Paper Upvotes: 13 | cs.LG, cs.AI, q-bio.MN Authors: Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim Title: LLaMo: Large Language Model-based Molecular Graph Assistant Arxiv: http://arxiv.org/abs/2411.00871v1 Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization and instruction-following capabilities with instruction tuning. The advancements in LLMs and instruction tuning have led to the development of Large Vision-Language Models (LVLMs). However, the compe...

Nov 07, 2024•25 min•Ep. 39

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

🤗 Paper Upvotes: 10 | cs.RO, cs.AI, cs.LG Authors: Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang Title: DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution Arxiv: http://arxiv.org/abs/2411.02359v1 Abstract: MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM prof...

Nov 07, 2024•19 min•Ep. 38

Controlling Language and Diffusion Models by Transporting Activations

🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL, cs.CV, 68T07, 49Q22, I.2.6; I.2.7; I.4.8 Authors: Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau Title: Controlling Language and Diffusion Models by Transporting Activations Arxiv: http://arxiv.org/abs/2410.23054v1 Abstract: The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To ...

Nov 07, 2024•23 min•Ep. 37

Sample-Efficient Alignment for LLMs

🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL Authors: Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin Title: Sample-Efficient Alignment for LLMs Arxiv: http://arxiv.org/abs/2411.01493v1 Abstract: We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the frame of contextual dueling bandits. This formulation, subsuming recent paradigms such as online RLHF and online DPO, inh...

Nov 07, 2024•21 min•Ep. 36

DreamPolish: Domain Score Distillation With Progressive Geometry Generation

🤗 Paper Upvotes: 6 | cs.CV, cs.AI Authors: Yean Cheng, Ziqi Cai, Ming Ding, Wendi Zheng, Shiyu Huang, Yuxiao Dong, Jie Tang, Boxin Shi Title: DreamPolish: Domain Score Distillation With Progressive Geometry Generation Arxiv: http://arxiv.org/abs/2411.01602v1 Abstract: We introduce DreamPolish, a text-to-3D generation model that excels in producing refined geometry and high-quality textures. In the geometry construction phase, our approach leverages multiple neural representations to enhance the...

Nov 07, 2024•18 min•Ep. 35

Adaptive Length Image Tokenization via Recurrent Allocation

🤗 Paper Upvotes: 4 | cs.CV, cs.AI, cs.LG, cs.RO Authors: Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman Title: Adaptive Length Image Tokenization via Recurrent Allocation Arxiv: http://arxiv.org/abs/2411.02393v1 Abstract: Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entr...

Nov 07, 2024•21 min•Ep. 34

GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

🤗 Paper Upvotes: 3 | cs.CV, cs.GR Authors: Zhongjin Luo, Haolin Liu, Chenghong Li, Wanghao Du, Zirong Jin, Wanhu Sun, Yinyu Nie, Weikai Chen, Xiaoguang Han Title: GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details Arxiv: http://arxiv.org/abs/2411.03047v1 Abstract: Neural implicit functions have brought impressive advances to the state-of-the-art of clothed human digitization from multiple or even single images. However, de...

Nov 07, 2024•19 min•Ep. 33

Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge

🤗 Paper Upvotes: 3 | cs.CL Authors: Karthik Soman, Andrew Langdon, Catalina Villouta, Chinmay Agrawal, Lashaw Salta, Braian Peetoom, Gianmarco Bellucci, Orion J Buske Title: Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge Arxiv: http://arxiv.org/abs/2411.02657v1 Abstract: Rare diseases present unique challenges in healthcare, often suffering from delayed diagnosis and fragmented information landscapes. The scarcity of reliable knowledge in these condit...

Nov 07, 2024•26 min•Ep. 32

Inference Optimal VLMs Need Only One Visual Token but Larger Models

🤗 Paper Upvotes: 2 | cs.CV, cs.AI, cs.LG Authors: Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter Title: Inference Optimal VLMs Need Only One Visual Token but Larger Models Arxiv: http://arxiv.org/abs/2411.03312v1 Abstract: Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high latency during inference due to substantial compute required to process th...

Nov 07, 2024•22 min•Ep. 31

AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

🤗 Paper Upvotes: 40 | cs.AI Authors: Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong Title: AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents Arxiv: http://arxiv.org/abs/2410.24024v2 Abstract: Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have been recently a frequently-mentioned interaction method. However, existing studies for tr...

Nov 06, 2024•23 min•Ep. 30

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

🤗 Paper Upvotes: 28 | cs.LG, cs.AI Authors: Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh Title: "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization Arxiv: http://arxiv.org/abs/2411.02355v1 Abstract: Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a compreh...

Nov 06, 2024•25 min•Ep. 29

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

🤗 Paper Upvotes: 25 | cs.CL Authors: Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong Title: WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning Arxiv: http://arxiv.org/abs/2411.02337v1 Abstract: Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely ...

Nov 06, 2024•22 min•Ep. 28

MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D

🤗 Paper Upvotes: 20 | cs.CV Authors: Wei Cheng, Juncheng Mu, Xianfang Zeng, Xin Chen, Anqi Pang, Chi Zhang, Zhibin Wang, Bin Fu, Gang Yu, Ziwei Liu, Liang Pan Title: MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D Arxiv: http://arxiv.org/abs/2411.02336v1 Abstract: Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often ...

Nov 06, 2024•21 min•Ep. 27

Training-free Regional Prompting for Diffusion Transformers

🤗 Paper Upvotes: 19 | cs.CV Authors: Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang Title: Training-free Regional Prompting for Diffusion Transformers Arxiv: http://arxiv.org/abs/2411.02395v1 Abstract: Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, exi...

Nov 06, 2024•17 min•Ep. 26

How Far is Video Generation from World Model: A Physical Law Perspective

🤗 Paper Upvotes: 19 | cs.CV, cs.AI Authors: Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng Title: How Far is Video Generation from World Model: A Physical Law Perspective Arxiv: http://arxiv.org/abs/2411.02385v1 Abstract: OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without hum...

Nov 06, 2024•23 min•Ep. 25

Survey of Cultural Awareness in Language Models: Text and Beyond

🤗 Paper Upvotes: 19 | cs.CL, cs.CV Authors: Siddhesh Pawar, Junyeong Park, Jiho Jin, Arnav Arora, Junho Myung, Srishti Yadav, Faiz Ghifari Haznitrama, Inhwa Song, Alice Oh, Isabelle Augenstein Title: Survey of Cultural Awareness in Language Models: Text and Beyond Arxiv: http://arxiv.org/abs/2411.00860v1 Abstract: Large-scale deployment of large language models (LLMs) in various applications, such as chatbots and virtual assistants, requires LLMs to be culturally sensitive to the user to ensure...

Nov 06, 2024•24 min•Ep. 24

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

🤗 Paper Upvotes: 16 | cs.CL, cs.AI Authors: Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Ch...

Nov 06, 2024•18 min•Ep. 23

GenXD: Generating Any 3D and 4D Scenes

🤗 Paper Upvotes: 13 | cs.CV, cs.AI Authors: Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang Title: GenXD: Generating Any 3D and 4D Scenes Arxiv: http://arxiv.org/abs/2411.02319v2 Abstract: Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we ...

Nov 06, 2024•22 min•Ep. 22


Hosted on Transistor
