
Daily Paper Cast

Jingwen Liang, Gengyu Wang | dailypapercast.transistor.fm
We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com
Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com
Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236
Cover Image by Kawen Kuang https://kawen.art

Episodes

FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

🤗 Upvotes: 43 | cs.CL, cs.GR, cs.MA Authors: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang Title: FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces Arxiv: http://arxiv.org/abs/2501.12909v1 Abstract: Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in au...

Jan 24, 2025 | 25 min | Ep. 411

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

🤗 Upvotes: 42 | cs.CL Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng Title: Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback Arxiv: http://arxiv.org/abs/2501.12895v1 Abstract: Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preference...

Jan 24, 2025 | 23 min | Ep. 410

Kimi k1.5: Scaling Reinforcement Learning with LLMs

🤗 Upvotes: 39 | cs.AI, cs.LG Authors: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Z...

Jan 24, 2025 | 19 min | Ep. 409

Autonomy-of-Experts Models

🤗 Upvotes: 31 | cs.CL, cs.AI, cs.LG Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan Title: Autonomy-of-Experts Models Arxiv: http://arxiv.org/abs/2501.13074v1 Abstract: Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet...

Jan 24, 2025 | 20 min | Ep. 408

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

🤗 Upvotes: 13 | cs.CL Authors: Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao Title: O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning Arxiv: http://arxiv.org/abs/2501.12570v1 Abstract: Recently, long-thought reasoning LLMs, such as OpenAI's O1, adopt extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities a...

Jan 24, 2025 | 23 min | Ep. 407

Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament

🤗 Upvotes: 13 | cs.CL Authors: Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li Title: Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament Arxiv: http://arxiv.org/abs/2501.13007v1 Abstract: Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting the...

Jan 24, 2025 | 22 min | Ep. 406

IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

🤗 Upvotes: 7 | cs.CL, cs.AI, cs.LG Authors: Elad Levi, Ilan Kadar Title: IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems Arxiv: http://arxiv.org/abs/2501.11067v1 Abstract: Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, ...

Jan 24, 2025 | 25 min | Ep. 405

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

🤗 Upvotes: 3 | cs.CV, cs.AI, cs.GR, cs.RO Authors: Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli Title: Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass Arxiv: http://arxiv.org/abs/2501.13928v1 Abstract: Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading met...

Jan 24, 2025 | 22 min | Ep. 404

Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

🤗 Upvotes: 61 | cs.AI Authors: Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen Title: Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training Arxiv: http://arxiv.org/abs/2501.11425v1 Abstract: Large Language Models (LLMs) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in rea...

Jan 23, 2025 | 21 min | Ep. 403

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

🤗 Upvotes: 59 | cs.CV, cs.AI, cs.CL Authors: Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan Title: MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Arxiv: http://arxiv.org/abs/2501.12380v1 Abstract: We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluatin...

Jan 23, 2025 | 25 min | Ep. 402

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

🤗 Upvotes: 51 | cs.LG, cs.CL Authors: Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin Title: Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models Arxiv: http://arxiv.org/abs/2501.11873v1 Abstract: This paper revisits the implementation of $\textbf{L}$oad-$\textbf{b}$alancing $\textbf{L}$oss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs i...

Jan 23, 2025 | 24 min | Ep. 401

TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

🤗 Upvotes: 32 | cs.CV Authors: Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel Title: TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space Arxiv: http://arxiv.org/abs/2501.12224v1 Abstract: We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as ...

Jan 23, 2025 | 26 min | Ep. 400

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

🤗 Upvotes: 31 | cs.AI, cs.CL, cs.CV, cs.HC Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi Title: UI-TARS: Pioneering Automat...

Jan 23, 2025 | 21 min | Ep. 399

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

🤗 Upvotes: 26 | cs.CV, cs.CL Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang Title: InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model Arxiv: http://arxiv.org/abs/2501.12368v1 Abstract: Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs)...

Jan 23, 2025 | 21 min | Ep. 398

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

🤗 Upvotes: 20 | cs.CL, cs.CV Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks Arxiv: http://arxiv.org/abs/2501.11733v1 Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and...

Jan 23, 2025 | 23 min | Ep. 397

Reasoning Language Models: A Blueprint

🤗 Upvotes: 18 | cs.AI, cs.CL Authors: Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler Title: Reasoning Language Models: A Blueprint Arxiv: http://arxiv.org/abs/2501.11223v2 Abstract: Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), s...

Jan 23, 2025 | 22 min | Ep. 396

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

🤗 Upvotes: 16 | cs.CV Authors: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, P...

Jan 23, 2025 | 21 min | Ep. 395

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

🤗 Upvotes: 15 | cs.LG, cs.AI Authors: Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık Title: Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments Arxiv: http://arxiv.org/abs/2501.10893v1 Abstract: Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks ar...

Jan 23, 2025 | 19 min | Ep. 394

GameFactory: Creating New Games with Generative Interactive Videos

🤗 Upvotes: 48 | cs.CV Authors: Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu Title: GameFactory: Creating New Games with Generative Interactive Videos Arxiv: http://arxiv.org/abs/2501.08325v1 Abstract: Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their...

Jan 22, 2025 | 23 min | Ep. 393

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

🤗 Upvotes: 8 | cs.CV Authors: Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin Title: VideoWorld: Exploring Knowledge Learning from Unlabeled Videos Arxiv: http://arxiv.org/abs/2501.09781v1 Abstract: This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model tra...

Jan 22, 2025 | 19 min | Ep. 392

SEAL: Entangled White-box Watermarks on Low-Rank Adaptation

🤗 Upvotes: 2 | cs.AI, cs.CR Authors: Giyeong Oh, Saejin Kim, Woohyun Cho, Sangkyu Lee, Jiwan Chung, Dokyung Song, Youngjae Yu Title: SEAL: Entangled White-box Watermarks on Low-Rank Adaptation Arxiv: http://arxiv.org/abs/2501.09284v2 Abstract: Recently, LoRA and its variants have become the de facto strategy for training and sharing task-specific versions of large pretrained models, thanks to their efficiency and simplicity. However, the issue of copyright protection for LoRA weights, especiall...

Jan 22, 2025 | 22 min | Ep. 391

The Lessons of Developing Process Reward Models in Mathematical Reasoning

🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin Title: The Lessons of Developing Process Reward Models in Mathematical Reasoning Arxiv: http://arxiv.org/abs/2501.07301v1 Abstract: Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the ...

Jan 15, 2025 | 19 min | Ep. 390

Tensor Product Attention Is All You Need

🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao Title: Tensor Product Attention Is All You Need Arxiv: http://arxiv.org/abs/2501.06425v1 Abstract: Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that use...

Jan 15, 2025 | 21 min | Ep. 389

$\text{Transformer}^2$: Self-adaptive LLMs

🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL Authors: Qi Sun, Edoardo Cetin, Yujin Tang Title: $\text{Transformer}^2$: Self-adaptive LLMs Arxiv: http://arxiv.org/abs/2501.06252v2 Abstract: Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce $\text{Transformer}^2$, a novel self-adaptation framework that adapts LLMs for unseen tasks in rea...

Jan 15, 2025 | 27 min | Ep. 388

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

🤗 Upvotes: 21 | cs.CL, cs.AI, cs.HC, cs.SD, eess.AS Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou Title: Mi...

Jan 15, 2025 | 24 min | Ep. 387

VideoAuteur: Towards Long Narrative Video Generation

🤗 Upvotes: 21 | cs.CV Authors: Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang Title: VideoAuteur: Towards Long Narrative Video Generation Arxiv: http://arxiv.org/abs/2501.06173v1 Abstract: Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support ...

Jan 15, 2025 | 22 min | Ep. 386

O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning

🤗 Upvotes: 18 | cs.CL Authors: Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang Title: O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning Arxiv: http://arxiv.org/abs/2501.06458v1 Abstract: Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large la...

Jan 15, 2025 | 25 min | Ep. 385

WebWalker: Benchmarking LLMs in Web Traversal

🤗 Upvotes: 16 | cs.CL, cs.AI Authors: Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang Title: WebWalker: Benchmarking LLMs in Web Traversal Arxiv: http://arxiv.org/abs/2501.07572v2 Abstract: Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handl...

Jan 15, 2025 | 24 min | Ep. 384

SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

🤗 Upvotes: 12 | cs.LG, cs.AI, cs.CL Authors: Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu Title: SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training Arxiv: http://arxiv.org/abs/2501.06842v1 Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability...

Jan 15, 2025 | 21 min | Ep. 383

UnCommon Objects in 3D

🤗 Upvotes: 8 | cs.CV, cs.AI, cs.GR Authors: Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y. Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, David Novotny Title: UnCommon Objects in 3D Arxiv: http://arxiv.org/abs/2501.07574v1 Abstract: We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of...

Jan 15, 2025 | 22 min | Ep. 382
Hosted on Transistor