Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day, covering 10 AI research papers. Both the podcast scripts and the audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

🤗 Upvotes: 13 | cs.CV, cs.AI, cs.GR
Authors: Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu
Title: Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Arxiv: http://arxiv.org/abs/2501.03847v1
Abstract: Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation proces...

Jan 09, 2025•23 min•Ep. 351

OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

🤗 Upvotes: 10 | cs.CL, cs.CV
Authors: Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang
Title: OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
Arxiv: http://arxiv.org/abs/2501.04561v1
Abstract: Recent advancements in omnimodal learning have been achieved in understanding and generation across imag...

Jan 09, 2025•21 min•Ep. 350

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

🤗 Upvotes: 10 | cs.AI, cs.CL
Authors: Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
Title: PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Arxiv: http://arxiv.org/abs/2501.03936v1
Abstract: Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence. Existing methods primarily focus on improving and evaluating the content qual...

Jan 09, 2025•22 min•Ep. 349

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

🤗 Upvotes: 6 | cs.CL, cs.AI
Authors: Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou
Title: Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Arxiv: http://arxiv.org/abs/2501.02790v1
Abstract: Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequ...

Jan 09, 2025•23 min•Ep. 348

MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting

🤗 Upvotes: 6 | cs.CV
Authors: Sangwoon Kwak, Joonsoo Kim, Jun Young Jeong, Won-Sik Cheong, Jihyong Oh, Munchurl Kim
Title: MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
Arxiv: http://arxiv.org/abs/2501.03714v1
Abstract: 3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts focused on adapting it for dynamic scenes. Despite delivering remarkable rende...

Jan 09, 2025•21 min•Ep. 347

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

🤗 Upvotes: 38 | cs.CV
Authors: Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai
Title: STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Arxiv: http://arxiv.org/abs/2501.02976v1
Abstract: Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consiste...

Jan 08, 2025•22 min•Ep. 346

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

🤗 Upvotes: 23 | cs.CV
Authors: Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Arxiv: http://arxiv.org/abs/2501.03218v1
Abstract: Active Real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processi...

Jan 08, 2025•27 min•Ep. 345

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG
Authors: Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Arxiv: http://arxiv.org/abs/2501.03226v1
Abstract: Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context lea...

Jan 08, 2025•22 min•Ep. 344

Personalized Graph-Based Retrieval for Large Language Models

🤗 Upvotes: 19 | cs.CL
Authors: Steven Au, Cameron J. Dimacali, Ojasmitha Pedirappagari, Namyong Park, Franck Dernoncourt, Yu Wang, Nikos Kanakaris, Hanieh Deilamsalehy, Ryan A. Rossi, Nesreen K. Ahmed
Title: Personalized Graph-Based Retrieval for Large Language Models
Arxiv: http://arxiv.org/abs/2501.02157v1
Abstract: As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing ...

Jan 08, 2025•21 min•Ep. 343

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

🤗 Upvotes: 13 | q-bio.GN, cs.AI, cs.CL, cs.LG
Authors: Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger
Title: METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
Arxiv: http://arxiv.org/abs/2501.02045v1
Abstract: We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over ...

Jan 08, 2025•22 min•Ep. 342

GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

🤗 Upvotes: 12 | cs.CV
Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
Title: GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
Arxiv: http://arxiv.org/abs/2501.02690v1
Abstract: 4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Di...

Jan 08, 2025•22 min•Ep. 341

Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

🤗 Upvotes: 12 | cs.CV, cs.AI, cs.LG
Authors: Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak
Title: Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
Arxiv: http://arxiv.org/abs/2501.03059v1
Abstract: We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic...

Jan 08, 2025•22 min•Ep. 340

TransPixar: Advancing Text-to-Video Generation with Transparency

🤗 Upvotes: 9 | cs.CV
Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
Title: TransPixar: Advancing Text-to-Video Generation with Transparency
Arxiv: http://arxiv.org/abs/2501.03006v1
Abstract: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to ...

Jan 08, 2025•23 min•Ep. 339

AutoPresent: Designing Structured Visuals from Scratch

🤗 Upvotes: 7 | cs.CV, cs.CL
Authors: Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell
Title: AutoPresent: Designing Structured Visuals from Scratch
Arxiv: http://arxiv.org/abs/2501.00912v1
Abstract: Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills. In this work, we tackle the challen...

Jan 08, 2025•19 min•Ep. 338

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

🤗 Upvotes: 41 | cs.RO, cs.CV, cs.LG
Authors: Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren
Title: EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
Arxiv: http://arxiv.org/abs/2501.01895v1
Abstract: We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bi...

Jan 07, 2025•25 min•Ep. 337

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

🤗 Upvotes: 23 | cs.CV, cs.SD, eess.AS
Authors: Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
Title: VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Arxiv: http://arxiv.org/abs/2501.01957v1
Abstract: Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on th...

Jan 07, 2025•21 min•Ep. 336

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

🤗 Upvotes: 12 | cs.CV
Authors: Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong
Title: VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
Arxiv: http://arxiv.org/abs/2412.21059v1
Abstract: We present a general strategy to aligning visual gene...

Jan 07, 2025•23 min•Ep. 335

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

🤗 Upvotes: 12 | cs.CV, cs.AI
Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Title: Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Arxiv: http://arxiv.org/abs/2501.01904v1
Abstract: Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this ca...

Jan 07, 2025•23 min•Ep. 334

SDPO: Segment-Level Direct Preference Optimization for Social Agents

🤗 Upvotes: 10 | cs.AI, cs.CL
Authors: Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
Title: SDPO: Segment-Level Direct Preference Optimization for Social Agents
Arxiv: http://arxiv.org/abs/2501.01821v1
Abstract: Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in al...

Jan 07, 2025•20 min•Ep. 333

Graph Generative Pre-trained Transformer

🤗 Upvotes: 9 | cs.LG, cs.AI
Authors: Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu
Title: Graph Generative Pre-trained Transformer
Arxiv: http://arxiv.org/abs/2501.01073v1
Abstract: Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this wo...

Jan 07, 2025•20 min•Ep. 332

LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

🤗 Upvotes: 7 | cs.CL, cs.IR
Authors: Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
Title: LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
Arxiv: http://arxiv.org/abs/2501.00874v1
Abstract: Recent advancements in large language models (LLMs) based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. Howev...

Jan 07, 2025•23 min•Ep. 331

BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

🤗 Upvotes: 5 | cs.LG, cs.AI
Authors: Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman
Title: BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
Arxiv: http://arxiv.org/abs/2501.01540v1
Abstract: Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based o...

Jan 07, 2025•26 min•Ep. 330

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

🤗 Upvotes: 45 | cs.CV, cs.CL, cs.LG
Authors: Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
Title: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Arxiv: http://arxiv.org/abs/2501.00958v1
Abstract: Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. However, such existing datasets are crawled from webpage, facing c...

Jan 04, 2025•24 min•Ep. 329

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

🤗 Upvotes: 30 | cs.CL
Authors: Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin
Title: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Arxiv: http://arxiv.org/abs/2501.01257v1
Abstract: With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthrough...

Jan 04, 2025•24 min•Ep. 328

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

🤗 Upvotes: 30 | cs.CV
Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
Title: VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Arxiv: http://arxiv.org/abs/2501.01427v1
Abstract: Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In t...

Jan 04, 2025•19 min•Ep. 327

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

🤗 Upvotes: 25 | cs.CV, cs.LG
Authors: Jingfeng Yao, Xinggang Wang
Title: Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Arxiv: http://arxiv.org/abs/2501.01423v1
Abstract: Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requi...

Jan 04, 2025•25 min•Ep. 326

ProgCo: Program Helps Self-Correction of Large Language Models

🤗 Upvotes: 17 | cs.CL, cs.AI, cs.LG
Authors: Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng
Title: ProgCo: Program Helps Self-Correction of Large Language Models
Arxiv: http://arxiv.org/abs/2501.01264v1
Abstract: Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further misleading refinement and lea...

Jan 04, 2025•20 min•Ep. 325

MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

🤗 Upvotes: 16 | cs.CL
Authors: Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez
Title: MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
Arxiv: http://arxiv.org/abs/2501.00316v1
Abstract: Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location or map-based reasoning - whi...

Jan 04, 2025•26 min•Ep. 324

A3: Android Agent Arena for Mobile GUI Agents

🤗 Upvotes: 15 | cs.AI
Authors: Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li
Title: A3: Android Agent Arena for Mobile GUI Agents
Arxiv: http://arxiv.org/abs/2501.01149v1
Abstract: AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous st...

Jan 04, 2025•24 min•Ep. 323

MLLM-as-a-Judge for Image Safety without Human Labeling

🤗 Upvotes: 14 | cs.CV, cs.CL, cs.CY, cs.LG
Authors: Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain
Title: MLLM-as-a-Judge for Image Safety without Human Labeling
Arxiv: http://arxiv.org/abs/2501.00192v1
Abstract: Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age...

Jan 04, 2025•22 min•Ep. 322

Hosted on Transistor
