Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and the audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM ⓘ

Episodes

Dynamic Scaling of Unit Tests for Code Reward Modeling

🤗 Upvotes: 13 | cs.CL, cs.SE Authors: Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang Title: Dynamic Scaling of Unit Tests for Code Reward Modeling Arxiv: http://arxiv.org/abs/2501.01054v1 Abstract: Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit te...

Jan 04, 2025•22 min•Ep. 321

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

🤗 Upvotes: 52 | cs.AI, cs.CL, cs.CV, cs.HC Authors: Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu Title: OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Arxiv: http://arxiv.org/abs/2412.19723v1 Abstract: Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer contr...

Jan 03, 2025•23 min•Ep. 320

Xmodel-2 Technical Report

🤗 Upvotes: 13 | cs.AI Authors: Wang Qun, Liu Yang, Lin Qingquan, Qu Zhijiu, Jiang Ling Title: Xmodel-2 Technical Report Arxiv: http://arxiv.org/abs/2412.19638v1 Abstract: Xmodel-2 is a 1.2-billion-parameter large language model designed specifically for reasoning tasks. Its architecture enables different model scales to share a unified set of hyperparameters, allowing for extensive experimentation on smaller models and seamless transfer of optimal configurations to larger models. To maximize tr...

Jan 03, 2025•17 min•Ep. 319

Are Vision-Language Models Truly Understanding Multi-vision Sensor?

🤗 Upvotes: 9 | cs.CV Authors: Sangyun Chung, Youngjoon Yu, Youngchae Chee, Se Yeon Kim, Byung-Kwan Lee, Yong Man Ro Title: Are Vision-Language Models Truly Understanding Multi-vision Sensor? Arxiv: http://arxiv.org/abs/2412.20750v1 Abstract: Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance in computer vision tasks. Moreover, for VLMs to be effectively utilized in real-world applications, an understanding of diverse ...

Jan 03, 2025•25 min•Ep. 318

HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving

🤗 Upvotes: 4 | cs.AI, cs.CL Authors: Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, Haitao Mi Title: HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving Arxiv: http://arxiv.org/abs/2412.20735v2 Abstract: We introduce HunyuanProver, a language model finetuned from the Hunyuan 7B for interactive automatic theorem proving with LEAN4. To alleviate the data sparsity issue, we design a scalable framework to iteratively synthesize da...

Jan 03, 2025•21 min•Ep. 317

VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

🤗 Upvotes: 2 | cs.CV Authors: Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He Title: VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control Arxiv: http://arxiv.org/abs/2412.20800v1 Abstract: While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions includi...

Jan 03, 2025•22 min•Ep. 316

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

🤗 Upvotes: 13 | cs.CL Authors: Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu Title: Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs Arxiv: http://arxiv.org/abs/2412.21187v1 Abstract: The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ ex...

Jan 02, 2025•20 min•Ep. 315

OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

🤗 Upvotes: 11 | cs.CL, cs.AI, cs.DB, cs.IR, cs.LG Authors: Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen Title: OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System Arxiv: http://arxiv.org/abs/2412.20005v1 Abstract: We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF Books, and supp...

Jan 02, 2025•19 min•Ep. 314

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

🤗 Upvotes: 39 | cs.CV Authors: Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, Errui Ding Title: Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization Arxiv: http://arxiv.org/abs/2412.18525v2 Abstract: Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transfor...

Jan 01, 2025•25 min•Ep. 313

On the Compositional Generalization of Multimodal LLMs for Medical Imaging

🤗 Upvotes: 29 | cs.CV, cs.AI, cs.CL, cs.LG Authors: Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang Title: On the Compositional Generalization of Multimodal LLMs for Medical Imaging Arxiv: http://arxiv.org/abs/2412.20070v1 Abstract: Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting ...

Jan 01, 2025•23 min•Ep. 312

Bringing Objects to Life: 4D generation from 3D objects

🤗 Upvotes: 24 | cs.CV Authors: Ohad Rahamim, Ori Malca, Dvir Samuel, Gal Chechik Title: Bringing Objects to Life: 4D generation from 3D objects Arxiv: http://arxiv.org/abs/2412.20422v1 Abstract: Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of ge...

Jan 01, 2025•22 min•Ep. 311

Efficiently Serving LLM Reasoning Programs with Certaindex

🤗 Upvotes: 20 | cs.LG, cs.CL Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang Title: Efficiently Serving LLM Reasoning Programs with Certaindex Arxiv: http://arxiv.org/abs/2412.20993v1 Abstract: The rapid evolution of large language models (LLMs) has unlocked their capabilities in advanced reasoning tasks like mathematical problem-solving, code generation, and legal analysis. Central to this progress are inference-time reasoning algorithms, which ref...

Jan 01, 2025•20 min•Ep. 310

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

🤗 Upvotes: 14 | cs.SD, cs.AI, cs.CL, eess.AS Authors: Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, Soujanya Poria Title: TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization Arxiv: http://arxiv.org/abs/2412.21037v1 Abstract: We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3...

Jan 01, 2025•21 min•Ep. 309

Edicho: Consistent Image Editing in the Wild

🤗 Upvotes: 13 | cs.CV Authors: Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, Qifeng Chen Title: Edicho: Consistent Image Editing in the Wild Arxiv: http://arxiv.org/abs/2412.21079v1 Abstract: As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on ...

Jan 01, 2025•23 min•Ep. 308

Facilitating large language model Russian adaptation with Learned Embedding Propagation

🤗 Upvotes: 6 | cs.CL, cs.AI Authors: Mikhail Tikhomirov, Daniil Chernyshev Title: Facilitating large language model Russian adaptation with Learned Embedding Propagation Arxiv: http://arxiv.org/abs/2412.21140v1 Abstract: Rapid advancements of large language model (LLM) technologies led to the introduction of powerful open-source instruction-tuned LLMs that have the same text generation quality as the state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the...

Jan 01, 2025•22 min•Ep. 307

Training Software Engineering Agents and Verifiers with SWE-Gym

🤗 Upvotes: 6 | cs.SE, cs.CL Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang Title: Training Software Engineering Agents and Verifiers with SWE-Gym Arxiv: http://arxiv.org/abs/2412.21139v1 Abstract: We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task spe...

Jan 01, 2025•27 min•Ep. 306

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

🤗 Upvotes: 5 | cs.SE, cs.CL Authors: Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang Title: HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Arxiv: http://arxiv.org/abs/2412.21199v1 Abstract: We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the ...

Jan 01, 2025•21 min•Ep. 305

Slow Perception: Let's Perceive Geometric Figures Step-by-step

🤗 Upvotes: 5 | cs.CV Authors: Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang Title: Slow Perception: Let's Perceive Geometric Figures Step-by-step Arxiv: http://arxiv.org/abs/2412.20631v1 Abstract: Recently, "visual o1" began to enter people's vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can h...

Jan 01, 2025•23 min•Ep. 304

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG Authors: Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang Title: HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs Arxiv: http://arxiv.org/abs/2412.18925v1 Abstract: The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLMs. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though dist...

Dec 31, 2024•23 min•Ep. 303

1.58-bit FLUX

🤗 Upvotes: 24 | cs.CV, cs.AI, cs.LG Authors: Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, Liang-Chieh Chen Title: 1.58-bit FLUX Arxiv: http://arxiv.org/abs/2412.18653v1 Abstract: We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantizati...

Dec 31, 2024•23 min•Ep. 302

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

🤗 Upvotes: 17 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MM, eess.AS Authors: Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, Yichi Zhang, Ruoyu Wu, Qingxiu Dong, Ge Zhang, Jian Yang, Lingwei Meng, Shujie Hu, Yulong Chen, Junyang Lin, Shuai Bai, Andreas Vlachos, Xu Tan, Minjia Zhang, Wen Xiao, Aaron Yee, Tianyu Liu, Baobao Chang Title: Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey Arxiv: http://arxiv.o...

Dec 31, 2024•18 min•Ep. 301

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

🤗 Upvotes: 11 | cs.CV Authors: Zehan Wang, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, Zhou Zhao Title: Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models Arxiv: http://arxiv.org/abs/2412.18605v1 Abstract: Orientation is a key attribute of objects, crucial for understanding their spatial pose and arrangement in images. However, practical solutions for accurate orientation estimation from a single image remain underexplored. In this work, we introduce...

Dec 31, 2024•23 min•Ep. 300

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

🤗 Upvotes: 11 | cs.CV Authors: Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang Title: Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment Arxiv: http://arxiv.org/abs/2412.19326v1 Abstract: Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals though they give comprehensive perception and reasoning in a ...

Dec 31, 2024•25 min•Ep. 299

From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

🤗 Upvotes: 11 | cs.CV Authors: Jiawei Lin, Shizhao Sun, Danqing Huang, Ting Liu, Ji Li, Jiang Bian Title: From Elements to Design: A Layered Approach for Automatic Graphic Design Composition Arxiv: http://arxiv.org/abs/2412.19712v1 Abstract: In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they only focus on certain subtasks...

Dec 31, 2024•23 min•Ep. 298

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

🤗 Upvotes: 8 | cs.CV Authors: Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li Title: VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models Arxiv: http://arxiv.org/abs/2412.19645v2 Abstract: Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject featu...

Dec 31, 2024•24 min•Ep. 297

The Superposition of Diffusion Models Using the Itô Density Estimator

🤗 Upvotes: 8 | cs.LG Authors: Marta Skreta, Lazar Atanackovic, Avishek Joey Bose, Alexander Tong, Kirill Neklyudov Title: The Superposition of Diffusion Models Using the Itô Density Estimator Arxiv: http://arxiv.org/abs/2412.17762v1 Abstract: The Cambrian explosion of easily accessible pre-trained diffusion models suggests a demand for methods that combine multiple different pre-trained diffusion models without incurring the significant computational burden of re-training a larger combined mode...

Dec 31, 2024•23 min•Ep. 296

Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

🤗 Upvotes: 6 | cs.CL Authors: Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee Title: Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging Arxiv: http://arxiv.org/abs/2412.19512v1 Abstract: Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impracti...

Dec 31, 2024•19 min•Ep. 295

CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era

🤗 Upvotes: 3 | cs.CL, cs.AI, cs.DB Authors: Yanlin Feng, Simone Papicchio, Sajjadur Rahman Title: CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era Arxiv: http://arxiv.org/abs/2412.18702v1 Abstract: Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (Edge et al., 2024). Despite decades of research on kn...

Dec 31, 2024•25 min•Ep. 294

YuLan-Mini: An Open Data-efficient Language Model

🤗 Upvotes: 27 | cs.CL Authors: Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen Title: YuLan-Mini: An Open Data-efficient Language Model Arxiv: http://arxiv.org/abs/2412.17743v2 Abstract: Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mi...

Dec 28, 2024•20 min•Ep. 293

A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression

🤗 Upvotes: 17 | cs.CL Authors: Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou Title: A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression Arxiv: http://arxiv.org/abs/2412.17483v1 Abstract: In this work, we provide a thorough investigation of gist-based context compression methods to improve long-context processing in large language models. We focus on two key questions: (1) How well can thes...

Dec 28, 2024•22 min•Ep. 292
