Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Loss-to-Loss Prediction: Scaling Laws for All Datasets

🤗 Paper Upvotes: 2 | cs.LG, cs.AI, cs.CL, stat.ML Authors: David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade Title: Loss-to-Loss Prediction: Scaling Laws for All Datasets Arxiv: http://arxiv.org/abs/2411.12925v1 Abstract: While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy ...

Nov 22, 2024•22 min•Ep. 111

ORID: Organ-Regional Information Driven Framework for Radiology Report Generation

🤗 Paper Upvotes: 2 | cs.CV Authors: Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai Title: ORID: Organ-Regional Information Driven Framework for Radiology Report Generation Arxiv: http://arxiv.org/abs/2411.13025v1 Abstract: The objective of Radiology Report Generation (RRG) is to automatically generate coherent textual analyses of diseases based on radiological images, thereby alleviating the workload of radiologists. Current AI-based methods for RRG primarily focus...

Nov 22, 2024•20 min•Ep. 110

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

🤗 Paper Upvotes: 13 | cs.CV Authors: Hongrui Jia, Chaoya Jiang, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang Title: SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization Arxiv: http://arxiv.org/abs/2411.11909v1 Abstract: As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by pre...

Nov 21, 2024•25 min•Ep. 109

Continuous Speculative Decoding for Autoregressive Image Generation

🤗 Paper Upvotes: 13 | cs.CV Authors: Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang Title: Continuous Speculative Decoding for Autoregressive Image Generation Arxiv: http://arxiv.org/abs/2411.11925v1 Abstract: Continuous-valued Autoregressive (AR) image generation models have demonstrated notable superiority over their discrete-token counterparts, showcasing considerable reconstruction quality and higher generation fidelity. However, the computational demands of the autoregre...

Nov 21, 2024•23 min•Ep. 108

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

🤗 Paper Upvotes: 11 | cs.CV Authors: M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin Title: ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements Arxiv: http://arxiv.org/abs/2411.12044v1 Abstract: Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vis...

Nov 21, 2024•19 min•Ep. 107

FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

🤗 Paper Upvotes: 10 | cs.GR, cs.CV Authors: Hmrishav Bandyopadhyay, Yi-Zhe Song Title: FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations Arxiv: http://arxiv.org/abs/2411.10818v1 Abstract: Sketch animations offer a powerful medium for visual storytelling, from simple flip-book doodles to professional studio productions. While traditional animation requires teams of skilled artists to draw key frames and in-between frames, existing automation attempts still demand significant ...

Nov 21, 2024•26 min•Ep. 106

Soft Robotic Dynamic In-Hand Pen Spinning

🤗 Paper Upvotes: 8 | cs.RO Authors: Yunchao Yao, Uksang Yoo, Jean Oh, Christopher G. Atkeson, Jeffrey Ichnowski Title: Soft Robotic Dynamic In-Hand Pen Spinning Arxiv: http://arxiv.org/abs/2411.12734v1 Abstract: Dynamic in-hand manipulation remains a challenging task for soft robotic systems that have demonstrated advantages in safe compliant interactions but struggle with high-speed dynamic tasks. In this work, we present SWIFT, a system for learning dynamic tasks using a soft and compliant ro...

Nov 21, 2024•23 min•Ep. 105

Building Trust: Foundations of Security, Safety and Transparency in AI

🤗 Paper Upvotes: 8 | cs.CY, cs.AI, cs.CL Authors: Huzaifa Sidhpurwala, Garth Mollett, Emily Fox, Mark Bestavros, Huamin Chen Title: Building Trust: Foundations of Security, Safety and Transparency in AI Arxiv: http://arxiv.org/abs/2411.12275v1 Abstract: This paper explores the rapidly evolving ecosystem of publicly available AI models, and their potential implications on the security and safety landscape. As AI models become increasingly prevalent, understanding their potential risks and vulner...

Nov 21, 2024•22 min•Ep. 104

SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning

🤗 Paper Upvotes: 5 | cs.CV Authors: Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, Bing Li, Weiming Hu Title: SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning Arxiv: http://arxiv.org/abs/2411.10161v1 Abstract: Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for overall image, but few works explore quality analysis for Regions of In...

Nov 21, 2024•21 min•Ep. 103

Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

🤗 Paper Upvotes: 3 | cs.CL, cs.AI Authors: S. Tamang, D. J. Bora Title: Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages Arxiv: http://arxiv.org/abs/2411.12240v1 Abstract: Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokeniz...

Nov 21, 2024•24 min•Ep. 102

Generative World Explorer

🤗 Paper Upvotes: 38 | cs.CV Authors: Taiming Lu, Tianmin Shu, Alan Yuille, Daniel Khashabi, Jieneng Chen Title: Generative World Explorer Arxiv: http://arxiv.org/abs/2411.11844v2 Abstract: Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can imagine unseen parts of the world thro...

Nov 20, 2024•22 min•Ep. 101

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

🤗 Paper Upvotes: 31 | cs.CV, cs.CL Authors: Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, Yi Zeng, Lei Wu, Liuyang Bian, Zhaoxiong Wang, Long Liu, Yanzhou Yang, Han Xiao, Aojun Zhou, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li Title: BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices Arxiv: http://arxiv.org/abs/2411.10640v1 Abstract: The emergence and growing popularity of multim...

Nov 20, 2024•20 min•Ep. 100

Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering

🤗 Paper Upvotes: 13 | cs.AI, cs.CL, stat.ML Authors: Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, Xianpei Han, Le Sun, Jie Lou, Bowen Yu, Yaojie Lu, Hongyu Lin Title: Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering Arxiv: http://arxiv.org/abs/2411.11504v1 Abstract: The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the...

Nov 20, 2024•21 min•Ep. 99

AnimateAnything: Consistent and Controllable Animation for Video Generation

🤗 Paper Upvotes: 12 | cs.CV Authors: Guojun Lei, Chi Wang, Hong Li, Rong Zhang, Yikai Wang, Weiwei Xu Title: AnimateAnything: Consistent and Controllable Animation for Video Generation Arxiv: http://arxiv.org/abs/2411.10836v1 Abstract: We present a unified controllable video generation approach AnimateAnything that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully ...

Nov 20, 2024•22 min•Ep. 98

Top-$nσ$: Not All Logits Are You Need

🤗 Paper Upvotes: 12 | cs.LG Authors: Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang Title: Top-$nσ$: Not All Logits Are You Need Arxiv: http://arxiv.org/abs/2411.07641v1 Abstract: Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-$n\sigma$, a novel sampling method that operates directly on pre-softmax logits by lever...

Nov 20, 2024•21 min•Ep. 97

Drowning in Documents: Consequences of Scaling Reranker Inference

🤗 Paper Upvotes: 10 | cs.IR, cs.CL, cs.LG Authors: Mathew Jacob, Erik Lindgren, Matei Zaharia, Michael Carbin, Omar Khattab, Andrew Drozdov Title: Drowning in Documents: Consequences of Scaling Reranker Inference Arxiv: http://arxiv.org/abs/2411.11767v1 Abstract: Rerankers, typically cross-encoders, are often used to re-score the documents retrieved by cheaper initial IR systems. This is because, though expensive, rerankers are assumed to be more effective. We challenge this assumption by measu...

Nov 20, 2024•22 min•Ep. 96

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

🤗 Paper Upvotes: 10 | cs.CL Authors: Thang M. Pham, Phat T. Nguyen, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Trung Bui Title: SlimLM: An Efficient Small Language Model for On-Device Document Assistance Arxiv: http://arxiv.org/abs/2411.09944v1 Abstract: While small language models (SLMs) show promises for mobile deployment, their real-world performance and applications on smartphones remains underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mo...

Nov 20, 2024•26 min•Ep. 95

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

🤗 Paper Upvotes: 8 | cs.CV Authors: Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu Title: Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts Arxiv: http://arxiv.org/abs/2411.10669v1 Abstract: As the research of Multimodal Large Language Models (MLLMs) becomes popular, an advancing MLLM model is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world app...

Nov 20, 2024•20 min•Ep. 94

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

🤗 Paper Upvotes: 8 | cs.LG Authors: Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana Title: SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers Arxiv: http://arxiv.org/abs/2411.10510v1 Abstract: Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource...

Nov 20, 2024•28 min•Ep. 93

LLäMmlein: Compact and Competitive German-Only Language Models from Scratch

🤗 Paper Upvotes: 7 | cs.CL, cs.AI, cs.LG Authors: Jan Pfister, Julia Wunderle, Andreas Hotho Title: LLäMmlein: Compact and Competitive German-Only Language Models from Scratch Arxiv: http://arxiv.org/abs/2411.11171v1 Abstract: We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessin...

Nov 20, 2024•22 min•Ep. 92

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

🤗 Paper Upvotes: 64 | cs.CV Authors: Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan Title: LLaVA-o1: Let Vision Language Models Reason Step-by-Step Arxiv: http://arxiv.org/abs/2411.10440v1 Abstract: Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured rea...

Nov 19, 2024•26 min•Ep. 91

GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation

🤗 Paper Upvotes: 19 | cs.CV, cs.AI, cs.GR Authors: Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, Chen Change Loy Title: GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation Arxiv: http://arxiv.org/abs/2411.08033v1 Abstract: While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation ...

Nov 19, 2024•24 min•Ep. 90

Xmodel-1.5: An 1B-scale Multilingual LLM

🤗 Paper Upvotes: 7 | cs.CL Authors: Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling Title: Xmodel-1.5: An 1B-scale Multilingual LLM Arxiv: http://arxiv.org/abs/2411.10083v1 Abstract: We introduce Xmodel-1.5, a novel 1-billion-parameter multilingual large model pretrained on approximately 2 trillion tokens. The model demonstrates strong performance across several languages, with particularly notable results in Thai, Arabic, and French, alongside its effectiveness in Chinese and English. In addition...

Nov 19, 2024•21 min•Ep. 89

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

🤗 Paper Upvotes: 32 | cs.LG, cs.AI, cs.CL, cs.CV, 68T05, I.3.5; I.2.10; I.2.6 Authors: Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng Title: LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Arxiv: http://arxiv.org/abs/2411.09595v1 Abstract: This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowled...

Nov 16, 2024•24 min•Ep. 88

MagicQuill: An Intelligent Interactive Image Editing System

🤗 Paper Upvotes: 31 | cs.CV Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen Title: MagicQuill: An Intelligent Interactive Image Editing System Arxiv: http://arxiv.org/abs/2411.09703v1 Abstract: Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Ou...

Nov 16, 2024•20 min•Ep. 87

Cut Your Losses in Large-Vocabulary Language Models

🤗 Paper Upvotes: 15 | cs.LG, cs.CL Authors: Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl Title: Cut Your Losses in Large-Vocabulary Language Models Arxiv: http://arxiv.org/abs/2411.09009v1 Abstract: As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries fo...

Nov 16, 2024•21 min•Ep. 86

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

🤗 Paper Upvotes: 9 | cs.CL Authors: Canyu Chen, Jian Yu, Shan Chen, Che Liu, Zhongwei Wan, Danielle Bitterman, Fei Wang, Kai Shu Title: ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction? Arxiv: http://arxiv.org/abs/2411.06469v1 Abstract: Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoos...

Nov 16, 2024•24 min•Ep. 85

Sharingan: Extract User Action Sequence from Desktop Recordings

🤗 Paper Upvotes: 3 | cs.CV, cs.AI Authors: Yanting Chen, Yi Ren, Xiaoting Qin, Jue Zhang, Kehong Yuan, Lu Han, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang Title: Sharingan: Extract User Action Sequence from Desktop Recordings Arxiv: http://arxiv.org/abs/2411.08768v1 Abstract: Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Model...

Nov 16, 2024•23 min•Ep. 84

Hermes: A Large Language Model Framework on the Journey to Autonomous Networks

🤗 Paper Upvotes: 2 | cs.AI, cs.NI Authors: Fadhel Ayed, Ali Maatouk, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, Zhi-Quan Luo Title: Hermes: A Large Language Model Framework on the Journey to Autonomous Networks Arxiv: http://arxiv.org/abs/2411.06490v1 Abstract: The drive toward automating cellular network operations has grown with the increasing complexity of these systems. Despite advancements, full autonomy currently remains out of reach due to reliance on human intervention for m...

Nov 16, 2024•22 min•Ep. 83

Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples

🤗 Paper Upvotes: 2 | cs.LG, cs.AI Authors: Noël Vouitsis, Rasa Hosseinzadeh, Brendan Leigh Ross, Valentin Villecroze, Satya Krishna Gorti, Jesse C. Cresswell, Gabriel Loaiza-Ganem Title: Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples Arxiv: http://arxiv.org/abs/2411.08954v1 Abstract: Although diffusion models can generate remarkably high-quality samples, they are intrinsically bottlenecked by their expensive iterative sampling procedure. Consistency mode...

Nov 16, 2024•22 min•Ep. 82


Hosted on Transistor
