Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and the audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcasts: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL
Authors: Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang
Title: DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Arxiv: http://arxiv.org/abs/2411.00836v1
Abstract: The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to si...

Nov 06, 2024•19 min•Ep. 21

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

🤗 Paper Upvotes: 32 | cs.CL, cs.CV, cs.HC
Authors: Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
Title: OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Arxiv: http://arxiv.org/abs/2410.23218v1
Abstract: Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant...

Nov 05, 2024•20 min•Ep. 20

Personalization of Large Language Models: A Survey

🤗 Paper Upvotes: 14 | cs.CL
Authors: Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen Ahmed, Yu Wang
Title: Personalization of Large Language Models: A Survey
Arxiv: http://arxiv.org/abs/2411.00027v1
Abstract: Personalization of Large Language Models (LLMs) has recently become increasi...

Nov 05, 2024•26 min•Ep. 19

Constant Acceleration Flow

🤗 Paper Upvotes: 14 | cs.LG, cs.AI, cs.CV
Authors: Dogyun Park, Sojin Lee, Sihyeon Kim, Taehoon Lee, Youngjoon Hong, Hyunwoo J. Kim
Title: Constant Acceleration Flow
Arxiv: http://arxiv.org/abs/2411.00322v1
Abstract: Rectified flow and reflow procedures have significantly advanced fast generation by progressively straightening ordinary differential equation (ODE) flows. They operate under the assumption that image and noise pairs, known as couplings, can be approximated by straight trajectories...

Nov 05, 2024•21 min•Ep. 18

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL
Authors: Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
Title: TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Arxiv: http://arxiv.org/abs/2410.23266v1
Abstract: Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do ...

Nov 05, 2024•24 min•Ep. 17

Randomized Autoregressive Visual Generation

🤗 Paper Upvotes: 10 | cs.CV
Authors: Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
Title: Randomized Autoregressive Visual Generation
Arxiv: http://arxiv.org/abs/2411.00776v1
Abstract: This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive train...

Nov 05, 2024•20 min•Ep. 16

Survey of User Interface Design and Interaction Techniques in Generative AI Applications

🤗 Paper Upvotes: 8 | cs.HC, cs.AI, cs.CL, cs.LG
Authors: Reuben Luera, Ryan A. Rossi, Alexa Siu, Franck Dernoncourt, Tong Yu, Sungchul Kim, Ruiyi Zhang, Xiang Chen, Hanieh Salehy, Jian Zhao, Samyadeep Basu, Puneet Mathur, Nedim Lipka
Title: Survey of User Interface Design and Interaction Techniques in Generative AI Applications
Arxiv: http://arxiv.org/abs/2410.22370v1
Abstract: The applications of generative AI have become extremely impressive, and the interplay between users and AI is even mor...

Nov 05, 2024•24 min•Ep. 15

Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL, I.2.6; I.2.7
Authors: Bohan Lyu, Yadi Cao, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu
Title: Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Arxiv: http://arxiv.org/abs/2411.00412v1
Abstract: Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but often produce hallucinations for complex ones. While integrating LLMs with tool...

Nov 05, 2024•21 min•Ep. 14

In-Context LoRA for Diffusion Transformers

🤗 Paper Upvotes: 7 | cs.CV, cs.GR
Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou
Title: In-Context LoRA for Diffusion Transformers
Arxiv: http://arxiv.org/abs/2410.23775v2
Abstract: Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity...

Nov 05, 2024•20 min•Ep. 13

Physics in Next-token Prediction

🤗 Paper Upvotes: 7 | cs.LG, cs.AI
Authors: Hongjun An, Yiliang Song, Xuelong Li
Title: Physics in Next-token Prediction
Arxiv: http://arxiv.org/abs/2411.00660v1
Abstract: We discovered the underlying physics in Next-token Prediction (NTP). We identified the law of information conservation within NTP and proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is fundamentally a process of information transfer. We a...

Nov 05, 2024•19 min•Ep. 12

CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes

🤗 Paper Upvotes: 5 | cs.CV
Authors: Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, Zhaoxiang Zhang
Title: CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes
Arxiv: http://arxiv.org/abs/2411.00771v1
Abstract: Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, manifesting efficient and high-fidelity novel view synthesis. However, accurately representing surfaces, especially in large and complex scenarios, remains a...

Nov 05, 2024•20 min•Ep. 11

Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

🤗 Daily Paper Upvotes: 57
Authors: Viacheslav Surkov, Chris Wendler, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre
Categories: cs.LG, cs.AI, cs.CV
Arxiv: http://arxiv.org/abs/2410.22366v1
Title: Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Abstract: Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large-language models (LLMs). For LLMs, they have been shown to decompose intermediate representations tha...

Nov 03, 2024•23 min•Ep. 10

What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

🤗 Daily Paper Upvotes: 45
Authors: Ming Li, Yanhong Li, Tianyi Zhou
Categories: cs.CL, cs.AI, cs.LG
Arxiv: http://arxiv.org/abs/2410.23743v1
Title: What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
Abstract: What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs), through the lens of gradient, when training with different responses and initial models. We are specific...

Nov 03, 2024•21 min•Ep. 9

A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents

🤗 Daily Paper Upvotes: 20
Authors: Ankan Mullick, Sombit Bose, Abhilash Nandy, Gajula Sai Chaitanya, Pawan Goyal
Categories: cs.CL, cs.IR
Arxiv: http://arxiv.org/abs/2410.22476v1
Title: A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents
Abstract: In task-oriented dialogue systems, intent detection is crucial for interpreting user queries and providing appropriate responses. Existing research primarily addresses simple queries with a single int...

Nov 03, 2024•22 min•Ep. 8

Language Models can Self-Lengthen to Generate Long Texts

🤗 Daily Paper Upvotes: 14
Authors: Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, Junyang Lin
Categories: cs.CL
Arxiv: http://arxiv.org/abs/2410.23933v1
Title: Language Models can Self-Lengthen to Generate Long Texts
Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to process long contexts, yet a notable gap remains in generating long, aligned outputs. This limitation ste...

Nov 03, 2024•20 min•Ep. 7

Constraint Back-translation Improves Complex Instruction Following of Large Language Models

🤗 Daily Paper Upvotes: 12
Authors: Yunjia Qi, Hao Peng, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Categories: cs.CL, cs.AI
Arxiv: http://arxiv.org/abs/2410.24175v1
Title: Constraint Back-translation Improves Complex Instruction Following of Large Language Models
Abstract: Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc. Following the conventional instruction-tuning practice, previous works conduct post-training on complex instruction-r...

Nov 03, 2024•20 min•Ep. 6

BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

🤗 Daily Paper Upvotes: 11
Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu
Categories: cs.CL, cs.AI, cs.CV, cs.LG
Arxiv: http://arxiv.org/abs/2410.23918v1
Title: BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM c...

Nov 03, 2024•18 min•Ep. 5

SelfCodeAlign: Self-Alignment for Code Generation

🤗 Daily Paper Upvotes: 11
Authors: Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang
Categories: cs.CL, cs.LG, cs.SE
Arxiv: http://arxiv.org/abs/2410.24198v1
Title: SelfCodeAlign: Self-Alignment for Code Generation
Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeA...

Nov 03, 2024•19 min•Ep. 4

Learning Video Representations without Natural Videos

🤗 Daily Paper Upvotes: 10
Authors: Xueyang Yu, Xinlei Chen, Yossi Gandelsman
Categories: cs.CV
Arxiv: http://arxiv.org/abs/2410.24213v1
Title: Learning Video Representations without Natural Videos
Abstract: In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes, that model a growing set of natural v...

Nov 03, 2024•23 min•Ep. 3

AAAR-1.0: Assessing AI's Potential to Assist Research

🤗 Daily Paper Upvotes: 10
Authors: Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin
Categories: cs.CL
Arxiv: http://arxiv.org/abs/2410.22394v1
Title: AAAR-1.0: Assessing AI's Potential to Assist Research
Abstract: Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facili...

Nov 03, 2024•22 min•Ep. 2

BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays

🤗 Daily Paper Upvotes: 7
Authors: Yang Zhou, Tan Li Hui Faith, Yanyu Xu, Sicong Leng, Xinxing Xu, Yong Liu, Rick Siow Mong Goh
Categories: cs.CV
Arxiv: http://arxiv.org/abs/2410.21969v1
Title: BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays
Abstract: Medical Vision-Language Pretraining (MedVLP) shows promise in learning generalizable and transferable visual representations from paired and unpaired medical images and reports. MedVLP can provide usefu...

Nov 03, 2024•22 min•Ep. 1
