Daily Paper Cast

Jingwen Liang, Gengyu Wang•dailypapercast.transistor.fm

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers).

Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art

Last refreshed: July 27th, 2025 at 9:35 AM

Episodes

Direct Preference Optimization Using Sparse Feature-Level Constraints

🤗 Paper Upvotes: 10 | cs.AI, cs.CL Authors: Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang Title: Direct Preference Optimization Using Sparse Feature-Level Constraints Arxiv: http://arxiv.org/abs/2411.07618v1 Abstract: The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Prefe...

Nov 15, 2024•21 min•Ep. 81

CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

🤗 Paper Upvotes: 8 | cs.CL Authors: Wissam Antoun, Francis Kulumba, Rian Touchent, Éric de la Clergerie, Benoît Sagot, Djamé Seddah Title: CamemBERT 2.0: A Smarter French Language Model Aged to Perfection Arxiv: http://arxiv.org/abs/2411.08868v1 Abstract: French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due ...

Nov 15, 2024•24 min•Ep. 80

Can sparse autoencoders be used to decompose and interpret steering vectors?

🤗 Paper Upvotes: 6 | cs.LG, cs.AI, cs.CL Authors: Harry Mayne, Yushi Yang, Adam Mahdi Title: Can sparse autoencoders be used to decompose and interpret steering vectors? Arxiv: http://arxiv.org/abs/2411.08790v1 Abstract: Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE...

Nov 15, 2024•22 min•Ep. 79

PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

🤗 Paper Upvotes: 5 | cs.AI, cs.MM, cs.SD, eess.AS Authors: Yungang Yi, Weihua Li, Matthew Kuo, Quan Bai Title: PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation Arxiv: http://arxiv.org/abs/2411.08307v1 Abstract: Music generation has progressed significantly, especially in the domain of audio generation. However, generating symbolic music that is both long-structured and expressive remains a significant challenge. In this paper, we...

Nov 15, 2024•19 min•Ep. 78

SAMPart3D: Segment Any Part in 3D Objects

🤗 Paper Upvotes: 18 | cs.CV Authors: Yunhan Yang, Yukun Huang, Yuan-Chen Guo, Liangjun Lu, Xiaoyang Wu, Edmund Y. Lam, Yan-Pei Cao, Xihui Liu Title: SAMPart3D: Segment Any Part in 3D Objects Arxiv: http://arxiv.org/abs/2411.07184v1 Abstract: 3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role in applications such as robotics, 3D generation, and 3D editing. Recent methods harness the powerful Vision Language Models (VLMs) for 2D-to-3D knowledge distillat...

Nov 14, 2024•21 min•Ep. 77

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

🤗 Paper Upvotes: 14 | cs.CV, cs.AI, cs.CL Authors: Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan Title: JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Arxiv: http://arxiv.org/abs/2411.07975v1 Abstract: We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model....

Nov 14, 2024•23 min•Ep. 76

Stronger Models are NOT Stronger Teachers for Instruction Tuning

🤗 Paper Upvotes: 13 | cs.AI, cs.CL Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran Title: Stronger Models are NOT Stronger Teachers for Instruction Tuning Arxiv: http://arxiv.org/abs/2411.07133v2 Abstract: Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs heavily rely on the instruction datasets used for tuning. Recently, synthetic in...

Nov 14, 2024•28 min•Ep. 75

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

🤗 Paper Upvotes: 11 | cs.CV, cs.AI Authors: Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu Title: BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Arxiv: http://arxiv.org/abs/2411.07461v1 Abstract: We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factu...

Nov 14, 2024•21 min•Ep. 74

Scaling Properties of Diffusion Models for Perceptual Tasks

🤗 Paper Upvotes: 7 | cs.CV, cs.AI Authors: Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik Title: Scaling Properties of Diffusion Models for Perceptual Tasks Arxiv: http://arxiv.org/abs/2411.08034v2 Abstract: In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-...

Nov 14, 2024•25 min•Ep. 73

Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings

🤗 Paper Upvotes: 5 | cs.CV, cs.AI, cs.LG Authors: Aditya Sanghi, Aliasghar Khani, Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani Title: Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings Arxiv: http://arxiv.org/abs/2411.08017v1 Abstract: Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high r...

Nov 14, 2024•22 min•Ep. 72

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

🤗 Paper Upvotes: 44 | cs.CV, cs.AI, cs.GR, cs.LG Authors: Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, Gal Chechik Title: Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models Arxiv: http://arxiv.org/abs/2411.07232v2 Abstract: Adding Object into images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Des...

Nov 13, 2024•24 min•Ep. 71

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

🤗 Paper Upvotes: 39 | cs.CV, cs.AI Authors: Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen Title: OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision Arxiv: http://arxiv.org/abs/2411.07199v1 Abstract: Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life app...

Nov 13, 2024•20 min•Ep. 70

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

🤗 Paper Upvotes: 30 | cs.CL Authors: Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Hui Huang, Weixun Wang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Xuepeng Liu, Dekai Sun, Wenbo Su, Bo Zheng Title: Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models Arxiv: http://arxiv.org/abs/2411.07140v1 Abstract: New LLM evaluation benchmarks are important to align with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, th...

Nov 13, 2024•21 min•Ep. 69

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

🤗 Paper Upvotes: 28 | cs.CL Authors: Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing Title: M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework Arxiv: http://arxiv.org/abs/2411.06176v1 Abstract: The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and dive...

Nov 13, 2024•21 min•Ep. 68

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

🤗 Paper Upvotes: 21 | cs.CV, cs.LG Authors: NVIDIA: Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei, Xiaohui Zeng, Yu Zeng, Qinsheng Zhang Title: Edify Image: High-Quality ...

Nov 13, 2024•25 min•Ep. 67

GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

🤗 Paper Upvotes: 18 | cs.SE, cs.LG Authors: Nizar Islah, Justine Gehring, Diganta Misra, Eilif Muller, Irina Rish, Terry Yue Zhuo, Massimo Caccia Title: GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models Arxiv: http://arxiv.org/abs/2411.05830v1 Abstract: The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existi...

Nov 13, 2024•25 min•Ep. 66

Watermark Anything with Localized Messages

🤗 Paper Upvotes: 11 | cs.CV, cs.CR Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, Matthijs Douze Title: Watermark Anything with Localized Messages Arxiv: http://arxiv.org/abs/2411.07231v1 Abstract: Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubb...

Nov 13, 2024•23 min•Ep. 65

Autoregressive Models in Vision: A Survey

🤗 Paper Upvotes: 3 | cs.CV, cs.CL Authors: Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong Title: Autoregressive Models in Vision: A Survey Arxiv: http://arxiv.org/abs/2411.05902v1 Abstract: Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregr...

Nov 13, 2024•23 min•Ep. 64

LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

🤗 Paper Upvotes: 15 | cs.CV, cs.CL Authors: Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu Title: LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation Arxiv: http://arxiv.org/abs/2411.04997v1 Abstract: CLIP is one of the most important multimodal foundational models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape...

Nov 12, 2024•25 min•Ep. 63

Balancing Pipeline Parallelism with Vocabulary Parallelism

🤗 Paper Upvotes: 10 | cs.DC Authors: Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan Title: Balancing Pipeline Parallelism with Vocabulary Parallelism Arxiv: http://arxiv.org/abs/2411.05288v1 Abstract: Pipeline parallelism is widely used to scale the training of transformer-based large language models, various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and me...

Nov 12, 2024•24 min•Ep. 62

StdGEN: Semantic-Decomposed 3D Character Generation from Single Images

🤗 Paper Upvotes: 10 | cs.CV Authors: Yuze He, Yanning Zhou, Wang Zhao, Zhongkai Wu, Kaiwen Xiao, Wei Yang, Yong-Jin Liu, Xiao Han Title: StdGEN: Semantic-Decomposed 3D Character Generation from Single Images Arxiv: http://arxiv.org/abs/2411.05738v1 Abstract: We present StdGEN, an innovative pipeline for generating semantically decomposed high-quality 3D characters from single images, enabling broad applications in virtual reality, gaming, and filmmaking, etc. Unlike previous methods which strug...

Nov 12, 2024•22 min•Ep. 61

DELIFT: Data Efficient Language model Instruction Fine Tuning

🤗 Paper Upvotes: 5 | cs.CL Authors: Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevsky Title: DELIFT: Data Efficient Language model Instruction Fine Tuning Arxiv: http://arxiv.org/abs/2411.04425v2 Abstract: Fine-tuning large language models (LLMs) is essential for enhancing their performance on specific tasks but is often resource-intensive due to redundant or uninformative data. To address this inefficiency, we introduce DELIFT (Data Efficient Language model Instruction Fi...

Nov 12, 2024•21 min•Ep. 60

Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

🤗 Paper Upvotes: 4 | cs.SE, cs.AI, cs.LG Authors: André Storhaug, Jingyue Li Title: Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study Arxiv: http://arxiv.org/abs/2411.02462v1 Abstract: The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more per...

Nov 12, 2024•25 min•Ep. 59

RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

🤗 Paper Upvotes: 3 | cs.CV, cs.AI Authors: Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz Title: RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models Arxiv: http://arxiv.org/abs/2411.04097v1 Abstract: Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurio...

Nov 12, 2024•22 min•Ep. 58

The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities

🤗 Paper Upvotes: 3 | cs.CL Authors: Zhaofeng Wu, Xinyan Velocity Yu, Dani Yogatama, Jiasen Lu, Yoon Kim Title: The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities Arxiv: http://arxiv.org/abs/2411.04986v1 Abstract: Modern language models can process inputs across diverse languages and modalities. We hypothesize that models acquire this capability through learning a shared representation space across heterogeneous data types (e.g., different...

Nov 12, 2024•24 min•Ep. 57

Improving the detection of technical debt in Java source code with an enriched dataset

🤗 Paper Upvotes: 2 | cs.SE Authors: Nam Le Hai, Anh M. T. Bui, Phuong T. Nguyen, Davide Di Ruscio, Rick Kazman Title: Improving the detection of technical debt in Java source code with an enriched dataset Arxiv: http://arxiv.org/abs/2411.05457v1 Abstract: Technical debt (TD) is a term used to describe the additional work and costs that emerge when developers have opted for a quick and easy solution to a problem, rather than a more effective and well-designed, but time-consuming approach. Self-A...

Nov 12, 2024•26 min•Ep. 56

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

🤗 Paper Upvotes: 69 | cs.CL, cs.PL Authors: Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu Title: OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models Arxiv: http://arxiv.org/abs/2411.04905v1 Abstract: Large language models (LLMs) for code have become indispensable in various domains, including co...

Nov 09, 2024•23 min•Ep. 55

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

🤗 Paper Upvotes: 50 | cs.CV, cs.AI, cs.GR, cs.LG Authors: David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E. Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, Nataniel Ruiz Title: ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning Arxiv: http://arxiv.org/abs/2411.05003v1 Abstract: Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these method...

Nov 09, 2024•20 min•Ep. 54

BitNet a4.8: 4-bit Activations for 1-bit LLMs

🤗 Paper Upvotes: 41 | cs.CL, cs.LG Authors: Hongyu Wang, Shuming Ma, Furu Wei Title: BitNet a4.8: 4-bit Activations for 1-bit LLMs Arxiv: http://arxiv.org/abs/2411.04965v1 Abstract: Recent research on the 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and...

Nov 09, 2024•25 min•Ep. 53

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

🤗 Paper Upvotes: 27 | cs.CV, cs.AI, cs.GR Authors: Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, Yikai Wang Title: DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion Arxiv: http://arxiv.org/abs/2411.04928v1 Abstract: In this paper, we introduce DimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the ...

Nov 09, 2024•23 min•Ep. 52
