🤗 Upvotes: 67 | cs.CV Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning Arxiv: http://arxiv.org/abs/2505.03318v1 Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging...
May 08, 2025 • 22 min • Ep. 741
🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL Authors: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang Title: Absolute Zero: Reinforced Self-play Reasoning with Zero Data Arxiv: http://arxiv.org/abs/2505.03335v2 Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works tha...
May 08, 2025 • 25 min • Ep. 740
🤗 Upvotes: 23 | cs.CL, cs.AI, cs.LG, I.2.7 Authors: Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah Title: RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale Arxiv: http://arxiv.org/abs/2505.03005v1 Abstract: We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converte...
May 08, 2025 • 21 min • Ep. 739
🤗 Upvotes: 21 | cs.CV, cs.AI, cs.MM Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang Title: FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios Arxiv: http://arxiv.org/abs/2505.03730v1 Abstract: Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, sk...
May 08, 2025 • 20 min • Ep. 738
🤗 Upvotes: 56 | cs.AI, cs.CL, cs.SD Authors: Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu Title: Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Arxiv: http://arxiv.org/abs/2505.02707v1 Abstract: A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, ...
May 07, 2025 • 23 min • Ep. 737
🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji Title: RM-R1: Reward Modeling as Reasoning Arxiv: http://arxiv.org/abs/2505.02387v1 Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM)...
May 07, 2025 • 23 min • Ep. 736
🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG, I.2.7; I.2.6; I.2.3; I.7 Authors: Roman Abramov, Felix Steinbauer, Gjergji Kasneci Title: Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers Arxiv: http://arxiv.org/abs/2504.20752v1 Abstract: Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated t...
May 07, 2025 • 21 min • Ep. 735
🤗 Upvotes: 30 | cs.LG, stat.ML Authors: Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani Title: Practical Efficiency of Muon for Pretraining Arxiv: http://arxiv.org/abs/2505.02222v1 A...
May 07, 2025 • 23 min • Ep. 734
🤗 Upvotes: 24 | cs.CV Authors: Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang Title: PixelHacker: Image Inpainting with Structural and Semantic Consistency Arxiv: http://arxiv.org/abs/2504.20438v2 Abstract: Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demo...
May 06, 2025 • 18 min • Ep. 733
🤗 Upvotes: 31 | cs.CV Authors: Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, Xihui Liu Title: A Survey of Interactive Generative Video Arxiv: http://arxiv.org/abs/2504.21853v1 Abstract: Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to...
May 03, 2025 • 23 min • Ep. 732
🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG Authors: Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen Title: DeepCritic: Deliberate Critique with Large Language Models Arxiv: http://arxiv.org/abs/2505.00662v1 Abstract: As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on study...
May 03, 2025 • 22 min • Ep. 731
🤗 Upvotes: 44 | cs.CL, cs.AI Authors: Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan Title: Sadeed: Advancing Arabic Diacritization Through Small Language Model Arxiv: http://arxiv.org/abs/2504.21635v1 Abstract: Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language...
May 02, 2025 • 22 min • Ep. 730
🤗 Upvotes: 27 | cs.CL, cs.AI, cs.IR Authors: Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou Title: WebThinker: Empowering Large Reasoning Models with Deep Research Capability Arxiv: http://arxiv.org/abs/2504.21776v1 Abstract: Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, kn...
May 02, 2025 • 21 min • Ep. 729
🤗 Upvotes: 24 | cs.CL Authors: Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen Title: Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math Arxiv: http://arxiv.org/abs/2504.21233v1 Abstract: Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly ...
May 02, 2025 • 20 min • Ep. 728
🤗 Upvotes: 22 | cs.CV Authors: Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky Title: COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning Arxiv: http://arxiv.org/abs/2504.21850v1 Abstract: Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be par...
May 02, 2025 • 19 min • Ep. 727
🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example Arxiv: http://arxiv.org/abs/2504.20571v1 Abstract: We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizin...
May 01, 2025 • 22 min • Ep. 726
🤗 Upvotes: 44 | cs.CL, cs.AI, cs.CV, cs.IR, cs.LG Authors: Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang Title: UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities Arxiv: http://arxiv.org/abs/2504.20734v1 Abstract: Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG app...
May 01, 2025 • 22 min • Ep. 725
🤗 Upvotes: 36 | cs.AI, cs.CL, cs.IR, cs.LG Authors: Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer Title: ReasonIR: Training Retrievers for Reasoning Tasks Arxiv: http://arxiv.org/abs/2504.20595v1 Abstract: We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part becaus...
May 01, 2025 • 21 min • Ep. 724
🤗 Upvotes: 36 | cs.AI, cs.CL, cs.LG, stat.ME Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker Title: The Leaderboard Illusion Arxiv: http://arxiv.org/abs/2504.20879v1 Abstract: Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot...
May 01, 2025 • 21 min • Ep. 723
🤗 Upvotes: 28 | cs.CL Authors: Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang Title: Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models Arxiv: http://arxiv.org/abs/2504.20157v1 Abstract: Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We ...
May 01, 2025 • 21 min • Ep. 722
🤗 Upvotes: 22 | cs.CV Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen Title: RepText: Rendering Visual Text via Replicating Arxiv: http://arxiv.org/abs/2504.19724v1 Abstract: Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address t...
Apr 30, 2025 • 22 min • Ep. 721
🤗 Upvotes: 127 | cs.CV, cs.AI, cs.CL, cs.LG, cs.MM Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan Title: Towards Understanding Camera Motions in Any Video Arxiv: http://arxiv.org/abs/2504.15376v1 Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench cons...
Apr 29, 2025 • 22 min • Ep. 720
🤗 Upvotes: 43 | cs.CV Authors: Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou Title: Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning Arxiv: http://arxiv.org/abs/2504.16656v2 Abstract: We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learn...
Apr 29, 2025 • 21 min • Ep. 719
🤗 Upvotes: 25 | cs.CL, cs.LG Authors: Hongyu Wang, Shuming Ma, Furu Wei Title: BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs Arxiv: http://arxiv.org/abs/2504.18415v1 Abstract: Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-...
Apr 29, 2025 • 21 min • Ep. 718
🤗 Upvotes: 55 | cs.CV Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang Title: Step1X-Edit: A Practical Framework for General Image Editing Arxiv: http://arxiv.org/abs/2504.17761v1 Abstract: In recent years, image editing models have witnessed remarkable and...
Apr 26, 2025 • 21 min • Ep. 717
🤗 Upvotes: 50 | cs.CL Authors: Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang Title: Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Arxiv: http://arxiv.org/abs/2504.17192v1 Abstract: Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at u...
Apr 26, 2025 • 22 min • Ep. 716
🤗 Upvotes: 47 | cs.CV Authors: Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor Title: RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation Arxiv: http://arxiv.org/abs/2504.17502v1 Abstract: Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a r...
Apr 26, 2025 • 20 min • Ep. 715
🤗 Upvotes: 28 | cs.CV Authors: Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng Title: Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs Arxiv: http://arxiv.org/abs/2504.17432v1 Abstract: The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its eff...
Apr 26, 2025 • 25 min • Ep. 714
🤗 Upvotes: 39 | cs.CV, cs.AI Authors: Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu Title: DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning Arxiv: http://arxiv.org/abs/2504.14509v2 Abstract: In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping...
Apr 25, 2025 • 21 min • Ep. 713
🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG Authors: Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin Title: Trillion 7B Technical Report Arxiv: http://arxiv.org/abs/2504.15431v1 Abstract: We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and J...
Apr 25, 2025 • 26 min • Ep. 712