🤗 Upvotes: 83 | cs.CL Authors: Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii Title: Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA Arxiv: http://arxiv.org/abs/2505.21115v1 Abstract: Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the tempo...
Jun 10, 2025•22 min•Ep. 891
🤗 Upvotes: 27 | cs.SD, cs.AI, eess.AS Authors: Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang Title: FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion Arxiv: http://arxiv.org/abs/2506.01111v1 Abstract: High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily du...
Jun 10, 2025•21 min•Ep. 890
🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL, cs.LG Authors: Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang Title: MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Arxiv: http://arxiv.org/abs/2506.05523v1 Abstract: Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in thr...
Jun 10, 2025•23 min•Ep. 889
🤗 Upvotes: 25 | cs.CL Authors: Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay Title: Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs Arxiv: http://arxiv.org/abs/2506.05629v1 Abstract: The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to...
Jun 10, 2025•20 min•Ep. 888
🤗 Upvotes: 39 | cs.CV Authors: Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang Title: SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training Arxiv: http://arxiv.org/abs/2506.05301v1 Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during infere...
Jun 07, 2025•22 min•Ep. 887
🤗 Upvotes: 38 | cs.CL, cs.CV Authors: Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang Title: ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development Arxiv: http://arxiv.org/abs/2506.05010v1 Abstract: We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and u...
Jun 07, 2025•21 min•Ep. 886
🤗 Upvotes: 32 | cs.LG, cs.CL Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets Title: Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts Arxiv: http://arxiv.org/abs/2506.05229v1 Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. Ho...
Jun 07, 2025•20 min•Ep. 885
🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV Authors: Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang Title: RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics Arxiv: http://arxiv.org/abs/2506.04308v1 Abstract: Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language mod...
Jun 07, 2025•24 min•Ep. 884
🤗 Upvotes: 30 | cs.CV Authors: Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein Title: Video World Models with Long-term Spatial Memory Arxiv: http://arxiv.org/abs/2506.05284v1 Abstract: Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, le...
Jun 07, 2025•22 min•Ep. 883
🤗 Upvotes: 27 | cs.AI Authors: Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, Mickaël Chen, Alexandra D. Constantinou, Antoine d'Andigné, Hubert de La Jonquière, Aurélien Delfosse, Ludovic Denoyer, Alexis Deprez, Augustin Derupti, Michael Eickenberg, Mathïs Federico, Charles Kantor, Xavier Koegler, Yann Labbé, Matthew C. H. Lee, Erwan Le Jumeau de Kergaradec, Amir Mahla, Avshalom Manevich, ...
Jun 07, 2025•25 min•Ep. 882
🤗 Upvotes: 24 | cs.CL Authors: Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou Title: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models Arxiv: http://arxiv.org/abs/2506.05176v1 Abstract: In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upo...
Jun 07, 2025•21 min•Ep. 881
🤗 Upvotes: 23 | cs.CV Authors: Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng Title: VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models Arxiv: http://arxiv.org/abs/2505.23656v1 Abstract: Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their...
Jun 07, 2025•23 min•Ep. 880
🤗 Upvotes: 22 | cs.CL, cs.LG Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray Title: The Common Pile...
Jun 07, 2025•18 min•Ep. 879
🤗 Upvotes: 21 | cs.CV Authors: Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Khan Title: VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos Arxiv: http://arxiv.org/abs/2506.05349v1 Abstract: Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten o...
Jun 07, 2025•21 min•Ep. 878
🤗 Upvotes: 58 | cs.CL Authors: Xiaomi LLM-Core Team: Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, ...
Jun 06, 2025•19 min•Ep. 877
🤗 Upvotes: 41 | cs.LG, cs.AI, cs.CL, cs.CV Authors: Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng Title: Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning Arxiv: http://arxiv.org/abs/2506.04207v1 Abstract: Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (M...
Jun 06, 2025•20 min•Ep. 876
🤗 Upvotes: 39 | cs.LG, cs.AI, cs.CL, cs.RO Authors: Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov Title: AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment Arxiv: http://arxiv.org/abs/2506.04089v1 Abstract: As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge...
Jun 06, 2025•21 min•Ep. 875
🤗 Upvotes: 35 | cs.AR, cs.AI, cs.CL, cs.LG, cs.PL Authors: Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud Title: CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark Arxiv: http://arxiv.org/abs/2505.16968v3 Abstract: We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RD...
Jun 06, 2025•23 min•Ep. 874
🤗 Upvotes: 30 | cs.CL Authors: Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov Title: A Controllable Examination for Long-Context Language Models Arxiv: http://arxiv.org/abs/2506.02921v1 Abstract: Existing frameworks for evaluating long-context language models (LCLM) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret...
Jun 06, 2025•22 min•Ep. 873
🤗 Upvotes: 25 | cs.CV, cs.CL Authors: Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao Title: MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos Arxiv: http://arxiv.org/abs/2506.04141v1 Abstract: The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mai...
Jun 06, 2025•23 min•Ep. 872
🤗 Upvotes: 23 | cs.CL Authors: Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao Title: Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis Arxiv: http://arxiv.org/abs/2506.04142v1 Abstract: The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on construc...
Jun 06, 2025•20 min•Ep. 871
🤗 Upvotes: 23 | cs.CL Authors: Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee Title: SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models Arxiv: http://arxiv.org/abs/2506.04180v1 Abstract: Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter...
Jun 06, 2025•22 min•Ep. 870
🤗 Upvotes: 144 | cs.CL Authors: Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh Title: Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning Arxiv: http://arxiv.org/abs/2505.24726v1 Abstract: We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrect...
Jun 05, 2025•23 min•Ep. 869
🤗 Upvotes: 51 | cs.AI Authors: Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang Title: VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments Arxiv: http://arxiv.org/abs/2506.02387v1 Abstract: Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenar...
Jun 05, 2025•24 min•Ep. 868
🤗 Upvotes: 49 | cs.CV, cs.AI, cs.CL Authors: Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan Title: UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation Arxiv: http://arxiv.org/abs/2506.03147v2 Abstract: Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image ...
Jun 05, 2025•19 min•Ep. 867
🤗 Upvotes: 46 | cs.LG, cs.CL, cs.CV Authors: Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh Title: SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Arxiv: http://arxiv.org/abs/2506.02096v1 Abstract: Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this e...
Jun 05, 2025•18 min•Ep. 866
🤗 Upvotes: 43 | cs.CV, cs.AI Authors: Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song Title: CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs Arxiv: http://arxiv.org/abs/2505.24120v1 Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks pr...
Jun 05, 2025•21 min•Ep. 865
🤗 Upvotes: 29 | cs.CL, cs.AI, cs.CV Authors: Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao Title: GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents Arxiv: http://arxiv.org/abs/2506.03143v1 Abstract: One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the app...
Jun 05, 2025•22 min•Ep. 864
🤗 Upvotes: 29 | cs.CV, cs.RO Authors: Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu Title: Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces Arxiv: http://arxiv.org/abs/2506.00123v1 Abstract: The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing ...
Jun 05, 2025•22 min•Ep. 863
🤗 Upvotes: 28 | cs.AI Authors: Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang Title: OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation Arxiv: http://arxiv.org/abs/2506.02397v1 Abstract: Recent advanced large reasoning models (LRMs) leverage extended chain-of-thought (CoT) reasoning to solve complex tasks, achieving state-of-the-art performance. Despite their success, we identify a critical ...
Jun 05, 2025•24 min•Ep. 862