🤗 Upvotes: 26 | cs.CL Authors: An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang Title: Qwen2.5-1M Technical Report Arxiv: http://arxiv.org/abs/2501.15383v1 Abstract: We introduce Qwen2.5-1M, a series of models that exte...
Jan 29, 2025•24 min•Ep. 441
🤗 Upvotes: 13 | cs.CL Authors: Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao Title: ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer Arxiv: http://arxiv.org/abs/2501.15570v1 Abstract: As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressive...
Jan 29, 2025•21 min•Ep. 440
🤗 Upvotes: 13 | cs.LG, cs.AI Authors: Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat Title: Towards General-Purpose Model-Free Reinforcement Learning Arxiv: http://arxiv.org/abs/2501.16142v1 Abstract: Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods...
Jan 29, 2025•21 min•Ep. 439
🤗 Upvotes: 11 | cs.SD, cs.CL, eess.AS Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu Title: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Arxiv: http://arxiv.org/abs/2501.15907v1 Abstract: Recent advancements in speech generation have been driven by the large-scale training datasets. However, current models fall short ...
Jan 29, 2025•22 min•Ep. 438
🤗 Upvotes: 9 | cs.CV, cs.AI Authors: Chuanyang Zheng Title: iFormer: Integrating ConvNet and Transformer for Mobile Application Arxiv: http://arxiv.org/abs/2501.15369v1 Abstract: We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are ...
Jan 29, 2025•24 min•Ep. 437
🤗 Upvotes: 7 | cs.CV, cs.AI, cs.LG, q-bio.NC Authors: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper Title: Are Vision Language Models Texture or Shape Biased and Can We Steer Them? Arxiv: http://arxiv.org/abs/2403.09193v1 Abstract: Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classi...
Jan 29, 2025•25 min•Ep. 436
🤗 Upvotes: 5 | cs.LG Authors: Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini Title: CodeMonkeys: Scaling Test-Time Compute for Software Engineering Arxiv: http://arxiv.org/abs/2501.14723v1 Abstract: Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem...
Jan 29, 2025•23 min•Ep. 435
🤗 Upvotes: 4 | cs.LG, cs.AI Authors: Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak Title: Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models Arxiv: http://arxiv.org/abs/2501.12370v2 Abstract: Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the num...
Jan 29, 2025•21 min•Ep. 434
🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Joh...
Jan 28, 2025•23 min•Ep. 433
🤗 Upvotes: 26 | cs.IR, cs.CL Authors: Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei Title: Chain-of-Retrieval Augmented Generation Arxiv: http://arxiv.org/abs/2501.14342v1 Abstract: This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiv...
Jan 28, 2025•23 min•Ep. 432
🤗 Upvotes: 22 | cs.CL, cs.AI Authors: Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai Title: Redundancy Principles for MLLMs Benchmarks Arxiv: http://arxiv.org/abs/2501.13953v1 Abstract: With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redund...
Jan 28, 2025•22 min•Ep. 431
🤗 Upvotes: 13 | cs.CL, cs.AI, cs.LG Authors: Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin Title: RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques Arxiv: http://arxiv.org/abs/2501.14492v1 Abstract: Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and s...
Jan 28, 2025•24 min•Ep. 430
🤗 Upvotes: 7 | cs.LG, cs.AI Authors: Micah Rentschler, Jesse Roberts Title: RL + Transformer = A General-Purpose Problem Solver Arxiv: http://arxiv.org/abs/2501.14176v1 Abstract: What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problem...
Jan 28, 2025•24 min•Ep. 429
🤗 Upvotes: 5 | cs.CV, cs.GR Authors: Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, Roman Lubachersky, Chenglei Wu, Javier Romero, Jason Saragih, Michael Zollhoefer, Andreas Geiger, Siyu Tang, Shunsuke Saito Title: Relightable Full-Body Gaussian Codec Avatars Arxiv: http://arxiv.org/abs/2501.14726v1 Abstract: We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relig...
Jan 28, 2025•21 min•Ep. 428
🤗 Upvotes: 4 | cs.CL, cs.AI Authors: Sara Kothari, Ayush Gupta Title: Question Answering on Patient Medical Records with Private Fine-Tuned LLMs Arxiv: http://arxiv.org/abs/2501.13687v1 Abstract: Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and inter...
Jan 28, 2025•22 min•Ep. 427
🤗 Upvotes: 3 | cs.CV Authors: Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan Title: GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing Arxiv: http://arxiv.org/abs/2501.13925v1 Abstract: Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models pe...
Jan 28, 2025•23 min•Ep. 426
🤗 Upvotes: 2 | cs.CV Authors: Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, Fahad Shahbaz Khan Title: AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation Arxiv: http://arxiv.org/abs/2403.14614v1 Abstract: In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To rec...
Jan 28, 2025•21 min•Ep. 425
🤗 Upvotes: 2 | cs.CV Authors: Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas Title: Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning Arxiv: http://arxiv.org/abs/2411.19458v1 Abstract: Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their abilities on grasping 3D spatial relationships are still unclear....
Jan 28, 2025•24 min•Ep. 424
🤗 Upvotes: 46 | cs.LG, cs.AI, cs.MA, I.2.11 Authors: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev Title: SRMT: Shared Memory for Multi-agent Lifelong Pathfinding Arxiv: http://arxiv.org/abs/2501.13200v1 Abstract: Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. T...
Jan 25, 2025•24 min•Ep. 423
🤗 Upvotes: 33 | cs.CL Authors: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang Title: Sigma: Differential Rescaling of Query, Key and Value for Efficient...
Jan 25, 2025•21 min•Ep. 422
🤗 Upvotes: 30 | cs.CV, cs.AI, cs.GR, cs.LG Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang Title: Improving Video Generation with Human Feedback Arxiv: http://arxiv.org/abs/2501.13918v1 Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misa...
Jan 25, 2025•24 min•Ep. 421
🤗 Upvotes: 15 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy Title: Temporal Preference Optimization for Long-Form Video Understanding Arxiv: http://arxiv.org/abs/2501.13919v1 Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization...
Jan 25, 2025•25 min•Ep. 420
🤗 Upvotes: 14 | cs.CV, cs.AI, cs.CL Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng Title: Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Arxiv: http://arxiv.org/abs/2501.13926v1 Abstract: Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying...
Jan 25, 2025•21 min•Ep. 419
🤗 Upvotes: 10 | cs.CV, cs.CL Authors: Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu Title: Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Arxiv: http://arxiv.org/abs/2501.13826v1 Abstract: Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating...
Jan 25, 2025•21 min•Ep. 418
🤗 Upvotes: 8 | cs.CV Authors: Xiaowen Li, Haolan Xue, Peiran Ren, Liefeng Bo Title: DiffuEraser: A Diffusion Model for Video Inpainting Arxiv: http://arxiv.org/abs/2501.10018v1 Abstract: Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounte...
Jan 25, 2025•22 min•Ep. 417
🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG Authors: Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li Title: IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models Arxiv: http://arxiv.org/abs/2501.13920v1 Abstract: With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abi...
Jan 25, 2025•29 min•Ep. 416
🤗 Upvotes: 7 | cs.LG, cs.AI Authors: Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang Title: Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback Arxiv: http://arxiv.org/abs/2501.10799v1 Abstract: Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought p...
Jan 25, 2025•21 min•Ep. 415
🤗 Upvotes: 5 | cs.CV, cs.AI, cs.LG Authors: Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng Title: One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt Arxiv: http://arxiv.org/abs/2501.13554v1 Abstract: Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for ...
Jan 25, 2025•22 min•Ep. 414
🤗 Upvotes: 109 | cs.CL, cs.AI, cs.LG Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting C...
Jan 24, 2025•21 min•Ep. 413
🤗 Upvotes: 44 | cs.CV Authors: Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao Title: VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Arxiv: http://arxiv.org/abs/2501.13106v2 Abstract: In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philoso...
Jan 24, 2025•23 min•Ep. 412