🤗 Upvotes: 4 | cs.CV, cs.AI, cs.CL, cs.LG Authors: Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal Title: MMFactory: A Universal Solution Search Engine for Vision-Language Tasks Arxiv: http://arxiv.org/abs/2412.18072v1 Abstract: With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single ...
Dec 28, 2024•21 min•Ep. 291
🤗 Upvotes: 2 | cs.IR, cs.AI Authors: Yucong Luo, Qitao Qin, Hao Zhang, Mingyue Cheng, Ruiran Yan, Kefan Wang, Jie Ouyang Title: Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation Arxiv: http://arxiv.org/abs/2412.18176v1 Abstract: Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LL...
Dec 28, 2024•22 min•Ep. 290
🤗 Upvotes: 21 | cs.CV Authors: Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo Title: DepthLab: From Partial to Complete Arxiv: http://arxiv.org/abs/2412.18153v1 Abstract: Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting ...
Dec 26, 2024•22 min•Ep. 289
🤗 Upvotes: 20 | cs.AI, cs.CL Authors: Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xue Kai Zhu, Bowen Zhou Title: Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization Arxiv: http://arxiv.org/abs/2412.17739v1 Abstract: Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within attentio...
Dec 26, 2024•22 min•Ep. 288
🤗 Upvotes: 10 | cs.CV, cs.AI, cs.MM Authors: Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue Title: DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation Arxiv: http://arxiv.org/abs/2412.18597v1 Abstract: Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, the current video generation ...
Dec 26, 2024•22 min•Ep. 287
🤗 Upvotes: 8 | cs.CL, cs.AI Authors: Łukasz Borchmann Title: In Case You Missed It: ARC 'Challenge' Is Not That Challenging Arxiv: http://arxiv.org/abs/2412.17758v1 Abstract: ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet ...
Dec 26, 2024•24 min•Ep. 286
🤗 Upvotes: 8 | cs.LG Authors: Ziteng Wang, Jianfei Chen, Jun Zhu Title: ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing Arxiv: http://arxiv.org/abs/2412.14711v1 Abstract: Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, ...
Dec 26, 2024•21 min•Ep. 285
🤗 Upvotes: 6 | cs.CL Authors: Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary Title: SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval Arxiv: http://arxiv.org/abs/2412.15443v1 Abstract: Retrieval-Augmented Generation (RAG) systems have become pivotal in leveraging vast corpora to generate informed and contextually relevant responses, notably reducing hallucinations in Large Language Models. Despite significant advancements, these sy...
Dec 26, 2024•22 min•Ep. 284
🤗 Upvotes: 5 | cs.CV Authors: Minghao Chen, Roman Shapovalov, Iro Laina, Tom Monnier, Jianyuan Wang, David Novotny, Andrea Vedaldi Title: PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models Arxiv: http://arxiv.org/abs/2412.18608v1 Abstract: Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mix...
Dec 26, 2024•26 min•Ep. 283
🤗 Upvotes: 3 | cs.CV, cs.AI Authors: Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin Title: MotiF: Making Text Count in Image Animation with Motion Focal Loss Arxiv: http://arxiv.org/abs/2412.16153v1 Abstract: Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, parti...
Dec 26, 2024•23 min•Ep. 282
🤗 Upvotes: 3 | cs.AI, cs.CL, cs.CY, cs.LG, cs.MM Authors: Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole,...
Dec 26, 2024•25 min•Ep. 281
🤗 Upvotes: 64 | cs.CL, cs.AI Authors: Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, Ming Zhang Title: RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response Arxiv: http://arxiv.org/abs/2412.14922v1 Abstract: Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the collected data inevitably contains noise in practical applications,...
Dec 25, 2024•22 min•Ep. 280
🤗 Upvotes: 29 | cs.AI, cs.CL, cs.LG Authors: Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He Title: B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners Arxiv: http://arxiv.org/abs/2412.17256v1 Abstract: In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement -- where models are trained on their own outputs -- has emerged as a primary method for enhancing performance. However, the critical factors ...
Dec 25, 2024•21 min•Ep. 279
🤗 Upvotes: 26 | cs.CV, cs.LG Authors: Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin Title: Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching Arxiv: http://arxiv.org/abs/2412.17153v2 Abstract: Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or ...
Dec 25, 2024•24 min•Ep. 278
🤗 Upvotes: 23 | cs.CL, cs.AI, cs.CV, cs.LG Authors: Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He Title: Diving into Self-Evolving Training for Multimodal Reasoning Arxiv: http://arxiv.org/abs/2412.17451v1 Abstract: Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing re...
Dec 25, 2024•21 min•Ep. 277
🤗 Upvotes: 16 | cs.CL, cs.AI, cs.LG Authors: Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam Title: Deliberation in Latent Space via Differentiable Cache Augmentation Arxiv: http://arxiv.org/abs/2412.17747v1 Abstract: Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before resp...
Dec 25, 2024•22 min•Ep. 276
🤗 Upvotes: 15 | cs.CV Authors: Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen Title: Large Motion Video Autoencoding with Cross-modal Video VAE Arxiv: http://arxiv.org/abs/2412.17805v1 Abstract: Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compre...
Dec 25, 2024•25 min•Ep. 275
🤗 Upvotes: 12 | cs.AI Authors: OpenAI: Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Be...
Dec 25, 2024•25 min•Ep. 274
🤗 Upvotes: 12 | cs.CL, cs.AI, cs.LG Authors: Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob Oh, Siddharth Dalmia, Prateek Kolhar Title: Revisiting In-Context Learning with Long Context Language Models Arxiv: http://arxiv.org/abs/2412.16926v1 Abstract: In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making ex...
Dec 25, 2024•24 min•Ep. 273
🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG, cs.SE Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang Title: Outcome-Refining Process Supervision for Code Generation Arxiv: http://arxiv.org/abs/2412.15118v1 Abstract: Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows pr...
Dec 25, 2024•21 min•Ep. 272
🤗 Upvotes: 9 | cs.CY, cs.AI, cs.LG Authors: LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strin...
Dec 25, 2024•27 min•Ep. 271
🤗 Upvotes: 34 | cs.CV Authors: Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu Title: Parallelized Autoregressive Visual Generation Arxiv: http://arxiv.org/abs/2412.15119v1 Abstract: Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized au...
Dec 24, 2024•23 min•Ep. 270
🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL Authors: Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu Title: Offline Reinforcement Learning for LLM Multi-Step Reasoning Arxiv: http://arxiv.org/abs/2412.16145v1 Abstract: Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with ...
Dec 24, 2024•21 min•Ep. 269
🤗 Upvotes: 17 | cs.CL Authors: Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou Title: SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation Arxiv: http://arxiv.org/abs/2412.13649v1 Abstract: Key-Value (KV) cache has become a bottleneck of LLMs for long-context generation. Despite the numerous efforts in this area, the optimization for the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output ...
Dec 24, 2024•22 min•Ep. 268
🤗 Upvotes: 13 | cs.CV Authors: Songhua Liu, Zhenxiong Tan, Xinchao Wang Title: CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up Arxiv: http://arxiv.org/abs/2412.16112v1 Abstract: Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this iss...
Dec 24, 2024•25 min•Ep. 267
🤗 Upvotes: 12 | cs.CV, cs.LG, cs.SD, eess.AS Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji Title: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis Arxiv: http://arxiv.org/abs/2412.15322v1 Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on ...
Dec 24, 2024•23 min•Ep. 266
🤗 Upvotes: 9 | cs.CV Authors: Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon Title: Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage Arxiv: http://arxiv.org/abs/2412.15484v1 Abstract: Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We...
Dec 24, 2024•28 min•Ep. 265
🤗 Upvotes: 6 | cs.CV, 68U10, 68T10, I.4.5; I.2.10 Authors: Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, Eunbyung Park Title: Sequence Matters: Harnessing Video Models in 3D Super-Resolution Arxiv: http://arxiv.org/abs/2412.11525v3 Abstract: 3D super-resolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution ima...
Dec 24, 2024•22 min•Ep. 264
🤗 Upvotes: 5 | cs.CV, cs.LG Authors: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu Title: TRecViT: A Recurrent Video Transformer Arxiv: http://arxiv.org/abs/2412.14294v1 Abstract: We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurre...
Dec 24, 2024•25 min•Ep. 263
🤗 Upvotes: 4 | cs.LG Authors: Zhen Zheng, Xiaonan Song, Chuanjie Liu Title: MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design Arxiv: http://arxiv.org/abs/2412.14590v1 Abstract: Quantization has become one of the most effective methodologies to compress LLMs into smaller size. However, the existing quantization solutions still show limitations of either non-negligible accuracy drop or system inefficiency. In this paper, we make a comp...
Dec 24, 2024•23 min•Ep. 262