Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval
May 26, 2025 • 16 min
Episode description
This paper introduces COCO-FACET, a new benchmark dataset for evaluating text-to-image retrieval models on attribute-focused queries, which differ from traditional queries based on general image captions. The researchers demonstrate that existing models, both CLIP-like and MLLM-based, struggle with these queries, especially when the target attribute is less prominent in the image or less explored in training data, as with time and weather. To address this, they propose promptable image embeddings built on multimodal large language models (MLLMs), an approach that significantly improves retrieval performance on attribute-focused queries. The paper also explores strategies for accelerating this method to make it more practical.
