Figure 1: EDJE enables high-throughput vision-language reranking by compressing visual tokens.
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over precomputed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs per second while requiring 49 kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
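The offline/online split behind this throughput claim can be illustrated with a short sketch. This is not the authors' code: `vision_encoder`, `adapter`, `joint_encoder`, `tokenizer`, and `store` are hypothetical stand-ins for the vision backbone, the EDJE compression adapter, the compact joint encoder, the text tokenizer, and a per-image key-value store of compressed tokens.

```python
# Minimal sketch of the offline/online split described above; not the authors' code.
# vision_encoder, adapter, joint_encoder, tokenizer, and store are hypothetical stand-ins.
import torch

@torch.no_grad()
def precompute_and_store(images, vision_encoder, adapter, store):
    """Offline stage: extract visual tokens once per image, compress them, persist the result."""
    for image_id, image in images:
        tokens = vision_encoder(image.unsqueeze(0))       # (1, N, d) full visual tokens
        compressed = adapter(tokens)                       # (1, k, d) with k << N
        store[image_id] = compressed.half().cpu()          # reduced precision keeps per-image storage small

@torch.no_grad()
def rerank(query_text, candidate_ids, store, tokenizer, joint_encoder):
    """Online stage: score one text query against precomputed, compressed visual tokens."""
    text_tokens = tokenizer(query_text)                    # only the text is encoded at query time
    scores = []
    for image_id in candidate_ids:
        visual = store[image_id].float()                   # no image-encoder pass online
        scores.append(joint_encoder(visual, text_tokens).item())
    # higher score = more relevant; return candidates best-first
    return sorted(zip(candidate_ids, scores), key=lambda s: s[1], reverse=True)
```

In a setup like this, the per-image footprint is fixed by the number of compressed tokens, their dimension, and the storage precision, while online cost depends only on the compact joint encoder.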
Figure 2: Overview of the EDJE architecture.
EDJE introduces a lightweight adapter that compresses precomputed visual tokens, allowing for efficient storage and fast online inference. By decoupling the heavy visual feature extraction from the online reranking process, we achieve significant speedups without compromising accuracy.
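The overview above does not pin down the adapter's exact design, so the following is a minimal sketch under the assumption of a Perceiver/Q-Former-style module: a small set of learned query tokens cross-attends to the full visual-token sequence, producing a fixed, much shorter sequence. The class name, token counts, and dimensions are illustrative only.

```python
# A plausible attention-based compression adapter (assumed design, not the paper's exact module):
# learned query tokens cross-attend to the full visual-token sequence and return k << N tokens.
import torch
import torch.nn as nn

class CompressionAdapter(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)   # learned latent tokens
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_tokens):                      # visual_tokens: (B, N, dim), N large
        B = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)    # (B, k, dim), k << N
        attended, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        x = self.norm(q + attended)                        # residual + norm
        return x + self.mlp(x)                             # (B, k, dim) compressed tokens

# Illustrative usage: compress 577 ViT tokens (CLS + 24x24 patches) down to 32.
adapter = CompressionAdapter()
compressed = adapter(torch.randn(2, 577, 768))
print(compressed.shape)   # torch.Size([2, 32, 768])
```

In a design of this kind, the number of learned queries directly controls the storage/accuracy trade-off, since only the adapter's output is kept on disk.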
Table 1: Comparison with state-of-the-art methods on Flickr30k and MS-COCO.
Figure 3: Qualitative results demonstrating the retrieval capabilities of EDJE.
Our method matches the performance of heavy joint encoders while being orders of magnitude faster.
@article{taraday2025edje,
  title={Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking},
  author={Taraday, Mitchell Keren and Wagner, Shahaf and Baskin, Chaim},
  journal={arXiv preprint arXiv:2510.06820},
  year={2025}
}