EDJE: Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

INSIGHT Lab, Ben-Gurion University of the Negev, Israel
*Equal Contribution

Figure 1: EDJE enables high-throughput vision-language reranking by compressing visual tokens.

Abstract

Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over precomputed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs per second while requiring only 49kB of disk storage per image, matching prior art on Flickr30k (zero-shot) and MS-COCO (fine-tuned) retrieval.


Method


Figure 2: Overview of the EDJE architecture.

EDJE introduces a lightweight attention-based adapter that compresses precomputed visual tokens into a small set, enabling compact storage and fast online inference. By decoupling the heavy visual feature extraction from the online reranking step, EDJE achieves large speedups without compromising retrieval accuracy; a minimal sketch of this two-stage pipeline is shown below.
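
To make the offline/online split concrete, the following is a minimal PyTorch-style sketch of the idea, not the authors' implementation: a handful of learned query tokens cross-attend over the precomputed vision tokens offline, only the compressed tokens are stored, and a compact joint encoder scores them together with the text online. The module names, the 16-query/768-dimension sizes, and the scoring head are illustrative assumptions.

# Minimal sketch (not the authors' code) of compressing precomputed visual
# tokens offline and scoring image-text pairs with a compact joint encoder online.
import torch
import torch.nn as nn

class VisualTokenCompressor(nn.Module):
    """Compress N precomputed vision tokens into K learned query tokens."""
    def __init__(self, dim=768, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision_tokens):
        # vision_tokens: (B, N, dim) from a frozen image encoder, computed offline.
        q = self.queries.unsqueeze(0).expand(vision_tokens.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, vision_tokens, vision_tokens)
        return self.norm(compressed)  # (B, K, dim): the tokens stored per image

# Offline: run the heavy image encoder once per image, compress, and persist.
compressor = VisualTokenCompressor()
vision_tokens = torch.randn(4, 196, 768)                 # e.g. ViT patch tokens for 4 images
compressed = compressor(vision_tokens)                   # (4, 16, 768)
torch.save(compressed.detach(), "compressed_tokens.pt")  # illustrative storage step

# Online: a compact joint encoder scores each (compressed image, text) pair.
# A generic TransformerEncoder stands in for the joint reranker here.
joint_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=2
)
score_head = nn.Linear(768, 1)
text_tokens = torch.randn(4, 32, 768)                    # embedded candidate captions
joint_out = joint_encoder(torch.cat([compressed, text_tokens], dim=1))
scores = score_head(joint_out[:, 0])                     # one relevance score per pair

At these illustrative sizes, 16 tokens of dimension 768 in float32 occupy 16 × 768 × 4 ≈ 49 kB per image, the same order as the per-image storage figure quoted in the abstract.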


Results

Comparison with SOTA

Table 1: Comparison with state-of-the-art methods on Flickr30k and MS-COCO.

Qualitative Results

Figure 3: Qualitative results demonstrating the retrieval capabilities of EDJE.

Our method matches the performance of heavy joint encoders while being orders of magnitude faster.


Citation

@article{taraday2025edje,
  title={Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking},
  author={Taraday, Mitchell Keren and Wagner, Shahaf and Baskin, Chaim},
  journal={arXiv preprint arXiv:2510.06820},
  year={2025}
}