CIRCUS · Benchmark audit & clean subsets

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

Matteo Attimonelli🦪🏦 · Alessandro De Bellis🦪 · Aryo Pradipta Gema🦄 · Rohit Saxena🦄 · Monica Sekoyan🦄 · Wai-Chung Kwan🦄 · Claudio Pomo🦪 · Alessandro Suglia🦄 · Dietmar Jannach🐉 · Tommaso Di Noia🦪 · Pasquale Minervini🦄🏰

🦪 Politecnico di Bari 🏦 Sapienza University of Rome 🦄 University of Edinburgh 🐉 University of Klagenfurt 🏰 Miniml.AI

Contact: matteo.attimonelli@poliba.it

Examples of CIR queries where a retriever finds the ground-truth target from a single modality — the reference image alone or the modification text alone.
Many “composed” queries in CIRR, FashionIQ, LaSCo, and CIRCO are solvable from one modality. CIRCUS finds them and removes them.

TL;DR. Composed Image Retrieval (CIR) should require both a reference image and a modification text. Across the four standard CIR benchmarks — CIRR, FashionIQ, LaSCo, CIRCO — we audit 11 retrievers (9 open-source + 2 commercial APIs) and find that a large share of test queries can be solved with only the image or only the text. CIRCUS releases the audit, the resulting shortcut-free subset (23 228 queries), and a human-validated subset (1 689 queries) that survived both filters — plus a normalised Composition Gap metric to quantify how much a retriever actually composes.

By the numbers

11 retrievers audited
4 CIR benchmarks
23 228 shortcut-free queries
1 689 human-validated queries

How CIRCUS works

  1. Encode each query three times for every retriever in the pool: with both modalities (the standard CIR setting), with only the text (image zeroed out), and with only the image (text dropped).
  2. Record the rank of the ground-truth target under each modality.
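  3. Aggregate across the 11-retriever pool and label each query as follows, where a retriever “solves” a query if it ranks the ground-truth target within the top $K$ ($K = 10$ in the audit):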
    • composition_required — at least one retriever solves it with both modalities; none solves it unimodally.
    • unresolved — no retriever solves it under any modality.
    • shortcut_solvable — at least one retriever solves it with only image or only text. Excluded.
    • shortcut_free — the union of composition_required and unresolved.
  4. Human validation: re-show the surviving queries to multiple annotators using the rubric below. Queries are kept only when a majority of annotators marked them as well-formed CIR.
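
The labelling rule in step 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the repository's implementation: the `ranks` layout, the mode keys (`multimodal`, `text_only`, `image_only`), and the function names are assumptions made for the example; only the label names and the top-$K$ rule come from the description above.

```python
from typing import Dict

K = 10  # cutoff used in the audit tables

# Assumed layout: ranks[retriever][mode] -> 1-based rank of the ground-truth target.
Ranks = Dict[str, Dict[str, int]]


def label_query(ranks: Ranks, k: int = K) -> str:
    """Assign one audit label to a query from its per-retriever ranks."""
    # A single unimodal hit by any retriever marks the query as a shortcut.
    solved_unimodal = any(
        r["text_only"] <= k or r["image_only"] <= k for r in ranks.values()
    )
    if solved_unimodal:
        return "shortcut_solvable"
    solved_multimodal = any(r["multimodal"] <= k for r in ranks.values())
    return "composition_required" if solved_multimodal else "unresolved"


def is_shortcut_free(label: str) -> bool:
    """shortcut_free is the union of composition_required and unresolved."""
    return label in {"composition_required", "unresolved"}


# Hypothetical ranks for one query and two placeholder retrievers.
example = {
    "retriever_a": {"multimodal": 3, "text_only": 57, "image_only": 120},
    "retriever_b": {"multimodal": 41, "text_only": 12, "image_only": 88},
}
print(label_query(example))  # composition_required
```

Checking the unimodal condition first means a query counts as shortcut_solvable even when some retriever also solves it multimodally, which matches the definitions in the list.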

Validity rubric (used during human validation)

Side-by-side examples of the validity categories: valid, invalid text, invalid image, invalid target, and overly-broad queries.
Categories shown to annotators: VALIDATED, INVALID_TEXT_QUERY, INVALID_IMAGE_QUERY, INVALID_TARGET_IMAGE, QUERY_TOO_BROAD. The full rubric (severity ordering, decision flow, the ≥10-alternatives heuristic) is in annotations/annotation_instructions.md.
Screenshot of the annotation interface shown to annotators.
The annotation interface every annotator worked with.
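
The majority rule used in human validation can be sketched as follows. The data layout is an assumption: each annotator's verdict is modelled as a set of category flags, and a verdict counts as well-formed only when it is exactly `{"VALIDATED"}`. The strict-majority tie handling is also an assumption; the authoritative procedure is the one in annotations/annotation_instructions.md.

```python
from typing import List, Set

VALID_FLAG = "VALIDATED"


def keep_query(verdicts: List[Set[str]]) -> bool:
    """Keep a query only if a strict majority of annotators marked it valid.

    Assumption: ties (e.g. 1 of 2) do not count as a majority.
    """
    if not verdicts:
        return False
    n_valid = sum(1 for flags in verdicts if flags == {VALID_FLAG})
    return 2 * n_valid > len(verdicts)


# Hypothetical verdicts from three annotators for one query.
print(keep_query([{"VALIDATED"}, {"VALIDATED"}, {"QUERY_TOO_BROAD"}]))  # True
print(keep_query([{"VALIDATED"}, {"INVALID_TEXT_QUERY"}]))  # False
```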

Inter-annotator agreement

Pairwise Cohen's κ heatmap across annotators.
Pairwise Cohen's κ across the annotator pool.
Pairwise full-signature agreement heatmap across annotators.
Pairwise full-signature agreement (same set of category flags).

The full write-up — including Krippendorff's α — is in annotations/agreement_report.md.
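
For readers who want to recompute the two pairwise statistics shown above, here is a small pure-Python sketch. It assumes each annotator's output is a sequence of per-query labels (for κ) or per-query flag sets (for full-signature agreement), aligned by query. How the agreement report maps multi-flag annotations onto κ categories is not specified here, so treat the input encoding as hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty, equally long label sequences")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two annotators' marginal label frequencies.
    p_expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def signature_agreement(
    a: Sequence[Iterable[str]], b: Sequence[Iterable[str]]
) -> float:
    """Fraction of items on which both annotators chose the same flag set."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty, equally long sequences")
    return sum(frozenset(x) == frozenset(y) for x, y in zip(a, b)) / len(a)


# Hypothetical labels from two annotators on four queries.
ann_1 = ["VALIDATED", "VALIDATED", "QUERY_TOO_BROAD", "QUERY_TOO_BROAD"]
ann_2 = ["VALIDATED", "QUERY_TOO_BROAD", "QUERY_TOO_BROAD", "QUERY_TOO_BROAD"]
print(cohen_kappa(ann_1, ann_2))  # 0.5
```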

Shortcut audit (Table 1)

Per-benchmark split sizes after running the audit across all 11 retrievers at $K = 10$.

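| Benchmark | Total | composition required | unresolved | shortcut-free | shortcut-solvable |
|---|---|---|---|---|---|
| CIRR | 4 170 | 271 | 414 | 685 | 3 485 |
| FashionIQ | 6 003 | 1 462 | 2 607 | 4 069 | 1 934 |
| LaSCo | 30 031 | 2 064 | 16 354 | 18 418 | 11 613 |
| CIRCO | 220 | 53 | 3 | 56 | 164 |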

“shortcut-free” = composition_required ∪ unresolved.
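
As a quick sanity check, the split sizes in Table 1 can be verified to form a partition of each benchmark, and the shortcut-free counts can be summed to the headline figure. The numbers below are copied from the table; the variable names are illustrative only.

```python
# (total, composition_required, unresolved, shortcut_free, shortcut_solvable)
TABLE_1 = {
    "CIRR": (4170, 271, 414, 685, 3485),
    "FashionIQ": (6003, 1462, 2607, 4069, 1934),
    "LaSCo": (30031, 2064, 16354, 18418, 11613),
    "CIRCO": (220, 53, 3, 56, 164),
}

for name, (total, comp_req, unresolved, sf, shortcut) in TABLE_1.items():
    # shortcut-free is the union of the two disjoint labels ...
    assert comp_req + unresolved == sf, name
    # ... and together with shortcut-solvable it covers the whole test set.
    assert sf + shortcut == total, name

total_shortcut_free = sum(row[3] for row in TABLE_1.values())
print(total_shortcut_free)  # 23228

# Share of each benchmark that at least one retriever solves unimodally (%).
shortcut_share = {
    name: round(100 * row[4] / row[0], 1) for name, row in TABLE_1.items()
}
print(shortcut_share)
```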

Human-validated subset (Table 2)

For CIRR and CIRCO the full shortcut-free residue was audited; for FashionIQ and LaSCo a stratified sample of 1 000 composition_required + 1 000 unresolved queries per benchmark was audited.

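| Benchmark | audited (comp_req) | validated_solved | audited (unresolved) | validated_unsolved | total validated |
|---|---|---|---|---|---|
| CIRR | 271 | 147 | 414 | 156 | 303 |
| FashionIQ | 1 000 | 368 | 1 000 | 218 | 586 |
| LaSCo | 1 000 | 452 | 1 000 | 306 | 758 |
| CIRCO | 53 | 39 | 3 | 3 | 42 |
| Total | 2 324 | 1 006 | 2 417 | 683 | 1 689 |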

The Composition Gap

Once the subsets are built, we need to ask: does a retriever actually use both modalities? Following the paper, we define the normalised Composition Gap:

$$\mathrm{CompGap} \;=\; 1 \;-\; \dfrac{\max(I,\, T)}{\mathrm{MM}}$$

where $\mathrm{MM}$, $I$, and $T$ are the full-catalogue nDCG of the same retriever under the multimodal, image-only, and text-only inputs.

$\mathrm{CompGap}$ measures the fraction of ranking quality that cannot be recovered from either unimodal input. Larger values mean the retriever genuinely needs both modalities; values close to zero mean one modality is enough.

We use a normalised gap because absolute nDCG varies substantially across splits: the shortcut-free and validated subsets are harder, so the absolute multimodal score drops. Normalising by $\mathrm{MM}$ makes the gap comparable across splits. An MRR variant, $\mathrm{CompGap}_{\mathrm{MRR}} = 1 - \max(I_{\mathrm{MRR}}, T_{\mathrm{MRR}})/\mathrm{MM}_{\mathrm{MRR}}$, with all three scores computed as MRR instead of nDCG, follows the same trends and is reported in the appendix.
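
A minimal sketch of the metric, assuming the three scores (multimodal, image-only, text-only) have already been computed for one retriever on one split with the same ranking measure. The function name and the example values are hypothetical and are not taken from the results.

```python
def comp_gap(mm: float, image_only: float, text_only: float) -> float:
    """Normalised Composition Gap: 1 - max(I, T) / MM.

    All three inputs must use the same measure (nDCG, or MRR for the variant).
    """
    if mm <= 0:
        raise ValueError("multimodal score must be positive to normalise by it")
    return 1.0 - max(image_only, text_only) / mm


# Hypothetical scores: the best unimodal input recovers 80% of the
# multimodal ranking quality, so the gap is about 0.2.
print(round(comp_gap(mm=0.50, image_only=0.30, text_only=0.40), 3))  # 0.2
```

Note that the formula yields a negative value whenever a unimodal input scores higher than the multimodal one, so negative gaps are possible and indicate that adding the second modality hurt.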

Bar chart of retriever-averaged Composition Gap (nDCG) on Full, shortcut-free, and validated splits for each dataset.
Retriever-averaged $\mathrm{CompGap}$ on the original benchmark (Full), CIRCUS-SF (SF), and CIRCUS-V (V).

| Benchmark | Full | Shortcut-free (SF) | Validated (V) | Full → V |
|---|---|---|---|---|
| CIRR | 0.137 | 0.313 | 0.361 | +0.224 |
| FashionIQ | 0.298 | 0.378 | 0.477 | +0.179 |
| LaSCo | 0.069 | 0.079 | 0.209 | +0.140 |
| CIRCO | 0.470 | 0.569 | 0.562 | +0.092 |

Validated CIR queries depend on multimodal composition much more than the raw benchmarks suggest. CIRCO already had a high CompGap because its original composition-required core is relatively clean.

Recall@10 across the audited subsets

Recall@10 (%) on the original benchmark (Full), the shortcut-free subset (SF), and the validated subset (V). Image-only and text-only columns are omitted because, by construction, no unimodal configuration solves these queries within top-10. Best in column is bold, second-best underlined.

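| Retriever | CIRR Full | CIRR SF | CIRR V | FashionIQ Full | FashionIQ SF | FashionIQ V | LaSCo Full | LaSCo SF | LaSCo V | CIRCO Full | CIRCO SF | CIRCO V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E5-Omni | 55.2 | 11.8 | 15.2 | 16.9 | 6.9 | 12.5 | 9.4 | 0.5 | 2.3 | 65.9 | 39.3 | 38.1 |
| GME-Qwen2VL | 64.4 | 16.1 | 20.4 | 31.1 | 14.7 | 23.8 | 13.2 | 1.9 | 11.2 | 85.9 | 69.6 | 66.7 |
| LamRA | 65.1 | 18.1 | 22.6 | 33.2 | 17.5 | 31.1 | 12.3 | 1.3 | 7.5 | 81.4 | 69.6 | 66.7 |
| LamRA-Qwen2.5VL | 65.2 | 15.2 | 19.8 | 31.9 | 16.3 | 29.7 | 12.8 | 1.3 | 7.0 | 84.5 | 67.9 | 64.3 |
| MM-Embed | 63.3 | 16.2 | 20.1 | 25.5 | 11.3 | 19.6 | 18.2 | 3.1 | 19.0 | 82.7 | 58.9 | 61.9 |
| Qwen3-VL-2B | 64.6 | 13.7 | 17.6 | 28.3 | 11.7 | 24.7 | 11.2 | 1.6 | 10.6 | 78.2 | 57.1 | 59.5 |
| Qwen3-VL-8B | 70.1 | 17.2 | 24.1 | 32.6 | 14.7 | 28.5 | 13.3 | 2.1 | 13.5 | 86.4 | 71.4 | 71.4 |
| Rzen-Embed | 70.1 | 17.4 | 22.3 | 30.8 | 14.9 | 26.2 | 12.8 | 1.7 | 10.7 | 86.4 | 71.4 | 66.7 |
| VLM2Vec-V2 | 55.0 | 11.1 | 15.2 | 15.8 | 5.4 | 9.2 | 9.1 | 1.3 | 7.9 | 38.6 | 19.6 | 23.8 |
| Gemini Emb. 2 | 49.2 | 11.8 | 15.2 | 24.9 | 11.5 | 18.7 | 12.4 | 2.1 | 10.4 | 74.1 | 62.5 | 61.9 |
| Voyage MM-3.5 | 54.4 | 10.2 | 14.2 | 19.4 | 7.4 | 14.7 | 14.8 | 1.8 | 10.4 | 77.7 | 51.8 | 52.4 |
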
Recall@10 drops sharply once shortcut queries are removed: for example, Qwen3-VL-8B goes 70.1 → 17.2 → 24.1 on CIRR and 32.6 → 14.7 → 28.5 on FashionIQ (Full → SF → V).
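
For reference, Recall@10 on a subset can be computed directly from the stored ground-truth ranks. The sketch below assumes one ground-truth target per query and 1-based ranks. Benchmarks with several annotated targets per query would need a different hit rule, and the function name is illustrative rather than the one used in subset_evaluation/.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Percentage of queries whose ground-truth target is ranked in the top k."""
    if len(ranks) == 0:
        raise ValueError("need at least one query")
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


# Hypothetical multimodal ranks for five queries of one subset.
subset_ranks = [1, 4, 12, 250, 9]
print(recall_at_k(subset_ranks))  # 60.0
```

Restricting `ranks` to the query IDs of a given split (Full, SF, or V) gives the corresponding column for one retriever.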

Retriever pool

9 open-source + 2 commercial APIs:

  • E5-Omni · GME-Qwen2VL · LamRA · LamRA-Qwen2.5VL · MM-Embed
  • Qwen3-VL-Embedding (2B and 8B) · Rzen-Embed · VLM2Vec-V2
  • Gemini Embedding 2 · Voyage Multimodal 3.5

Reproduce in five stages

  1. Envs: one conda env per retriever family (envs/create_*.sh).
  2. Per-(dataset, retriever) retrieval: retrieval/run_generate_retrieval_data*.sh writes one JSON per pair with multimodal / text-only / image-only ranks.
  3. Aggregate: retrieval/aggregate_retrieval_data.py turns ranks into the audit labels (already shipped under shortcut_audit/).
  4. Validated subset: final_dataset/build_validated_subsets.py derives final_dataset/query_jsonl/ from annotations/users/.
  5. Re-evaluation: subset_evaluation/run_subset_eval.sh writes Recall@10 (and the two ablations, when requested) per (dataset, subset, retriever) to subset_evaluation/logs/results.jsonl.

Step-by-step instructions are in the repository README.

Citation

@inproceedings{circus,
  title  = {Do Composed Image Retrieval Benchmarks Require Multimodal Composition?},
  author = {Attimonelli, Matteo and De Bellis, Alessandro and Gema, Aryo Pradipta and
            Saxena, Rohit and Sekoyan, Monica and Kwan, Wai-Chung and Pomo, Claudio and
            Suglia, Alessandro and Jannach, Dietmar and Di Noia, Tommaso and
            Minervini, Pasquale},
  year   = {2026},
  eprint = {2605.14787},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url    = {https://arxiv.org/abs/2605.14787},
}