
FALCON: Functional Assembly and Language for Compositional Reasoning in X-ray

A multimodal framework for reasoning and grounding over dismantled IED components, component compatibility, and scene-level risk in cluttered baggage scans.

Yonathan Michael, Mohamad Alansari, Natnael Takele, Andreas Henschel, Naoufel Werghi
Khalifa University, Abu Dhabi, UAE
ECCV 2026

Abstract

Conventional vision-language models are largely object-centric, focusing on detecting and describing individual entities. In safety-critical X-ray baggage screening, however, risk often emerges not from a single object but from the functional compatibility of spatially dispersed, dismantled IED components such as batteries, detonators, and explosive charges. We formalize this setting as compositional threat reasoning, where risk is modeled as a relational property of grounded regions rather than an independent detection outcome. We introduce FALCON, a multimodal framework for reasoning and grounding over dismantled IED components. FALCON abstracts segmentation-aware region features into a structured safety state capturing component presence, pairwise functional compatibility, and scene-level risk. This structured representation is injected into the language model as an explicit intermediate interface, encouraging relationally consistent and safety-aware reasoning. We also present Falcon-X, a benchmark that evaluates dense grounding with structured supervision over dismantled component completeness, compatibility, and risk inference in cluttered X-ray imagery. Experiments show that existing multimodal models adapt to appearance but struggle with compositional safety reasoning, while FALCON improves functional grounding and produces more coherent scene-level risk assessments.

Beyond Object Lists

IED risk assessment cannot stop at isolated labels. FALCON grounds dismantled components and reasons about how they function together inside a cluttered scan.

Structured Safety State

Segmentation-aware region features are abstracted into dismantled component presence, pairwise compatibility, and scene-level risk.

Safety-Aware Language

The language model receives an explicit intermediate interface, encouraging grounded, relationally consistent explanations of component compatibility and risk.

Why FALCON?

Existing multimodal systems often describe visible objects or overfit to appearance. FALCON focuses on grounded dismantled IED components and their functional compatibility rather than a generic scene caption.

Methodology

Component Grounding

Dense X-ray regions are proposed and encoded so the language model can refer to grounded IED components rather than unstructured image tokens alone.
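One way to picture "encoding a region" is masked average pooling: the feature vectors that fall inside a region mask are averaged into a single region vector. This page does not say which pooling FALCON uses, so the sketch below is an illustrative assumption, and the name mask_pool and the toy feature map are hypothetical.

```python
from typing import List, Sequence


def mask_pool(features: Sequence[Sequence[Sequence[float]]],
              mask: Sequence[Sequence[int]]) -> List[float]:
    """Average the feature vectors at pixels where mask == 1.

    features is an H x W grid of C-dimensional vectors; mask is H x W of 0/1.
    Illustrative only: not taken from the FALCON implementation.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty region: return the zero vector
    return [t / count for t in total]


# Toy 2 x 2 feature map with 2 channels and a diagonal region mask.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
mask = [
    [1, 0],
    [0, 1],
]
region_vector = mask_pool(features, mask)
```

Each proposed region would yield one such vector, which the language model can then reference as a grounded unit rather than as loose image tokens.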

Compatibility State

Dismantled component presence and pairwise functional compatibility are represented as an intermediate safety state before final reasoning.
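The intermediate safety state can be thought of as a small record with three parts: which components are present, how compatible each present pair is, and a scene-level risk value. The sketch below is a minimal illustration under assumptions: the class name SafetyState, the component names, and in particular the risk aggregation rule (completeness times the weakest pairwise score) are hypothetical and are not described on this page.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, FrozenSet


@dataclass
class SafetyState:
    """Hypothetical container for the three parts of the safety state."""
    presence: Dict[str, bool]
    compatibility: Dict[FrozenSet[str], float] = field(default_factory=dict)
    risk: float = 0.0


def build_state(presence: Dict[str, bool],
                pair_scores: Dict[FrozenSet[str], float]) -> SafetyState:
    present = sorted(name for name, seen in presence.items() if seen)
    # Keep a compatibility score only for pairs whose members are both present.
    compat = {}
    for a, b in combinations(present, 2):
        pair = frozenset((a, b))
        compat[pair] = pair_scores.get(pair, 0.0)
    # Assumed aggregation rule, for illustration only:
    # fraction of components present times the weakest pairwise score.
    completeness = len(present) / len(presence) if presence else 0.0
    weakest = min(compat.values()) if compat else 0.0
    return SafetyState(dict(presence), compat, completeness * weakest)


presence = {"battery": True, "detonator": True, "explosive_charge": True}
pair_scores = {
    frozenset(("battery", "detonator")): 0.9,
    frozenset(("battery", "explosive_charge")): 0.7,
    frozenset(("detonator", "explosive_charge")): 0.6,
}
state = build_state(presence, pair_scores)
```

Keeping the pairwise scores explicit, instead of folding everything into one number, is what lets later reasoning point to the specific pair that drives the risk.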

Risk Inference

FALCON fuses visual regions, structured state tokens, and prompts to answer grounding, compatibility, completeness, and scene-level risk questions.
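As a rough sketch of how the three inputs might be combined, the example below serializes a safety state into text tokens and places them between region placeholders and the question. The token formats, the function names state_to_tokens and build_prompt, and the choice of plain-text tokens are all assumptions made for illustration; the actual model may inject the state as learned embeddings instead.

```python
from typing import Dict, List, Tuple


def state_to_tokens(presence: Dict[str, bool],
                    compatibility: Dict[Tuple[str, str], float],
                    risk: float) -> List[str]:
    """Serialize a safety state into hypothetical text tokens."""
    tokens = [f"<present:{name}={int(presence[name])}>"
              for name in sorted(presence)]
    for a, b in sorted(compatibility):
        tokens.append(f"<compat:{a}+{b}={compatibility[(a, b)]:.2f}>")
    tokens.append(f"<risk={risk:.2f}>")
    return tokens


def build_prompt(num_regions: int, state_tokens: List[str], question: str) -> str:
    """Concatenate region placeholders, state tokens, and the user question."""
    region_part = " ".join(f"<region_{i}>" for i in range(num_regions))
    state_part = " ".join(state_tokens)
    return f"{region_part} {state_part} Question: {question}"


tokens = state_to_tokens(
    presence={"battery": True, "detonator": True},
    compatibility={("battery", "detonator"): 0.5},
    risk=0.5,
)
prompt = build_prompt(2, tokens, "Is this scene risky?")
```

The same prompt layout can serve grounding, compatibility, completeness, and risk questions by changing only the question text.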

Falcon-X

Benchmark Comparison

Dataset # DC Ann. Tasks Av
Lab Box Msk Cap RefS RefP Pan VQA RL FG GG SA
GDXray (JNDE'15) 8,150 × × × × × × × × × × ×
SIXray (CVPR'19) 8,929 × × × × × × × × × × ×
OPIXray (ACMMM'20) 46,642 × × × × × × × × × × ×
HiXray (ICCV'21) 45,364 × × × × × × × × × × ×
CLCxray (ITIFS'22) 9,565 × × × × × × × × × × ×
PIXray (ITMM'22) 5,046 × × × × × × × × × ×
EDS (CVPR'22) 14,219 × × × × × × × × × ×
LPIXray (CIPAE'23) 60,950 × × × × × × × × × × × ×
PIDray (IJCV'23) 47,677 × × × × × × × × × ×
DvXray (ITIFS'24) 5,496 × × × × × × × × × × ×
STCray (CVPR'25) 46,642 × × × × × × × × ×
Falcon-X 6,911
DC: dismantled components; Lab: labels; Box: bounding boxes; Msk: masks; Cap: captions; RefS: referring segmentation; RefP: referring panoptic segmentation; Pan: panoptic segmentation; VQA: visual question answering; RL: region localization; FG: functional grounding; GG: grounded generation; SA: scene-level analysis; Av: available.

Quantitative Results

Scene Understanding

Method Image Captioning VQA
BLEU ROUGE-L CIDEr BLEU METEOR ROUGE-L CIDEr
Zero-shot
Sa2Va 25.06 18.11 0.012 2.88 3.25
Kosmos-2 16.26 17.56 0.002 2.83 5.47 3.79 0.003
LLaVA-1.5 12.67 15.67 2.77 7.64 3.39
GLaMM 15.91 17.56 0.0032 1.95 4.86 3.52 0.001
LISA 20.57 18.08 0.01 3.86 3.94
Groma 18.8 16.48 0.004 3.32 8.90 3.44
Sting-Bee 24.61 17.95 0.012 3.86 7.74 4.04 0.01
Fine-tuned
LISA 35.57 39.13 0.04 99.91 91.42 99.93 6.25
Sting-Bee 33.38 37.48 0.024 99.12 91.27 99.92 6.24
Groma 40.7 47.69 0.061 99.86 91.24 99.92 6.24
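FALCON 40.92 (+0.22) 47.74 (+0.05) 0.064 (+0.003) 99.93 (+0.02) 91.98 (+0.56) 99.95 (+0.02) 6.26 (+0.01)
Values in parentheses are differences from the best competing method in each column.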

Perception And Compositional Grounding

Method RS PS RPS RFG
cIoU mIoU cIoU mIoU cIoU mIoU cIoU mIoU
Zero-shot
Sa2Va 14.96 32.78 35.53 54.14 17.5 37.98 14.26 20.64
Kosmos-2 2.95 3.2 2.78 3.08 2.92 3.32 3.61 3.62
GLaMM 12.37 19.84 6.41 13.72 11.28 18.93 5.73 17.02
LISA 3.50 6.73 9.13 14.21 11.50 16.95 11.86 13.24
Groma 11.61 20.49 7.88 16.04 14.06 21.2 8.16 26.31
Sting-Bee 15.23 22.37 12.45 15.69 13.21 14.95 10.45 11.89
Fine-tuned
LISA 11.82 14.75 35.18 37.12 30.12 33.34 9.45 12.98
Sting-Bee 18.72 26.81 31.05 33.49 15.40 19.59 12.72 14.96
Groma 14.25 21.99 36.73 13.78 17.94 24.1 20.07 29.55
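FALCON 25.86 (+7.14) 35.39 (+2.61) 14.37 (-21.16) 38.02 (-16.12) 21.74 (-8.38) 35.30 (-2.68) 50.45 (+30.38) 69.58 (+40.03)
Values in parentheses are differences from the best competing method in each column, zero-shot or fine-tuned.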
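The cIoU and mIoU columns above are not defined on this page. In the referring-segmentation literature, cIoU usually means cumulative intersection over cumulative union across the whole test set, while mIoU is the mean of per-sample IoU. The sketch below assumes those definitions; treating an empty union as a perfect score of 1.0 is a further assumption.

```python
from typing import List, Sequence, Tuple


def iou_counts(pred: Sequence[int], gt: Sequence[int]) -> Tuple[int, int]:
    """Return (intersection, union) pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def ciou_miou(preds: List[Sequence[int]],
              gts: List[Sequence[int]]) -> Tuple[float, float]:
    total_inter = 0
    total_union = 0
    per_sample = []
    for pred, gt in zip(preds, gts):
        inter, union = iou_counts(pred, gt)
        total_inter += inter
        total_union += union
        # Assumed convention: two empty masks count as a perfect match.
        per_sample.append(inter / union if union else 1.0)
    ciou = total_inter / total_union if total_union else 1.0
    miou = sum(per_sample) / len(per_sample)
    return ciou, miou


# One large, half-correct mask and one small, perfect mask.
preds = [[1, 1, 1, 1, 0, 0, 0, 0], [1, 0]]
gts = [[1, 1, 1, 1, 1, 1, 1, 1], [1, 0]]
ciou, miou = ciou_miou(preds, gts)
```

In this toy case cIoU is 5/9 and mIoU is 0.75: cIoU is dominated by large regions, whereas mIoU weights every sample equally, which is why the two columns can rank methods differently.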

Semantic Grounding And Safety Relations

Method Semantic Grounding Safety Relationship
CPC MCI FC SRL PCS CLR
Acc F1 Acc F1 MAE RMSE MAE Acc F1 MAE
Zero-shot
Sa2Va 58.4 46.2 12.6 15.8 0.60 0.40 13.56 54.9
Kosmos-2 57.09 36.36 14.16 15.26 0.57 0.67 0.43 14.3 57.1
LLaVA-1.5 57.4 47.12 14.32 14.93 0.7 14.1 56.47
GLaMM 55.8 35.9 11.6 14.2 0.52 0.31 12.6 56.2
LISA 14.29 14.29 0.9 0.9
Groma 57.1 36.4 13.3 19.9 0.59 0.69 0.36 14.63 57.05
Sting-Bee 57.14 36.36 14.29 14.29 0.49 0.48 0.42 0.01 42.86
Fine-tuned
LISA 17.63 19.25 0.28 0.35 0.20 0.41 43.06 0.20
Sting-Bee 98 98 93.45 96.31 0.031 0.36 0.23 14.89 56.34 0.24
Groma 98 98 97.1 98.6 0.019 0.13 14.9 57.06
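FALCON 98.1 (+0.1) 97.97 (-0.03) 94.75 (-2.35) 97.33 (-1.27) 0.017 (-0.002) 0.09 (-0.04) 0.02 (-0.18) 15.1 (+0.2) 57.69 (+0.59) 0.005 (-0.195)
MAE and RMSE are lower-is-better metrics. Values in parentheses are differences from the best competing method in each column.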
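For reference, MAE and RMSE can be computed as in the short sketch below, using their standard definitions. The sample predictions and targets are made up for illustration and are not drawn from Falcon-X. Because both metrics measure error, a negative difference in the FALCON row for these columns is an improvement.

```python
import math
from typing import Sequence


def mae(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


def rmse(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))


# Made-up scores in [0, 1], for illustration only.
preds = [0.0, 0.5, 1.0]
targets = [0.0, 1.0, 0.0]
mae_value = mae(preds, targets)
rmse_value = rmse(preds, targets)
```

RMSE is never smaller than MAE on the same data, so a large gap between the two suggests a few large errors rather than uniformly small ones.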

Qualitative Results


Citation

@inproceedings{michael2026falcon,
    title={FALCON: Functional Assembly and Language for Compositional Reasoning in X-ray},
    author={Michael, Yonathan and Alansari, Mohamad and Takele, Natnael and Henschel, Andreas and Werghi, Naoufel},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2026}
}