Beyond Object Lists
IED risk assessment cannot stop at isolated labels. FALCON grounds dismantled components and reasons about how they function together inside a cluttered scan.
A multimodal framework for reasoning and grounding over dismantled IED components, component compatibility, and scene-level risk in cluttered baggage scans.
Conventional vision-language models are largely object-centric, focusing on detecting and describing individual entities. In safety-critical X-ray baggage screening, however, risk often emerges not from a single object but from the functional compatibility of spatially dispersed, dismantled IED components such as batteries, detonators, and explosive charges. We formalize this setting as compositional threat reasoning, where risk is modeled as a relational property of grounded regions rather than an independent detection outcome. We introduce FALCON, a multimodal framework for reasoning and grounding over dismantled IED components. FALCON abstracts segmentation-aware region features into a structured safety state capturing component presence, pairwise functional compatibility, and scene-level risk. This structured representation is injected into the language model as an explicit intermediate interface, encouraging relationally consistent and safety-aware reasoning. The Falcon-X dataset evaluates dense grounding with structured supervision over dismantled-component completeness, compatibility, and risk inference in cluttered X-ray imagery. Experiments show that existing multimodal models adapt to appearance but struggle with compositional safety reasoning, while FALCON improves functional grounding and produces more coherent scene-level risk assessments.
Segmentation-aware region features are abstracted into dismantled component presence, pairwise compatibility, and scene-level risk.
The language model receives an explicit intermediate interface, encouraging grounded, relationally consistent explanations of component compatibility and risk.
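One way to picture the structured safety state is as a small record with three fields. The sketch below is illustrative only: the class name `SafetyState`, the `build_state` helper, the component vocabulary, the 0.5 threshold, and the rule that scene risk equals the strongest pairwise compatibility are all assumptions made for this example, not FALCON's actual implementation.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical vocabulary, taken from the component examples named in the text.
COMPONENTS = ("battery", "detonator", "explosive_charge")


@dataclass
class SafetyState:
    presence: dict        # component name -> True if found in the scan
    compatibility: dict   # (component_a, component_b) -> score in [0, 1]
    risk: float           # scene-level risk in [0, 1]


def build_state(region_scores, pair_scores, threshold=0.5):
    """Abstract per-region scores into a SafetyState.

    region_scores: list of (component, confidence), one per grounded region.
    pair_scores:   dict mapping an alphabetically ordered component pair
                   to a compatibility score.
    """
    # A component counts as present if any of its regions clears the threshold.
    presence = {c: False for c in COMPONENTS}
    for component, confidence in region_scores:
        if confidence >= threshold:
            presence[component] = True

    # Keep compatibility only for pairs whose members are both present.
    compatibility = {}
    for a, b in combinations(sorted(COMPONENTS), 2):
        if presence[a] and presence[b]:
            compatibility[(a, b)] = pair_scores.get((a, b), 0.0)

    # Assumed aggregation: risk is the strongest pairwise compatibility.
    risk = max(compatibility.values(), default=0.0)
    return SafetyState(presence, compatibility, risk)


state = build_state(
    region_scores=[("battery", 0.91), ("detonator", 0.78), ("explosive_charge", 0.12)],
    pair_scores={("battery", "detonator"): 0.84},
)
print(state.presence)
print(state.compatibility)
print(state.risk)
```

The point of the record is that risk is derived from relations between grounded components rather than from any single detection, which mirrors the framing of risk as a relational property.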
Existing multimodal systems often describe visible objects or overfit to appearance. FALCON focuses on grounded dismantled IED components and their functional compatibility rather than a generic scene caption.
Dense X-ray regions are proposed and encoded so the language model can refer to grounded IED components rather than unstructured image tokens alone.
Dismantled component presence and pairwise functional compatibility are represented as an intermediate safety state before final reasoning.
FALCON fuses visual regions, structured state tokens, and prompts to answer grounding, compatibility, completeness, and scene-level risk questions.
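The fusion step can be pictured as assembling one input sequence from three sources. The sketch below works on plain strings purely for illustration; the tag formats (`<comp:...>`, `<compat:...>`, `<risk=...>`, `<region_i>`) and both function names are hypothetical, and a real model would presumably operate on learned embeddings rather than text tags.

```python
def state_to_tokens(presence, compatibility, risk):
    """Render a safety state as a flat string of tagged tokens."""
    parts = []
    for name in sorted(presence):
        flag = "present" if presence[name] else "absent"
        parts.append(f"<comp:{name}={flag}>")
    for a, b in sorted(compatibility):
        parts.append(f"<compat:{a}+{b}={compatibility[(a, b)]:.2f}>")
    parts.append(f"<risk={risk:.2f}>")
    return " ".join(parts)


def build_prompt(region_tags, state_tokens, question):
    """Concatenate region placeholders, state tokens, and the user question."""
    regions = " ".join(region_tags)
    return f"{regions}\n{state_tokens}\nQuestion: {question}"


tokens = state_to_tokens(
    presence={"battery": True, "detonator": True, "explosive_charge": False},
    compatibility={("battery", "detonator"): 0.84},
    risk=0.84,
)
prompt = build_prompt(
    region_tags=["<region_0>", "<region_1>"],
    state_tokens=tokens,
    question="Are the detected components functionally compatible?",
)
print(prompt)
```

Keeping the state as its own segment between the regions and the question makes it an explicit interface: the answer can be checked against the stated presence, compatibility, and risk values.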
Columns Lab through SA are the annotation (Ann.) and task (Tasks) columns of the comparison; DC and Av are standalone columns.

| Dataset | # | DC | Lab | Box | Msk | Cap | RefS | RefP | Pan | VQA | RL | FG | GG | SA | Av |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GDXray (JNDE'15) | 8,150 | × | ✓ | ✓ | × | × | × | × | × | × | × | × | × | × | ✓ |
| SIXray (CVPR'19) | 8,929 | × | ✓ | ✓ | × | × | × | × | × | × | × | × | × | × | ✓ |
| OPIXray (ACMMM'20) | 46,642 | × | ✓ | ✓ | × | × | × | × | × | × | × | × | × | × | ✓ |
| HiXray (ICCV'21) | 45,364 | × | ✓ | ✓ | × | × | × | × | × | × | × | × | × | × | ✓ |
| CLCxray (ITIFS'22) | 9,565 | × | ✓ | ✓ | × | × | × | × | × | × | × | × | × | × | ✓ |
| PIXray (ITMM'22) | 5,046 | × | ✓ | ✓ | ✓ | × | × | × | × | × | × | × | × | × | ✓ |
| EDS (CVPR'22) | 14,219 | × | ✓ | ✓ | × | ✓ | × | × | × | × | × | × | × | × | ✓ |
| LPIXray (CIPAE'23) | 60,950 | × | ✓ | ✓ | × | × | × | × | × | × | × | × | × | × | × |
| PIDray (IJCV'23) | 47,677 | × | ✓ | ✓ | ✓ | × | × | × | × | × | × | × | × | × | ✓ |
| DvXray (ITIFS'24) | 5,496 | × | ✓ | ✓ | × | × | × | × | × | × | × | × | × | × | ✓ |
| STCray (CVPR'25) | 46,642 | × | ✓ | ✓ | ✓ | ✓ | × | × | × | × | × | × | × | × | ✓ |
| Falcon-X | 6,911 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

| Method | Captioning BLEU | Captioning ROUGE-L | Captioning CIDEr | VQA BLEU | VQA METEOR | VQA ROUGE-L | VQA CIDEr |
|---|---|---|---|---|---|---|---|
| Zero-shot | |||||||
| Sa2Va | 25.06 | 18.11 | 0.012 | 2.88 | – | 3.25 | – |
| Kosmos-2 | 16.26 | 17.56 | 0.002 | 2.83 | 5.47 | 3.79 | 0.003 |
| Llava-1.5 | 12.67 | 15.67 | – | 2.77 | 7.64 | 3.39 | – |
| GLaMM | 15.91 | 17.56 | 0.0032 | 1.95 | 4.86 | 3.52 | 0.001 |
| LISA | 20.57 | 18.08 | 0.01 | 3.86 | – | 3.94 | – |
| Groma | 18.8 | 16.48 | 0.004 | 3.32 | 8.90 | 3.44 | – |
| Sting-Bee | 24.61 | 17.95 | 0.012 | 3.86 | 7.74 | 4.04 | 0.01 |
| Fine-tuned | |||||||
| LISA | 35.57 | 39.13 | 0.04 | 99.91 | 91.42 | 99.93 | 6.25 |
| Sting-Bee | 33.38 | 37.48 | 0.024 | 99.12 | 91.27 | 99.92 | 6.24 |
| Groma | 40.7 | 47.69 | 0.061 | 99.86 | 91.24 | 99.92 | 6.24 |
| FALCON | 40.92 (+0.22) | 47.74 (+0.05) | 0.064 (+0.003) | 99.93 (+0.02) | 91.98 (+0.56) | 99.95 (+0.02) | 6.26 (+0.01) |

| Method | RS cIoU | RS mIoU | PS cIoU | PS mIoU | RPS cIoU | RPS mIoU | RFG cIoU | RFG mIoU |
|---|---|---|---|---|---|---|---|---|
| Zero-shot | ||||||||
| Sa2Va | 14.96 | 32.78 | 35.53 | 54.14 | 17.5 | 37.98 | 14.26 | 20.64 |
| Kosmos-2 | 2.95 | 3.2 | 2.78 | 3.08 | 2.92 | 3.32 | 3.61 | 3.62 |
| GLaMM | 12.37 | 19.84 | 6.41 | 13.72 | 11.28 | 18.93 | 5.73 | 17.02 |
| LISA | 3.50 | 6.73 | 9.13 | 14.21 | 11.50 | 16.95 | 11.86 | 13.24 |
| Groma | 11.61 | 20.49 | 7.88 | 16.04 | 14.06 | 21.2 | 8.16 | 26.31 |
| Sting-Bee | 15.23 | 22.37 | 12.45 | 15.69 | 13.21 | 14.95 | 10.45 | 11.89 |
| Fine-tuned | ||||||||
| LISA | 11.82 | 14.75 | 35.18 | 37.12 | 30.12 | 33.34 | 9.45 | 12.98 |
| Sting-Bee | 18.72 | 26.81 | 31.05 | 33.49 | 15.40 | 19.59 | 12.72 | 14.96 |
| Groma | 14.25 | 21.99 | 36.73 | 13.78 | 17.94 | 24.1 | 20.07 | 29.55 |
| FALCON | 25.86 (+7.14) | 35.39 (+2.61) | 14.37 (-21.16) | 38.02 (-16.12) | 21.74 (-8.38) | 35.30 (-2.68) | 50.45 (+30.38) | 69.58 (+40.03) |
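
The page does not define cIoU and mIoU, so the sketch below assumes the definitions commonly used in referring segmentation: cIoU divides the intersection summed over all images by the union summed over all images, while mIoU averages the per-image IoU. The function name `iou_stats`, the flat 0/1 mask format, and the handling of empty masks are choices made for this example.

```python
def iou_stats(pred_masks, gt_masks):
    """Return (cIoU, mIoU) in percent for paired binary masks.

    Each mask is a flat sequence of 0/1 pixel labels of equal length.
    """
    total_inter = 0
    total_union = 0
    per_image = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = sum(1 for p, g in zip(pred, gt) if p and g)
        union = sum(1 for p, g in zip(pred, gt) if p or g)
        total_inter += inter
        total_union += union
        # Assumed convention: two empty masks count as a perfect match.
        per_image.append(inter / union if union else 1.0)
    ciou = 100.0 * total_inter / total_union if total_union else 100.0
    miou = 100.0 * sum(per_image) / len(per_image)
    return ciou, miou


# One half-correct small prediction and one perfect large prediction.
preds = [[1, 1, 0, 0], [1, 1, 1, 1]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1]]
ciou, miou = iou_stats(preds, gts)
print(f"cIoU={ciou:.2f} mIoU={miou:.2f}")
```

Because cIoU pools pixels across images, large regions dominate it, whereas mIoU weights every image equally. That is one reason the two columns can diverge for the same method.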

The task columns fall under two groups, Semantic Grounding and Safety Relationship.

| Method | CPC Acc | CPC F1 | MCI Acc | MCI F1 | FC MAE | FC RMSE | SRL MAE | PCS Acc | PCS F1 | CLR MAE |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | ||||||||||
| Sa2Va | 58.4 | 46.2 | 12.6 | 15.8 | 0.60 | – | 0.40 | 13.56 | 54.9 | – |
| Kosmos-2 | 57.09 | 36.36 | 14.16 | 15.26 | 0.57 | 0.67 | 0.43 | 14.3 | 57.1 | – |
| Llava-1.5 | 57.4 | 47.12 | 14.32 | 14.93 | 0.7 | – | – | 14.1 | 56.47 | – |
| GLaMM | 55.8 | 35.9 | 11.6 | 14.2 | 0.52 | – | 0.31 | 12.6 | 56.2 | – |
| LISA | – | – | 14.29 | 14.29 | 0.9 | 0.9 | – | – | – | – |
| Groma | 57.1 | 36.4 | 13.3 | 19.9 | 0.59 | 0.69 | 0.36 | 14.63 | 57.05 | – |
| Sting-Bee | 57.14 | 36.36 | 14.29 | 14.29 | 0.49 | 0.48 | 0.42 | 0.01 | 42.86 | – |
| Fine-tuned | ||||||||||
| LISA | – | – | 17.63 | 19.25 | 0.28 | 0.35 | 0.20 | 0.41 | 43.06 | 0.20 |
| Sting-Bee | 98 | 98 | 93.45 | 96.31 | 0.031 | 0.36 | 0.23 | 14.89 | 56.34 | 0.24 |
| Groma | 98 | 98 | 97.1 | 98.6 | 0.019 | 0.13 | – | 14.9 | 57.06 | – |
| FALCON | 98.1 (+0.1) | 97.97 (-0.03) | 94.75 (-2.35) | 97.33 (-1.27) | 0.017 (-0.002) | 0.09 (-0.04) | 0.02 (-0.18) | 15.1 (+0.2) | 57.69 (+0.59) | 0.005 (-0.195) |
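
For the error columns, lower is better, so a negative change is an improvement. A minimal sketch of the two error metrics follows, using their standard definitions; the sample predictions and targets are invented for illustration and are not drawn from Falcon-X.

```python
import math


def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


# Hypothetical compatibility scores: predicted versus annotated.
pred = [0.9, 0.2, 0.6, 0.1]
target = [1.0, 0.0, 0.5, 0.5]

print(f"MAE={mae(pred, target):.3f} RMSE={rmse(pred, target):.3f}")
```

RMSE penalizes large individual errors more heavily than MAE, and on the same set of predictions it is never smaller than MAE.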
@inproceedings{michael2026falcon,
title={FALCON: Functional Assembly and Language for Compositional Reasoning in X-ray},
author={Michael, Yonathan and Alansari, Mohamad and Takele, Natnael and Henschel, Andreas and Werghi, Naoufel},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2026}
}