FALCON: Functional Assembly and Language for Compositional Reasoning in X-ray

Abstract

Conventional vision-language models are largely object-centric, focusing on detecting and describing individual entities. In safety-critical X-ray baggage screening, however, risk often emerges not from a single object but from the functional compatibility of spatially dispersed components of a dismantled improvised explosive device (IED), such as batteries, detonators, and explosive charges. We formalize this setting as compositional threat reasoning, where risk is modeled as a relational property of grounded regions rather than an independent detection outcome. We introduce FALCON, a multimodal framework for reasoning and grounding over dismantled IED components. FALCON abstracts segmentation-aware region features into a structured safety state capturing component presence, pairwise functional compatibility, and scene-level risk. This structured representation is injected into the language model as an explicit intermediate interface, encouraging relationally consistent and safety-aware reasoning. We also introduce Falcon-X, a benchmark that evaluates dense grounding with structured supervision over dismantled-component completeness, compatibility, and risk inference in cluttered X-ray imagery. Experiments show that existing multimodal models adapt to appearance but struggle with compositional safety reasoning, while FALCON improves functional grounding and produces more coherent scene-level risk assessments.

Beyond Object Lists

IED risk assessment cannot stop at isolated labels. FALCON grounds dismantled components and reasons about how they function together inside a cluttered scan.

Structured Safety State

Segmentation-aware region features are abstracted into dismantled component presence, pairwise compatibility, and scene-level risk.
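
As a rough illustration of the three levels named above, the state can be pictured as a small record. This is a minimal sketch, not FALCON's actual representation: the class name `SafetyState`, its field names, and the score ranges are assumptions made for illustration, and the component names are taken from the examples in the abstract.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass
class SafetyState:
    """Hypothetical container for the three levels of a structured safety state."""
    # Component name -> whether it was grounded in the scan.
    presence: Dict[str, bool]
    # Unordered component pair -> functional compatibility score (assumed in [0, 1]).
    compatibility: Dict[Tuple[str, str], float] = field(default_factory=dict)
    # Scene-level risk (assumed in [0, 1]).
    scene_risk: float = 0.0

    def present_pairs(self) -> List[Tuple[str, str]]:
        """Return all unordered pairs of components that are both present."""
        found = sorted(name for name, is_present in self.presence.items() if is_present)
        return list(combinations(found, 2))


state = SafetyState(
    presence={"battery": True, "detonator": True, "explosive_charge": False},
    compatibility={("battery", "detonator"): 0.9},
    scene_risk=0.6,
)
print(state.present_pairs())  # [('battery', 'detonator')]
```

Keeping presence, pairwise scores, and scene risk as separate fields mirrors the description above, where risk is a property of relations between regions and not of any single detection.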

Safety-Aware Language

The language model receives an explicit intermediate interface, encouraging grounded, relationally consistent explanations of component compatibility and risk.
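
One way to picture an explicit intermediate interface is to render the state as a delimited text block that precedes the question. The source does not say whether FALCON injects the state as text or as learned tokens, so the plain-text format, the `[SAFETY_STATE]` delimiters, and the function name `serialize_state` below are all assumptions.

```python
def serialize_state(presence, compatibility, scene_risk):
    """Render a safety state as a delimited text block (hypothetical format)."""
    lines = ["[SAFETY_STATE]"]
    # One line per component, in a stable alphabetical order.
    for name in sorted(presence):
        status = "present" if presence[name] else "absent"
        lines.append(f"component {name}: {status}")
    # One line per scored pair.
    for a, b in sorted(compatibility):
        lines.append(f"compatibility {a}-{b}: {compatibility[(a, b)]:.2f}")
    lines.append(f"scene_risk: {scene_risk:.2f}")
    lines.append("[/SAFETY_STATE]")
    return "\n".join(lines)


text = serialize_state(
    presence={"battery": True, "detonator": False},
    compatibility={("battery", "detonator"): 0.25},
    scene_risk=0.1,
)
prompt = text + "\nQuestion: Do the grounded components form a functional threat?"
print(prompt)
```

A fixed, sorted layout makes the interface deterministic, so the same state always yields the same prompt prefix.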

Why FALCON?

Existing multimodal systems often describe visible objects or overfit to appearance. FALCON focuses on grounded dismantled IED components and their functional compatibility rather than a generic scene caption.

Methodology

Component Grounding

Dense X-ray regions are proposed and encoded so the language model can refer to grounded IED components rather than unstructured image tokens alone.
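
A common way to turn a segmentation mask into a single region feature is masked average pooling. The source does not state which pooling FALCON uses, so the sketch below illustrates the general technique only, with nested Python lists standing in for tensors.

```python
def masked_average_pool(feature_map, mask):
    """Average the feature vectors at positions where the mask is 1.

    feature_map: H x W grid of C-dimensional feature vectors (lists of floats).
    mask:        H x W grid of 0/1 values marking one region.
    Returns a C-dimensional list; an empty region yields a zero vector.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total
    return [t / count for t in total]


# 2 x 2 feature map with 2 channels; the region covers the top row only.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
region = [
    [1, 1],
    [0, 0],
]
pooled = masked_average_pool(features, region)
print(pooled)  # [2.0, 1.0]
```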

Compatibility State

Dismantled component presence and pairwise functional compatibility are represented as an intermediate safety state before final reasoning.
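
To make the idea of an intermediate state concrete, the sketch below derives a completeness value and a scene-level score from presence flags and pairwise scores. The aggregation rule (completeness multiplied by the weakest pairwise compatibility) is a hypothetical stand-in chosen for readability; the source does not describe how FALCON computes these quantities, and the `REQUIRED` component list is likewise an assumption based on the abstract's examples.

```python
from itertools import combinations

# Hypothetical set of component types needed for a functional device.
REQUIRED = ("battery", "detonator", "explosive_charge")


def completeness(presence):
    """Fraction of required component types that are present."""
    found = sum(1 for name in REQUIRED if presence.get(name, False))
    return found / len(REQUIRED)


def scene_risk(presence, pair_scores):
    """Illustrative rule: completeness times the weakest pairwise compatibility.

    pair_scores maps pairs (in REQUIRED order) to scores assumed to lie in [0, 1].
    Missing pairs count as 0.0, and a scene with fewer than two components scores 0.0.
    """
    present = [name for name in REQUIRED if presence.get(name, False)]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    weakest = min(pair_scores.get(pair, 0.0) for pair in pairs)
    return completeness(presence) * weakest


full_scene = scene_risk(
    {"battery": True, "detonator": True, "explosive_charge": True},
    {
        ("battery", "detonator"): 0.9,
        ("battery", "explosive_charge"): 0.8,
        ("detonator", "explosive_charge"): 0.5,
    },
)
partial_scene = scene_risk(
    {"battery": True, "detonator": True, "explosive_charge": False},
    {("battery", "detonator"): 0.9},
)
print(full_scene, partial_scene)
```

Under this rule a single component never produces risk on its own, which matches the framing of risk as a relational property of grounded regions.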

Risk Inference

FALCON fuses visual regions, structured state tokens, and prompts to answer grounding, compatibility, completeness, and scene-level risk questions.

Falcon-X

Benchmark Comparison

Dataset	#	DC	Ann.				Tasks								Av
Dataset	#	DC	Lab	Box	Msk	Cap	RefS	RefP	Pan	VQA	RL	FG	GG	SA	Av
GDXray (JNDE'15)	8,150	×	✓	✓	×	×	×	×	×	×	×	×	×	×	✓
SIXray (CVPR'19)	8,929	×	✓	✓	×	×	×	×	×	×	×	×	×	×	✓
OPIXray (ACMMM'20)	8,885	×	✓	✓	×	×	×	×	×	×	×	×	×	×	✓
HiXray (ICCV'21)	45,364	×	✓	✓	×	×	×	×	×	×	×	×	×	×	✓
CLCXray (ITIFS'22)	9,565	×	✓	✓	×	×	×	×	×	×	×	×	×	×	✓
PIXray (ITMM'22)	5,046	×	✓	✓	✓	×	×	×	×	×	×	×	×	×	✓
EDS (CVPR'22)	14,219	×	✓	✓	×	✓	×	×	×	×	×	×	×	×	✓
LPIXray (CIPAE'23)	60,950	×	✓	✓	×	×	×	×	×	×	×	×	×	×	×
PIDray (IJCV'23)	47,677	×	✓	✓	✓	×	×	×	×	×	×	×	×	×	✓
DvXray (ITIFS'24)	5,496	×	✓	✓	×	×	×	×	×	×	×	×	×	×	✓
STCray (CVPR'25)	46,642	×	✓	✓	✓	✓	×	×	×	×	×	×	×	×	✓
Falcon-X	6,911	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓

DC: dismantled components; Ann.: annotation types; Lab: labels; Box: bounding boxes; Msk: masks; Cap: captions; RefS: referring segmentation; RefP: referring panoptic segmentation; Pan: panoptic segmentation; VQA: visual question answering; RL: region localization; FG: functional grounding; GG: grounded generation; SA: scene-level analysis; Av: available.

Quantitative Results

Scene Understanding

Method	Image Captioning			VQA
Method	BLEU	ROUGE-L	CIDEr	BLEU	METEOR	ROUGE-L	CIDEr
Zero-shot
Sa2Va	25.06	18.11	0.012	2.88	–	3.25	–
Kosmos-2	16.26	17.56	0.002	2.83	5.47	3.79	0.003
LLaVA-1.5	12.67	15.67	–	2.77	7.64	3.39	–
GLaMM	15.91	17.56	0.0032	1.95	4.86	3.52	0.001
LISA	20.57	18.08	0.01	3.86	–	3.94	–
Groma	18.8	16.48	0.004	3.32	8.90	3.44	–
Sting-Bee	24.61	17.95	0.012	3.86	7.74	4.04	0.01
Fine-tuned
LISA	35.57	39.13	0.04	99.91	91.42	99.93	6.25
Sting-Bee	33.38	37.48	0.024	99.12	91.27	99.92	6.24
Groma	40.7	47.69	0.061	99.86	91.24	99.92	6.24
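FALCON	40.92 (+0.22)	47.74 (+0.05)	0.064 (+0.003)	99.93 (+0.02)	91.98 (+0.56)	99.95 (+0.02)	6.26 (+0.01)

Values in parentheses give FALCON's difference from the best competing method in each column.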

Perception And Compositional Grounding

Method	RS		PS		RPS		RFG
Method	cIoU	mIoU	cIoU	mIoU	cIoU	mIoU	cIoU	mIoU
Zero-shot
Sa2Va	14.96	32.78	35.53	54.14	17.5	37.98	14.26	20.64
Kosmos-2	2.95	3.2	2.78	3.08	2.92	3.32	3.61	3.62
GLaMM	12.37	19.84	6.41	13.72	11.28	18.93	5.73	17.02
LISA	3.50	6.73	9.13	14.21	11.50	16.95	11.86	13.24
Groma	11.61	20.49	7.88	16.04	14.06	21.2	8.16	26.31
Sting-Bee	15.23	22.37	12.45	15.69	13.21	14.95	10.45	11.89
Fine-tuned
LISA	11.82	14.75	35.18	37.12	30.12	33.34	9.45	12.98
Sting-Bee	18.72	26.81	31.05	33.49	15.40	19.59	12.72	14.96
Groma	14.25	21.99	36.73	13.78	17.94	24.1	20.07	29.55
FALCON	25.86 (+7.14)	35.39 (+2.61)	14.37 (-21.16)	38.02 (-16.12)	21.74 (-8.38)	35.30 (-2.68)	50.45 (+30.38)	69.58 (+40.03)
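
The cIoU and mIoU columns can disagree sharply for the same method, and the usual definitions explain why. The sketch below assumes the conventions common in referring segmentation: cIoU divides the intersection summed over all samples by the union summed over all samples, while mIoU averages the per-sample IoU. The source does not spell out its exact definitions, and treating an empty prediction on an empty target as IoU 1.0 is an assumption of this sketch.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two flat 0/1 masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def ciou_and_miou(preds, gts):
    """Compute cumulative IoU and mean per-sample IoU over a list of mask pairs."""
    total_inter = 0
    total_union = 0
    per_sample = []
    for pred, gt in zip(preds, gts):
        inter, union = iou_counts(pred, gt)
        total_inter += inter
        total_union += union
        # Assumed convention: empty prediction on empty target counts as a perfect match.
        per_sample.append(inter / union if union else 1.0)
    ciou = total_inter / total_union if total_union else 1.0
    miou = sum(per_sample) / len(per_sample)
    return ciou, miou


# A large region segmented perfectly and a small region missed entirely.
preds = [[1] * 8, [0, 1]]
gts = [[1] * 8, [1, 0]]
ciou, miou = ciou_and_miou(preds, gts)
print(ciou, miou)  # 0.8 0.5
```

Because cIoU pools pixels, large regions dominate it, whereas mIoU weights every sample equally, so small missed components lower mIoU far more than cIoU.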

Semantic Grounding And Safety Relations

Method	Semantic Grounding						Safety Relationship
	CPC		MCI		FC		SRL	PCS		CLR
	Acc	F1	Acc	F1	MAE	RMSE	MAE	Acc	F1	MAE
Zero-shot
Sa2Va	58.4	46.2	12.6	15.8	0.60	–	0.40	13.56	54.9	–
Kosmos-2	57.09	36.36	14.16	15.26	0.57	0.67	0.43	14.3	57.1	–
LLaVA-1.5	57.4	47.12	14.32	14.93	0.7	–	–	14.1	56.47	–
GLaMM	55.8	35.9	11.6	14.2	0.52	–	0.31	12.6	56.2	–
LISA	–	–	14.29	14.29	0.9	0.9	–	–	–	–
Groma	57.1	36.4	13.3	19.9	0.59	0.69	0.36	14.63	57.05	–
Sting-Bee	57.14	36.36	14.29	14.29	0.49	0.48	0.42	0.01	42.86	–
Fine-tuned
LISA	–	–	17.63	19.25	0.28	0.35	0.20	0.41	43.06	0.20
Sting-Bee	98	98	93.45	96.31	0.031	0.36	0.23	14.89	56.34	0.24
Groma	98	98	97.1	98.6	0.019	0.13	–	14.9	57.06	–
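FALCON	98.1 (+0.1)	97.97 (-0.03)	94.75 (-2.35)	97.33 (-1.27)	0.017 (-0.002)	0.09 (-0.04)	0.02 (-0.18)	15.1 (+0.2)	57.69 (+0.59)	0.005 (-0.195)

MAE and RMSE are lower-is-better metrics, so a negative difference in those columns is an improvement. Values in parentheses give FALCON's difference from the best competing method in each column.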
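
For reference, the two error metrics follow their standard definitions, sketched below on made-up numbers that are not taken from the tables. For the same set of predictions RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.

```python
import math


def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


# Illustrative scores only: two exact predictions and one error of 0.6.
predicted = [0.0, 0.5, 1.0]
reference = [0.0, 0.5, 0.4]
print(round(mae(predicted, reference), 3))   # 0.2
print(round(rmse(predicted, reference), 3))  # 0.346
```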

Qualitative Results

[Image gallery: 3 qualitative result figures]

Citation

@inproceedings{michael2026falcon,
    title={FALCON: Functional Assembly and Language for Compositional Reasoning in X-ray},
    author={Michael, Yonathan and Alansari, Mohamad and Takele, Natnael and Henschel, Andreas and Werghi, Naoufel},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2026}
}