Kickoff event has moved from the 13-07 to the 10-08.

Metrics & Ranking

The central difficulty in evaluating CBCT reconstruction is the absence of a perfectly aligned ground truth. A clinical planning CT and the CBCT it is compared against are acquired days — sometimes weeks — apart, so the underlying anatomy has genuinely changed. Deformable image registration (using Elastix with an Impact loss) warps the planning CT onto the CBCT anatomy to narrow this gap, but residual mismatch always remains, particularly in deformable soft tissue. Synthetic data sidesteps this by providing known ground truth, but a simulation can never fully reproduce the artifacts of a real scan.

COBRA's evaluation is built around this reality. Rather than trust any single metric against an imperfect reference, we score each reconstruction across three complementary metric groups, each chosen to capture something the others cannot. All evaluation code will be made openly available at github.com/cobra-challenge-2026.

Where indicated, metrics are computed within the field-of-view mask provided with each case; in cases where contrast media was administered during the CT. The code that generates this evaluation region is distributed with the data.

The Metric Groups

1. Intensity Accuracy — vs. deformed planning CT

These metrics ask whether the reconstructed Hounsfield units are correct, which is the foundation of any downstream dose calculation — dose engines read tissue attenuation directly from HU. They are computed against the deformably-registered planning CT within the field-of-view mask.

Mean Absolute Error (MAE) — the mean absolute voxel-wise difference in Hounsfield units. Lower is better. The most direct measure of HU fidelity.
Peak Signal-to-Noise Ratio (PSNR) — in dB, relative to the typical HU range. Higher is better. Rewards reconstructions that are globally faithful and penalizes large intensity errors.
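
As a minimal sketch of how these two metrics could be computed inside the field-of-view mask, assuming NumPy arrays in Hounsfield units: the function names and the `data_range` parameter are illustrative and not taken from the COBRA evaluation code. The HU range the challenge uses for PSNR is not stated here, so the value below is a placeholder.

```python
import numpy as np


def masked_mae(recon_hu, reference_hu, mask):
    """Mean absolute HU difference over voxels where mask is True."""
    diff = recon_hu[mask].astype(np.float64) - reference_hu[mask].astype(np.float64)
    return float(np.mean(np.abs(diff)))


def masked_psnr(recon_hu, reference_hu, mask, data_range):
    """PSNR in dB over masked voxels; data_range is the assumed HU span."""
    diff = recon_hu[mask].astype(np.float64) - reference_hu[mask].astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")  # identical images inside the mask
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Toy 2x2x2 volume: the reconstruction is offset by a uniform 10 HU.
reference = np.zeros((2, 2, 2))
recon = reference + 10.0
mask = np.ones(reference.shape, dtype=bool)
mask[0, 0, 0] = False  # pretend one voxel lies outside the field of view

mae = masked_mae(recon, reference, mask)
psnr = masked_psnr(recon, reference, mask, data_range=1000.0)  # placeholder range
print(f"MAE = {mae:.1f} HU, PSNR = {psnr:.1f} dB")
```

Restricting both metrics to the same mask keeps voxels outside the field of view from inflating or deflating either score.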

Limitation this group cannot escape: it compares against an imperfectly-aligned CT, so some residual error reflects registration mismatch rather than reconstruction quality — which is exactly why the next two groups exist.

2. Geometric Consistency — vs. reconstructed CBCT

These metrics check that the reconstruction is spatially faithful to the original CBCT anatomy, catching distortion, warping, or hallucinated structure independently of absolute intensity. Because they compare the reconstruction against the CBCT it was derived from (reconstructed via RTK for real and simulated cases), they are unaffected by the CT–CBCT registration mismatch above.

Normalized Cross-Correlation (NCC) — measures linear intensity correspondence and structural alignment. Higher is better.
Normalized Mutual Information (NMI) — measures shared information between images without assuming a linear intensity relationship, making it robust to the intensity differences expected between a reconstruction and its source CBCT. Higher is better.
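
The difference between the two metrics can be illustrated with a short sketch. It assumes a zero-mean NCC and the histogram-based NMI variant (H(A) + H(B)) / H(A, B); the exact definitions, the bin count, and the function names are assumptions, not the COBRA implementation.

```python
import numpy as np


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two same-shaped images."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0


def _entropy(p):
    """Shannon entropy (nats) of a probability array, ignoring empty bins."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def nmi(a, b, bins=32):
    """NMI as (H(A) + H(B)) / H(A, B), estimated from a joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    h_ab = _entropy(p_ab.ravel())
    if h_ab == 0.0:
        return float("nan")  # undefined when both images are constant
    return (_entropy(p_ab.sum(axis=1)) + _entropy(p_ab.sum(axis=0))) / h_ab


# Toy volume and two intensity-transformed copies of it.
image = np.arange(64, dtype=np.float64).reshape(4, 4, 4)
linear = 2.0 * image + 5.0        # linear intensity relationship
parabola = (image - 31.5) ** 2    # deterministic but non-linear relationship

ncc_linear, nmi_linear = ncc(image, linear), nmi(image, linear, bins=8)
ncc_parabola, nmi_parabola = ncc(image, parabola), nmi(image, parabola, bins=8)
print(ncc_linear, nmi_linear, ncc_parabola, nmi_parabola)
```

In this toy case the linear copy scores highly on both metrics, while the parabolic copy drives NCC to zero yet still shares information with the original according to NMI.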

Because the reconstructed CBCT can itself carry artifacts such as saturation and streaks, these metrics also reward reconstructions that reproduce those artifacts. They are therefore only meaningful in combination with the other groups, which motivates the third group.

3. Perceptual Similarity — vs. deformed planning CT

Intensity and geometry metrics can both look healthy while a reconstruction still carries streaking, residual scatter shading, or texture artifacts that would make it clinically unusable. Deep feature embeddings are sensitive to exactly these failures. Each metric is the cosine similarity between embeddings of the reconstruction and the reference, computed on axial slices and averaged across all slices to give the per-case score.

CS_BiomedCLIP — cosine similarity of embeddings from the frozen BiomedCLIP vision encoder. Slices are resized to 224×224 and normalized per the encoder's specification. Higher is better.
CS_FlexiCT — cosine similarity of embeddings from the frozen FlexiCT encoder. Slices are resized to 448×448 and normalized per the encoder's specification. Higher is better.
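
The per-case aggregation described above can be sketched as follows, assuming the per-slice embeddings have already been extracted by the frozen encoder. The function name and the `(n_slices, embedding_dim)` array layout are illustrative assumptions; loading BiomedCLIP or FlexiCT and preprocessing the slices is deliberately left out.

```python
import numpy as np


def mean_slice_cosine(recon_emb, ref_emb, eps=1e-12):
    """Average cosine similarity over axial slices.

    recon_emb, ref_emb: arrays of shape (n_slices, embedding_dim),
    one embedding per axial slice, in the same slice order.
    """
    recon_emb = np.asarray(recon_emb, dtype=np.float64)
    ref_emb = np.asarray(ref_emb, dtype=np.float64)
    dots = np.sum(recon_emb * ref_emb, axis=1)
    norms = np.linalg.norm(recon_emb, axis=1) * np.linalg.norm(ref_emb, axis=1)
    per_slice = dots / np.maximum(norms, eps)  # eps guards against zero vectors
    return float(per_slice.mean())


# Toy embeddings for a two-slice case with 2-dimensional features.
recon_embeddings = [[1.0, 0.0], [1.0, 1.0]]
reference_embeddings = [[1.0, 0.0], [1.0, 0.0]]
score = mean_slice_cosine(recon_embeddings, reference_embeddings)
print(f"per-case cosine similarity: {score:.4f}")
```

The first slice matches exactly and the second is 45 degrees off, so the per-case score is the mean of the two per-slice similarities.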

The three groups, at a glance

Group	Metrics	Compared against	Captures
Intensity accuracy	MAE (HU), PSNR (dB)	Deformed planning CT	HU fidelity for dose calculation
Geometric consistency	NCC, NMI	Reconstructed CBCT	Spatial fidelity, distortion
Perceptual similarity	CS_BiomedCLIP, CS_FlexiCT	Deformed planning CT	Artifacts the other two miss

The three groups are deliberately complementary: intensity is what dose needs but is hostage to registration error; geometry sidesteps registration but rewards artifact; perceptual similarity catches the artifacts that slip past both. Only a reconstruction that satisfies all three is clinically convincing.

Ranking — RankThenMean

Submissions are ranked using the RankThenMean procedure:

1. Average per metric. For each submission, compute the mean of each metric over all test patients.
2. Rank per metric. Rank these averages across all submissions, from 1 (best) to n (worst).
3. Average within each group. Average the ranks within each of the three metric groups (intensity, consistency, similarity).
4. Average across groups. Average the three group ranks to obtain the final score; a lower final score is better.
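
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the official ranking code: the `GROUPS` layout, the function name, and the toy numbers are made up for the example, and ties between metric averages simply receive consecutive ranks because the text does not specify how they are handled.

```python
from statistics import mean

# Metric groups from the text; True means higher is better.
GROUPS = {
    "intensity": {"MAE": False, "PSNR": True},
    "consistency": {"NCC": True, "NMI": True},
    "similarity": {"CS_BiomedCLIP": True, "CS_FlexiCT": True},
}


def rank_then_mean(per_case):
    """per_case[team][metric] is a list of per-patient metric values."""
    teams = list(per_case)
    # Step 1: average each metric over all patients, per submission.
    means = {t: {m: mean(v) for m, v in per_case[t].items()} for t in teams}
    group_ranks = {t: [] for t in teams}
    for metrics in GROUPS.values():
        ranks = {t: [] for t in teams}
        for metric, higher_is_better in metrics.items():
            # Step 2: rank the averages, 1 = best.
            ordered = sorted(teams, key=lambda t: means[t][metric],
                             reverse=higher_is_better)
            for position, team in enumerate(ordered, start=1):
                ranks[team].append(position)
        # Step 3: average the ranks within this group.
        for team in teams:
            group_ranks[team].append(mean(ranks[team]))
    # Step 4: average the three group ranks into the final score.
    return {t: mean(group_ranks[t]) for t in teams}


# Toy results for three hypothetical teams and two patients.
toy = {
    "A": {"MAE": [40, 50], "PSNR": [30, 32], "NCC": [0.75, 0.85],
          "NMI": [1.0, 1.2], "CS_BiomedCLIP": [0.88, 0.92], "CS_FlexiCT": [0.8, 0.9]},
    "B": {"MAE": [50, 60], "PSNR": [28, 30], "NCC": [0.94, 0.96],
          "NMI": [1.2, 1.4], "CS_BiomedCLIP": [0.94, 0.96], "CS_FlexiCT": [0.75, 0.85]},
    "C": {"MAE": [65, 75], "PSNR": [26, 28], "NCC": [0.88, 0.92],
          "NMI": [1.1, 1.3], "CS_BiomedCLIP": [0.65, 0.75], "CS_FlexiCT": [0.55, 0.65]},
}
scores = rank_then_mean(toy)
print(sorted(scores.items(), key=lambda item: item[1]))
```

In the toy data, team A wins the intensity group but places last on consistency, so team B ends up with the lowest (best) final score. No metric value is ever rescaled; only ranks are averaged.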

This gives every group equal weight and avoids arbitrary normalization between metrics with different scales and units (e.g. MAE in Hounsfield units versus the unitless NMI). The approach was validated in the SynthRAD challenge series, where it proved stable and robust to outliers without normalization.

Tie-breaking. If two submissions share the same final score, the team with the superior rank on the synthetic data subset ranks higher, as the synthetic data is the most controlled scenario — the only one with true, perfectly-aligned ground truth.
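
One way to express this tie-break is a two-part sort key, sketched below. How the rank on the synthetic subset is computed is not detailed here, so `synthetic_rank` is a hypothetical input, and treating a tie as exact equality of final scores is likewise an assumption.

```python
def order_with_tiebreak(final_scores, synthetic_rank):
    """Sort teams by final score; equal scores fall back to synthetic-subset rank."""
    return sorted(final_scores, key=lambda team: (final_scores[team], synthetic_rank[team]))


# Toy example: teams A and C share the same final score.
final_scores = {"A": 2.0, "B": 1.5, "C": 2.0}
synthetic_rank = {"A": 3, "B": 1, "C": 2}  # hypothetical ranks on synthetic cases
leaderboard = order_with_tiebreak(final_scores, synthetic_rank)
print(leaderboard)
```

Team C is placed ahead of team A because it holds the better rank on the synthetic subset.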

Metrics by Phase

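Both phases use the RankThenMean procedure with the intensity and consistency groups; as the table below shows, the perceptual group is evaluated only in the test phase. The validation phase serves as feedback and rehearsal, and the test phase determines the official result.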

	Validation Phase	Test Phase
Intensity	MAE, PSNR	MAE, PSNR
Consistency	NCC, NMI	NCC, NMI
Perceptual	---	CS_BiomedCLIP, CS_FlexiCT
Evaluation set	Validation set (equal split real / synthetic)	Test set (equal split real / synthetic)
Ranking	RankThenMean (feedback only)	RankThenMean (official ranking)
Visibility	Open leaderboard	Final ranking after test phase closes