📏 Metrics & Ranking

The central difficulty in evaluating CBCT reconstruction is the absence of a perfectly aligned ground truth. A clinical planning CT and the CBCT it is compared against are acquired days — sometimes weeks — apart, so the underlying anatomy has genuinely changed. Deformable image registration (using Elastix with an Impact loss) warps the planning CT onto the CBCT anatomy to narrow this gap, but residual mismatch always remains, particularly in deformable soft tissue and gas-filled regions. Synthetic data sidesteps this by providing known ground truth, but a simulation can never fully reproduce the artifacts of a real scan.

COBRA's evaluation is built around this reality. Rather than trust any single metric against an imperfect reference, we score each reconstruction across three complementary metric groups, each chosen to capture something the others cannot. All evaluation code will be made openly available at github.com/cobra-challenge-2026.

Where indicated, metrics are computed within the field-of-view provided with each case; in cases where contrast medium was administered during the CT, the affected bowel region is excluded from the mask. The code that generates this evaluation region is distributed with the data.


🔍 The Metric Groups

1️⃣ Intensity Accuracy — vs. deformed planning CT

These metrics ask whether the reconstructed Hounsfield units are correct, which is the foundation of any downstream dose calculation — dose engines read tissue attenuation directly from HU. They are computed against the deformably-registered planning CT within the body-contour mask.

  • Mean Absolute Error (MAE) — the mean absolute voxel-wise difference in Hounsfield units. Lower is better. The most direct measure of HU fidelity.
  • Peak Signal-to-Noise Ratio (PSNR) — in dB, relative to the typical HU range. Higher is better. Rewards reconstructions that are globally faithful and penalizes large intensity errors.
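As a minimal sketch of the two intensity metrics, assuming the reconstruction, the deformed planning CT, and the evaluation mask are NumPy arrays of the same shape: the function names and the `data_range` argument are illustrative, and the HU range used by the official evaluation code is not stated here, so it is passed in explicitly rather than guessed.

```python
import numpy as np

def masked_mae(recon, ref, mask):
    """Mean absolute HU difference over voxels where mask is True."""
    m = mask.astype(bool)
    diff = recon[m].astype(np.float64) - ref[m].astype(np.float64)
    return float(np.mean(np.abs(diff)))

def masked_psnr(recon, ref, mask, data_range):
    """PSNR in dB inside the mask; data_range is the assumed HU span."""
    m = mask.astype(bool)
    diff = recon[m].astype(np.float64) - ref[m].astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images inside the mask
    return float(10.0 * np.log10(data_range ** 2 / mse))

# Toy example: a constant 10 HU offset everywhere inside the mask.
ref = np.zeros((2, 4, 4))
recon = ref + 10.0
mask = np.ones_like(ref, dtype=bool)
print(masked_mae(recon, ref, mask))  # 10.0 by construction
print(masked_psnr(recon, ref, mask, data_range=4000.0))  # 4000 HU is illustrative
```

Restricting both metrics to the mask keeps air outside the patient from dominating the average.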

Limitation this group cannot escape: it compares against an imperfectly-aligned CT, so some residual error reflects registration mismatch rather than reconstruction quality — which is exactly why the next two groups exist.

2️⃣ Geometric Consistency — vs. clinical / simulated CBCT

These metrics check that the reconstruction is spatially faithful to the original CBCT anatomy, catching distortion, warping, or hallucinated structure independently of absolute intensity. Because they compare the reconstruction against the CBCT it was derived from (clinical for real cases, simulated for synthetic), they are unaffected by the CT–CBCT registration mismatch above.

  • Normalized Cross-Correlation (NCC) — measures linear intensity correspondence and structural alignment. Higher is better.
  • Normalized Mutual Information (NMI) — measures shared information between images without assuming a linear intensity relationship, making it robust to the intensity differences expected between a reconstruction and its source CBCT. Higher is better.
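The two consistency metrics can be sketched as follows. This is an illustration under stated assumptions, not the official implementation: NCC is taken as the zero-mean normalized correlation, NMI uses the (H(A) + H(B)) / H(A, B) normalization estimated from a joint histogram, and the bin count of 64 is arbitrary. The challenge code may use a different NMI variant or binning.

```python
import numpy as np

def ncc(a, b, mask):
    """Zero-mean normalized cross-correlation inside the mask, in [-1, 1]."""
    m = mask.astype(bool)
    x = a[m].astype(np.float64)
    y = b[m].astype(np.float64)
    x -= x.mean()
    y -= y.mean()
    denom = np.sqrt(np.sum(x * x) * np.sum(y * y))
    return float(np.sum(x * y) / denom) if denom > 0 else 0.0

def _entropy(p):
    """Shannon entropy (nats) of a probability vector, ignoring empty bins."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def nmi(a, b, mask, bins=64):
    """Normalized mutual information, (H(A) + H(B)) / H(A, B), in [1, 2]."""
    m = mask.astype(bool)
    joint, _, _ = np.histogram2d(a[m], b[m], bins=bins)
    pxy = joint / joint.sum()
    h_joint = _entropy(pxy.ravel())
    if h_joint == 0:
        return float("nan")  # both images constant: NMI is undefined
    return (_entropy(pxy.sum(axis=1)) + _entropy(pxy.sum(axis=0))) / h_joint

# Toy example: a linear intensity remapping leaves NCC at 1.
rng = np.random.default_rng(0)
vol = rng.normal(size=(4, 16, 16))
mask = np.ones_like(vol, dtype=bool)
print(ncc(vol, 2.0 * vol + 5.0, mask))
print(nmi(vol, vol, mask))
```

NCC only rewards a linear intensity relationship, while NMI stays high under any consistent intensity mapping, which is why the two are paired.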

3️⃣ Perceptual Similarity — vs. deformed planning CT

Intensity and geometry metrics can both look healthy while a reconstruction still carries streaking, residual scatter shading, or texture artifacts that would make it clinically unusable. Deep feature embeddings are sensitive to exactly these failures. Each metric is the cosine similarity between embeddings of the reconstruction and the reference, computed on axial slices and averaged across all slices to give the per-case score.

  • CS_BiomedCLIP — cosine similarity of embeddings from the frozen BiomedCLIP vision encoder. Slices are resized to 224×224 and normalized per the encoder's specification. Higher is better.
  • CS_MedDINOv3 — cosine similarity of embeddings from the frozen MedDINOv3 encoder. Slices are resized to 448×448 and normalized per the encoder's specification. Higher is better.
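The slice-wise scoring described above can be sketched independently of any particular encoder. In the snippet below, `embed` is a hypothetical callable that maps one 2D slice to a 1D feature vector; `toy_embed` is only a stand-in so the example runs, and it does not reproduce BiomedCLIP or MedDINOv3 preprocessing (resizing, normalization) or their actual loading APIs. Axis 0 is assumed to be the axial direction.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two feature vectors."""
    u = np.asarray(u, dtype=np.float64).ravel()
    v = np.asarray(v, dtype=np.float64).ravel()
    return float(np.dot(u, v) / max(np.linalg.norm(u) * np.linalg.norm(v), eps))

def per_case_score(recon, ref, embed):
    """Mean slice-wise cosine similarity; slices are taken along axis 0."""
    sims = [cosine_similarity(embed(r), embed(g)) for r, g in zip(recon, ref)]
    return float(np.mean(sims))

def toy_embed(slice_2d):
    """Hypothetical stand-in for a frozen encoder: column and row means."""
    return np.concatenate([slice_2d.mean(axis=0), slice_2d.mean(axis=1)])

# Toy example: a volume compared with itself scores (approximately) 1.
rng = np.random.default_rng(0)
vol = rng.uniform(0.1, 1.0, size=(5, 8, 8))
print(per_case_score(vol, vol, toy_embed))
```

In a real pipeline, `embed` would wrap the frozen encoder together with its resize and normalization steps, so that the averaging logic stays the same for both metrics.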

The three groups, at a glance

| Group | Metrics | Compared against | Captures |
| --- | --- | --- | --- |
| Intensity accuracy | MAE (HU), PSNR (dB) | Deformed planning CT | HU fidelity for dose calculation |
| Geometric consistency | NCC, NMI | Clinical / simulated CBCT | Spatial fidelity, distortion |
| Perceptual similarity | CS_BiomedCLIP, CS_MedDINOv3 | Deformed planning CT | Artifacts the other two miss |

The three groups are deliberately complementary: intensity is what dose needs but is hostage to registration error; geometry sidesteps registration but ignores absolute HU; perceptual similarity catches the artifacts that slip past both. Only a reconstruction that satisfies all three is clinically convincing.


🏆 Ranking — RankThenMean

Submissions are ranked using the RankThenMean procedure:

  1. Average per metric. For each metric, compute the mean value over all test patients for each submission.
  2. Rank per metric. Rank these averages across all submissions, from 1 (best) to n (worst).
  3. Average within each group. Average the ranks within each of the three metric groups (intensity, consistency, similarity).
  4. Average across groups. Average the three group-ranks to obtain the final score.

This gives every group equal weight and avoids arbitrary normalization between metrics with different scales and units (e.g. MAE in Hounsfield units versus the unitless NMI). The approach was validated in the SynthRAD challenge series, where it proved stable and robust to outliers without normalization.
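The four steps can be sketched in a few lines of NumPy. This is an illustration, not the official ranking code: the input layout (`per_patient[team][metric]` as a list of per-patient values), the function names, and the use of average ranks when two submissions tie on a metric are all assumptions, and the numbers in the example are made up.

```python
import numpy as np

# Metric groups; the boolean says whether higher values are better.
GROUPS = {
    "intensity": {"MAE": False, "PSNR": True},
    "consistency": {"NCC": True, "NMI": True},
    "similarity": {"CS_BiomedCLIP": True, "CS_MedDINOv3": True},
}

def rank_values(values, higher_is_better):
    """Rank from 1 (best); tied values share the average rank (assumption)."""
    v = np.asarray(values, dtype=np.float64)
    key = -v if higher_is_better else v
    better = (key[None, :] < key[:, None]).sum(axis=1)
    tied = (key[None, :] == key[:, None]).sum(axis=1) - 1
    return 1.0 + better + 0.5 * tied

def rank_then_mean(per_patient):
    """Return {team: final score}; lower is better."""
    teams = sorted(per_patient)
    group_ranks = []
    for metrics in GROUPS.values():
        ranks = []
        for metric, higher in metrics.items():
            means = [np.mean(per_patient[t][metric]) for t in teams]  # step 1
            ranks.append(rank_values(means, higher))                  # step 2
        group_ranks.append(np.mean(ranks, axis=0))                    # step 3
    final = np.mean(group_ranks, axis=0)                              # step 4
    return dict(zip(teams, final.tolist()))

# Made-up per-patient values for three hypothetical teams.
example = {
    "A": {"MAE": [40, 50], "PSNR": [30, 32], "NCC": [0.90, 0.92],
          "NMI": [1.3, 1.4], "CS_BiomedCLIP": [0.8, 0.9], "CS_MedDINOv3": [0.7, 0.8]},
    "B": {"MAE": [60, 70], "PSNR": [28, 28], "NCC": [0.95, 0.97],
          "NMI": [1.5, 1.5], "CS_BiomedCLIP": [0.7, 0.7], "CS_MedDINOv3": [0.7, 0.7]},
    "C": {"MAE": [80, 80], "PSNR": [25, 25], "NCC": [0.80, 0.80],
          "NMI": [1.1, 1.1], "CS_BiomedCLIP": [0.6, 0.6], "CS_MedDINOv3": [0.5, 0.5]},
}
print(rank_then_mean(example))
```

In this example team A is best on intensity and similarity but second on consistency, so its final score is (1 + 2 + 1) / 3; team B scores (2 + 1 + 2) / 3 and team C scores 3.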

Tie-breaking. If two submissions share the same final score, the team with the superior rank on the synthetic data subset ranks higher, as the synthetic data is the most controlled scenario — the only one with true, perfectly-aligned ground truth.
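One way to express the tie-break is a two-part sort key. The sketch below assumes that the rank on the synthetic subset is available as a separate score per team (lower is better) and that final scores are compared after rounding to absorb floating-point noise; both choices, and the function name, are illustrative.

```python
def order_with_tiebreak(final_score, synthetic_score, digits=9):
    """Sort teams by final score, then by synthetic-subset score (lower wins)."""
    return sorted(
        final_score,
        key=lambda team: (round(final_score[team], digits), synthetic_score[team]),
    )

# Made-up scores: A and B tie overall, B is better on the synthetic subset.
final_score = {"A": 1.5, "B": 1.5, "C": 3.0}
synthetic_score = {"A": 2.0, "B": 1.0, "C": 3.0}
print(order_with_tiebreak(final_score, synthetic_score))  # ['B', 'A', 'C']
```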


📊 Metrics by Phase

The same ranking procedure is used in both phases, but the perceptual group is scored only in the test phase. The validation phase serves as feedback and rehearsal; the test phase determines the official result.

| | Validation Phase | Test Phase |
| --- | --- | --- |
| Intensity | MAE, PSNR | MAE, PSNR |
| Consistency | NCC, NMI | NCC, NMI |
| Perceptual | --- | CS_BiomedCLIP, CS_MedDINOv3 |
| Evaluation set | Validation set (equal split real / synthetic) | Test set (equal split real / synthetic) |
| Ranking | RankThenMean (feedback only) | RankThenMean (official ranking) |
| Visibility | Open leaderboard | Final ranking after test phase closes |

🧮 Handling Missing or Corrupt Submissions

If a submission fails to produce a reconstruction for a patient, the platform reports the missing value. If a resubmission is still incomplete or corrupt, the missing reconstruction is replaced by a black image (air only), and the metrics for that patient are computed on it like any other case. During the test phase, dockerized algorithms run on the full set, so a submission that passed the preliminary phase is expected to produce all cases.
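The fallback can be pictured with a short sketch. Everything here is an assumption made for illustration: the platform's actual checks are not described, "corrupt" is interpreted as a wrong shape or non-finite voxels, and the air value is taken as the nominal -1000 HU.

```python
import numpy as np

AIR_HU = -1000.0  # nominal HU of air; the value the platform uses is an assumption

def fill_missing(recons, expected_shapes):
    """Replace missing or unusable volumes by an air-only volume."""
    filled = {}
    for case_id, shape in expected_shapes.items():
        vol = recons.get(case_id)
        unusable = (
            vol is None
            or vol.shape != tuple(shape)
            or not np.all(np.isfinite(vol))
        )
        if unusable:
            vol = np.full(shape, AIR_HU, dtype=np.float32)
        filled[case_id] = vol
    return filled

# Hypothetical case IDs: "p2" was never produced, so it becomes air-only.
recons = {"p1": np.zeros((2, 3, 3), dtype=np.float32)}
shapes = {"p1": (2, 3, 3), "p2": (2, 3, 3)}
filled = fill_missing(recons, shapes)
print(filled["p2"].min(), filled["p2"].max())
```

Because an air-only volume is far from any patient anatomy, it scores poorly on every metric, so a single missing case can noticeably lower a submission's per-metric average.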