🤺 Evaluation criteria
Evaluation will be performed on held-out test cases. Each task is scored with a combination of six metrics that reflect the aims and specific challenges of PET/CT lesion detection and segmentation. All six are built from three basic measures:
- Foreground Dice score (DSC) of segmented lesions
- False positive volume (FPV): Volume of predicted connected components that do not overlap with ground-truth positives
- False negative volume (FNV): Volume of positive connected components in the ground truth that do not overlap with the estimated segmentation mask
Figure: Example for the evaluation. The Dice score measures the overlap between the predicted lesion segmentation (blue) and the ground truth (red). Additionally, special emphasis is put on false negatives by measuring their volume (i.e., entirely missed lesions) and on false positives by measuring their volume (i.e., large false positive volumes such as brain or bladder will result in a low score).
A Python script computing these evaluation metrics is provided under https://github.com/lab-midas/autoPETCTIV.
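The official script in the repository above is the reference implementation. As a rough illustration of the three basic measures only, the following sketch computes DSC, FPV, and FNV on binary NumPy masks. It makes several assumptions that may differ from the official script: connected components are found with `scipy.ndimage.label` and its default face connectivity, the Dice score of two empty masks is returned as NaN, and `voxel_vol` (a hypothetical parameter name) is the volume of one voxel in the desired unit.

```python
import numpy as np
from scipy import ndimage


def dice_score(gt, pred):
    """Foreground Dice between two binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    denom = gt.sum() + pred.sum()
    if denom == 0:
        return float("nan")  # assumption: undefined when both masks are empty
    return 2.0 * np.logical_and(gt, pred).sum() / denom


def _non_overlapping_volume(source, other, voxel_vol):
    """Volume of connected components of `source` sharing no voxel with `other`."""
    # Assumption: default structuring element (face connectivity).
    labels, n_components = ndimage.label(source)
    volume = 0.0
    for idx in range(1, n_components + 1):
        component = labels == idx
        if not np.logical_and(component, other).any():
            volume += component.sum() * voxel_vol
    return volume


def false_positive_volume(gt, pred, voxel_vol=1.0):
    """Predicted components that touch no ground-truth voxel."""
    return _non_overlapping_volume(pred.astype(bool), gt.astype(bool), voxel_vol)


def false_negative_volume(gt, pred, voxel_vol=1.0):
    """Ground-truth components that touch no predicted voxel (missed lesions)."""
    return _non_overlapping_volume(gt.astype(bool), pred.astype(bool), voxel_vol)


# Toy example: two lesions in the ground truth, one of them missed,
# plus one spurious predicted component.
gt = np.zeros((1, 1, 10), dtype=np.uint8)
gt[0, 0, 0:2] = 1   # lesion A (partly found)
gt[0, 0, 5:7] = 1   # lesion B (missed entirely)
pred = np.zeros_like(gt)
pred[0, 0, 1:3] = 1  # overlaps lesion A in one voxel
pred[0, 0, 8:10] = 1  # false positive component

print(dice_score(gt, pred))
print(false_positive_volume(gt, pred, voxel_vol=0.5))
print(false_negative_volume(gt, pred, voxel_vol=0.5))
```

Note that a predicted component overlapping a lesion in even a single voxel contributes nothing to FPV in this sketch; its surplus voxels only lower the Dice score.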
📋 Task 1: Single-staging whole-body PET/CT
For Task 1, we will evaluate the following metrics:
- DSC @last: Interaction Efficacy of DSC at last interactive segmentation step
- FPV @last: Interaction Efficacy of FPV at last interactive segmentation step
- FNV @last: Interaction Efficacy of FNV at last interactive segmentation step
- AUC-DSC: Interaction Utilization as Area under the curve for DSC
- AUC-FPV: Interaction Utilization as Area under the curve for FPV
- AUC-FNV: Interaction Utilization as Area under the curve for FNV
DSC, FPV, and FNV will be evaluated iteratively over 11 interactive segmentation steps. In each step, an additional standardized and pre-simulated tumor (foreground) and background click, represented as a set of 3D coordinates, will be provided alongside the input image. This process will progress incrementally from 0 clicks to the full allocation of 10 tumor and 10 background clicks per image.
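The incremental click schedule described above can be sketched as a simple loop. This is only an illustration: `predict`, `clicks_for_step`, and the dictionary layout of the clicks are hypothetical names chosen for the example, not the interface used by the challenge, and we assume that step k exposes the first k tumor and the first k background clicks.

```python
N_STEPS = 11  # steps 0..10, i.e. 0 to 10 clicks of each kind


def clicks_for_step(step, tumor_clicks, background_clicks):
    """Return the clicks visible at interaction step `step` (0-based)."""
    if not 0 <= step < N_STEPS:
        raise ValueError(f"step must be in [0, {N_STEPS - 1}], got {step}")
    return {
        "tumor": tumor_clicks[:step],
        "background": background_clicks[:step],
    }


def run_interactive_evaluation(predict, image, tumor_clicks, background_clicks):
    """Call a (hypothetical) model once per step and collect its predictions."""
    predictions = []
    for step in range(N_STEPS):
        clicks = clicks_for_step(step, tumor_clicks, background_clicks)
        predictions.append(predict(image, clicks))
    return predictions


# Dummy "model" that just reports how many clicks it received.
def count_clicks(image, clicks):
    return len(clicks["tumor"]), len(clicks["background"])


tumor = [(x, x, x) for x in range(10)]            # 10 pre-simulated 3D coordinates
background = [(x, 0, 0) for x in range(10)]
history = run_interactive_evaluation(count_clicks, None, tumor, background)
print(history[0], history[-1])
```

Passing an empty list for `tumor_clicks` covers volumes for which only background clicks are supplied.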
Metrics 1-3 assess the final segmentation quality achieved after incorporating all clicks in the last (11th) interactive segmentation iteration. They reflect the performance of the model in producing accurate annotations after completing the full interaction process.
Metrics 4-6 evaluate the AUC for DSC, FPV, and FNV based on the model’s intermediate predictions after each interactive segmentation step. The AUC is calculated using the trapezoidal rule, where the x-axis represents the interactive segmentation step (0 to 10) and the y-axis represents the corresponding metric value at each step. This metric quantifies how efficiently a model utilizes the additional information in the form of clicks to achieve a clinically relevant segmentation, measuring how quickly accurate annotations are produced as user clicks are incrementally added.
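A minimal sketch of the trapezoidal rule over the interaction steps is shown below. It assumes unit spacing between steps and no normalization of the area; whether the official script normalizes the AUC is not stated here, so treat that as an open assumption. The function name `auc_trapezoid` is our own.

```python
def auc_trapezoid(values):
    """Area under a metric curve sampled at steps 0, 1, ..., len(values) - 1."""
    if len(values) < 2:
        raise ValueError("need at least two points to integrate")
    # Each pair of neighbouring steps contributes a trapezoid of width 1.
    return sum((left + right) / 2.0 for left, right in zip(values, values[1:]))


# Two hypothetical models that reach the same final DSC of 0.8:
fast_learner = [0.2, 0.6, 0.8] + [0.8] * 8   # improves after the first clicks
slow_learner = [0.2] * 8 + [0.6, 0.8, 0.8]   # improves only near the end

print(auc_trapezoid(fast_learner))
print(auc_trapezoid(slow_learner))
```

Both curves have identical DSC @last, but the first yields an area of about 7.5 and the second about 3.3, which is the difference AUC-DSC is meant to capture. For FPV and FNV the same computation applies, with a smaller area being better.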
For test cases that do not contain positives (no lesions), only metric 2 will be used. For such volumes without tumors, we will provide only background clicks at every interaction step.
📋 Task 2: Longitudinal CT screening
For Task 2, we will evaluate the following metrics:
- DSC @last: Interaction Efficacy of DSC with given lesion center in follow-up scan
- FPV @last: Interaction Efficacy of FPV with given lesion center in follow-up scan
- FNV @last: Interaction Efficacy of FNV with given lesion center in follow-up scan
- DSC @init: Interaction Efficacy of DSC without given lesion center in follow-up scan
- FPV @init: Interaction Efficacy of FPV without given lesion center in follow-up scan
- FNV @init: Interaction Efficacy of FNV without given lesion center in follow-up scan
DSC, FPV, and FNV will be evaluated with (metrics 1-3) and without (metrics 4-6) a standardized, pre-simulated lesion center click in the follow-up scan (one click per lesion). Each lesion center click is represented as a set of 3D coordinates matching the image.
For test cases that do not contain positives (no lesions), only metrics 2 and 5 will be used.
📈 Ranking
📋 Task 1: Single-staging whole-body PET/CT
We divide the test dataset into subsets based on center and tracer (i.e., PSMA LMU, PSMA UKT, FDG LMU, FDG UKT) and calculate the average metrics for Interaction Efficacy (DSC @last: higher = better, FPV @last: lower = better, FNV @last: lower = better) and Interaction Utilization (AUC-DSC: higher = better, AUC-FPV: lower = better, AUC-FNV: lower = better) within each subset. Then, we average the subset averages. For each of the six metrics, we compute the rank over all submissions. Finally, we generate the overall rank by combining the six metric ranks using a weighting factor: DSC @last (0.25), FPV @last (0.125), FNV @last (0.125), AUC-DSC (0.25), AUC-FPV (0.125), AUC-FNV (0.125).
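The two-level averaging (first within each center/tracer subset, then across subsets) can be sketched as follows. The record layout with `"subset"` and metric keys is a hypothetical structure chosen for this example.

```python
from collections import defaultdict
from statistics import mean


def average_of_subset_averages(case_results, metric):
    """Average `metric` within each subset, then average the subset means.

    case_results: list of dicts with a "subset" key and one key per metric.
    Returns (overall_value, per_subset_means).
    """
    by_subset = defaultdict(list)
    for case in case_results:
        by_subset[case["subset"]].append(case[metric])
    subset_means = {name: mean(values) for name, values in by_subset.items()}
    return mean(list(subset_means.values())), subset_means


# Hypothetical per-case results for one submission.
cases = [
    {"subset": "PSMA LMU", "DSC @last": 0.8},
    {"subset": "PSMA LMU", "DSC @last": 0.6},
    {"subset": "FDG UKT", "DSC @last": 0.4},
]
overall, per_subset = average_of_subset_averages(cases, "DSC @last")
print(per_subset)
print(overall)
```

In this toy example the plain mean over all three cases would be 0.6, whereas the average of subset averages is about 0.55: each subset counts equally regardless of how many cases it contains.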
📋 Task 2: Longitudinal CT screening
Across all test cases, we will compute the average metrics for Interaction Efficacy with (DSC @last: higher = better, FPV @last: lower = better, FNV @last: lower = better) and without (DSC @init: higher = better, FPV @init: lower = better, FNV @init: lower = better) the provided lesion center click, producing six averaged metrics per submission. Then, we compute separate rankings for each of the six averaged metrics. Finally, we generate the overall rank by combining the six metric ranks using a weighting factor: DSC @last (0.25), FPV @last (0.125), FNV @last (0.125), DSC @init (0.25), FPV @init (0.125), FNV @init (0.125).
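The per-metric ranking and weighted combination can be illustrated with the sketch below. Two details are assumptions on our part, since the text does not specify them: the ranks are combined as a weighted sum (lower combined value = better), and tied submissions share the better rank. The function names are hypothetical.

```python
WEIGHTS = {
    "DSC @last": 0.25, "FPV @last": 0.125, "FNV @last": 0.125,
    "DSC @init": 0.25, "FPV @init": 0.125, "FNV @init": 0.125,
}
HIGHER_IS_BETTER = {"DSC @last", "DSC @init"}


def metric_ranks(scores, higher_is_better):
    """Rank submissions on one metric; rank 1 is best.

    Assumption: ties share the better rank (1 + number of strictly better entries).
    """
    ranks = {}
    for name, value in scores.items():
        if higher_is_better:
            n_better = sum(other > value for other in scores.values())
        else:
            n_better = sum(other < value for other in scores.values())
        ranks[name] = 1 + n_better
    return ranks


def overall_ranking(table):
    """table: {submission: {metric: averaged value}} -> list sorted best first."""
    combined = {name: 0.0 for name in table}
    for metric, weight in WEIGHTS.items():
        scores = {name: row[metric] for name, row in table.items()}
        for name, rank in metric_ranks(scores, metric in HIGHER_IS_BETTER).items():
            combined[name] += weight * rank  # assumption: weighted sum of ranks
    return sorted(combined.items(), key=lambda item: item[1])


# Two hypothetical submissions with made-up averaged metrics.
table = {
    "A": {"DSC @last": 0.8, "FPV @last": 1.0, "FNV @last": 2.0,
          "DSC @init": 0.6, "FPV @init": 5.0, "FNV @init": 4.0},
    "B": {"DSC @last": 0.7, "FPV @last": 2.0, "FNV @last": 1.0,
          "DSC @init": 0.5, "FPV @init": 6.0, "FNV @init": 3.0},
}
print(overall_ranking(table))
```

Here submission A wins both DSC and both FPV rankings while B wins both FNV rankings, giving combined values of 1.25 for A and 1.75 for B. The same scheme applies to Task 1 by replacing the `@init` metrics with the three AUC metrics.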