Evaluation & Ranking - autoPET/CT IV

🤺 Evaluation criteria

Evaluation will be performed on held-out test cases. A combination of six metrics will be used, reflecting the aims and specific challenges of PET/CT lesion detection and segmentation. The metrics are built from three basic measures:

Foreground Dice score (DSC) of segmented lesions
False positive volume (FPV): Volume of predicted connected components that do not overlap with any ground-truth positives
False negative volume (FNV): Volume of positive connected components in the ground truth that do not overlap with the estimated segmentation mask

Figure: Example for the evaluation. The Dice score is calculated to measure the correct overlap between the predicted lesion segmentation (blue) and the ground truth (red). Additionally, special emphasis is put on false negatives by measuring their volume (i.e., entirely missed lesions) and on false positives by measuring their volume (i.e., large false positive volumes such as brain or bladder will result in a low score).

A Python script computing these evaluation metrics is provided under https://github.com/lab-midas/autoPETCTIV.
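To make the three basic measures concrete, the following is a minimal, dependency-free sketch and not the official script. It assumes masks are given as sets of (z, y, x) voxel coordinates, uses 26-connectivity, takes the voxel volume as a parameter, and returns a Dice score of 1.0 when both masks are empty. All of these choices, and every function name, are assumptions made for illustration; the script in the repository above is authoritative.

```python
from itertools import product

# Assumption: 26-connectivity. The official script may use a different neighbourhood.
OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]


def connected_components(voxels):
    """Split a set of (z, y, x) voxels into connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, stack = {seed}, [seed]
        while stack:
            z, y, x = stack.pop()
            for dz, dy, dx in OFFSETS:
                neighbour = (z + dz, y + dy, x + dx)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def dice(pred, gt):
    """Foreground Dice score between two voxel sets."""
    if not pred and not gt:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2 * len(pred & gt) / (len(pred) + len(gt))


def false_positive_volume(pred, gt, voxel_volume_ml=1.0):
    """Volume of predicted components that share no voxel with the ground truth."""
    missed = [c for c in connected_components(pred) if not (c & gt)]
    return voxel_volume_ml * sum(len(c) for c in missed)


def false_negative_volume(pred, gt, voxel_volume_ml=1.0):
    """Volume of ground-truth components that share no voxel with the prediction."""
    missed = [c for c in connected_components(gt) if not (c & pred)]
    return voxel_volume_ml * sum(len(c) for c in missed)
```

Note that a predicted component touching a lesion in even a single voxel contributes nothing to FPV; its non-overlapping part is penalized only through the Dice score.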

📋 Task 1: Single-staging whole-body PET/CT

For Task 1, we will evaluate the following metrics:

DSC @last: Interaction Efficacy of DSC at last interactive segmentation step
FPV @last: Interaction Efficacy of FPV at last interactive segmentation step
FNV @last: Interaction Efficacy of FNV at last interactive segmentation step
AUC-DSC: Interaction Utilization as Area under the curve for DSC
AUC-FPV: Interaction Utilization as Area under the curve for FPV
AUC-FNV: Interaction Utilization as Area under the curve for FNV

DSC, FPV, and FNV will be evaluated iteratively over 11 interactive segmentation steps. In each step, an additional standardized and pre-simulated tumor (foreground) and background click, represented as a set of 3D coordinates, will be provided alongside the input image. This process will progress incrementally from 0 clicks to the full allocation of 10 tumor and 10 background clicks per image.
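The incremental protocol can be sketched as a simple loop. This is only an illustration under stated assumptions: clicks are taken to be cumulative (step k sees the first k tumor and the first k background clicks), and `predict` is a hypothetical callable standing in for a participant's model, not part of any challenge API.

```python
def run_interactive_steps(image, tumor_clicks, background_clicks, predict, n_clicks=10):
    """Return one prediction per step, from 0 clicks up to n_clicks of each type.

    `predict(image, tumor, background)` is a hypothetical model interface.
    Each click is assumed to be a 3D coordinate such as (z, y, x).
    """
    predictions = []
    for step in range(n_clicks + 1):  # 11 steps for n_clicks=10
        predictions.append(
            predict(image, tumor_clicks[:step], background_clicks[:step])
        )
    return predictions
```

With ten clicks of each type this yields 11 predictions, and DSC, FPV, and FNV would then be computed for each of them.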

Metrics 1-3 assess the final segmentation quality achieved after incorporating all clicks in the last (11th) interactive segmentation iteration. They reflect the performance of the model in producing accurate annotations after completing the full interaction process.

Metrics 4-6 evaluate the AUC for DSC, FPV, and FNV based on the model’s intermediate predictions after each interactive segmentation step. The AUC is calculated using the trapezoidal rule, where the x-axis represents the interactive segmentation step (0 to 10) and the y-axis represents the corresponding metric value at each step. These metrics quantify how efficiently a model utilizes the additional information in the form of clicks to achieve a clinically relevant segmentation, measuring how quickly accurate annotations are produced as user clicks are incrementally added.
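As a minimal sketch of the trapezoidal rule on this axis: with steps 0 to 10 spaced one unit apart, each pair of neighbouring steps contributes the mean of its two metric values. The function name is illustrative, and whether the official script normalizes the result (for example by the number of intervals) is not stated here, so the unnormalized area is an assumption.

```python
def trapezoidal_auc(values):
    """Area under the metric-vs-step curve with unit spacing between steps.

    `values[k]` is the metric (DSC, FPV, or FNV) after interactive step k.
    """
    if len(values) < 2:
        raise ValueError("need at least two steps to form a trapezoid")
    return sum((a + b) / 2 for a, b in zip(values, values[1:]))


# Two hypothetical models that end at the same DSC but improve at different speeds.
fast = [0.2] + [0.8] * 10
slow = [0.2] * 10 + [0.8]
assert trapezoidal_auc(fast) > trapezoidal_auc(slow)
```

Both hypothetical models would tie on DSC @last, while AUC-DSC favours the one that reaches a good segmentation with fewer clicks.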

For test cases that do not contain positives (no lesions), only metric 2 (FPV @last) will be used.

📋 Task 2: Longitudinal CT

For Task 2, we will evaluate the following metrics:

DSC @last: Interaction Efficacy of DSC with given lesion center in follow-up scan
FPV @last: Interaction Efficacy of FPV with given lesion center in follow-up scan
FNV @last: Interaction Efficacy of FNV with given lesion center in follow-up scan

DSC, FPV, and FNV will be evaluated with a standardized, pre-simulated lesion center click in the follow-up scan (one click per lesion) and will be calculated per lesion. Each lesion center click is represented as a set of 3D coordinates matching the image.

For test cases that do not contain positives (no lesions), only metric 2 (FPV @last) will be used.

📈 Ranking

📋 Task 1: Single-staging whole-body PET/CT

We divide the test dataset into subsets based on center and tracer (i.e., PSMA LMU, PSMA UKT, FDG LMU, FDG UKT) and calculate the average metrics for Interaction Efficacy (DSC @last: higher = better, FPV @last: lower = better, FNV @last: lower = better) and Interaction Utilization (AUC-DSC: higher = better, AUC-FPV: lower = better, AUC-FNV: lower = better) within each subset. Then, we average the subset averages, so that each subset contributes equally regardless of its number of cases. For each of the six metrics, we compute the rank over all submissions. Finally, we generate the overall rank by combining the six metric ranks using a weighting factor: DSC @last (0.25), FPV @last (0.125), FNV @last (0.125), AUC-DSC (0.25), AUC-FPV (0.125), AUC-FNV (0.125).
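The subset averaging step can be illustrated with a short sketch. The data layout (a list of per-case dictionaries with `subset` and `metrics` keys) and the function name are assumptions for illustration. Skipping cases that lack a metric, such as lesion-free cases that only report FPV, is likewise an assumption about how such cases enter the average.

```python
from statistics import mean


def subset_averaged_metric(cases, metric):
    """Mean of per-subset means for one metric.

    `cases` is a list of dicts such as
    {"subset": "FDG UKT", "metrics": {"DSC @last": 0.7, "FPV @last": 1.2}}.
    Cases that do not report `metric` are skipped (assumption).
    """
    by_subset = {}
    for case in cases:
        if metric in case["metrics"]:
            by_subset.setdefault(case["subset"], []).append(case["metrics"][metric])
    return mean(mean(values) for values in by_subset.values())


cases = [
    {"subset": "PSMA LMU", "metrics": {"DSC @last": 1.0}},
    {"subset": "PSMA LMU", "metrics": {"DSC @last": 1.0}},
    {"subset": "PSMA LMU", "metrics": {"DSC @last": 1.0}},
    {"subset": "FDG UKT", "metrics": {"DSC @last": 0.0}},
]
# Subset means are 1.0 and 0.0, so the result is 0.5 rather than the pooled mean 0.75.
assert subset_averaged_metric(cases, "DSC @last") == 0.5
```

Averaging subset means rather than pooling all cases keeps a large subset from dominating the score.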

📋 Task 2: Longitudinal CT

For each test case, we will compute the average metrics for Interaction Efficacy (DSC @last: higher = better, FPV @last: lower = better, FNV @last: lower = better) with the provided lesion center click, producing three averaged metrics over the lesions per submission. Then, we compute separate rankings for each of the three averaged metrics. Finally, we generate the overall rank by combining the three metric ranks using a weighting factor: DSC @last (0.5), FPV @last (0.25), FNV @last (0.25).
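The rank combination can be sketched as follows, using the Task 2 weights given above. How ties are broken is not specified here, so the sketch assumes tied submissions share the average of their positions. The function names and the team labels are hypothetical.

```python
TASK2_WEIGHTS = {"DSC @last": 0.5, "FPV @last": 0.25, "FNV @last": 0.25}
HIGHER_IS_BETTER = {"DSC @last"}  # FPV and FNV: lower = better


def rank(values, higher_is_better):
    """Map each submission to its rank (1 = best); ties share the average rank (assumption)."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        for name in ordered[i:j + 1]:
            ranks[name] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def weighted_rank_score(scores, weights, higher_is_better):
    """Combine per-metric ranks into one score per submission (lower = better)."""
    combined = {name: 0.0 for name in scores}
    for metric, weight in weights.items():
        metric_ranks = rank({n: s[metric] for n, s in scores.items()},
                            metric in higher_is_better)
        for name in scores:
            combined[name] += weight * metric_ranks[name]
    return combined


scores = {  # hypothetical averaged metrics for two submissions
    "team_a": {"DSC @last": 0.8, "FPV @last": 2.0, "FNV @last": 1.0},
    "team_b": {"DSC @last": 0.6, "FPV @last": 5.0, "FNV @last": 0.5},
}
result = weighted_rank_score(scores, TASK2_WEIGHTS, HIGHER_IS_BETTER)
assert result == {"team_a": 1.25, "team_b": 1.75}
```

In this made-up example team_a ranks first on DSC and FPV and second on FNV, giving 0.5·1 + 0.25·1 + 0.25·2 = 1.25. The final ordering follows from sorting these combined scores in ascending order.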