
Candidate metrics #10

@rcannood

Description

This is a review of candidate metrics for evaluating spatial cell segmentation methods against ground truth. The benchmark exposes two data structures to metrics:

  • file_spatial_solution: cell_labels label image (ground truth), cell_boundaries shapes, table with cell_id, region, cell_area, transcript_counts
  • file_processed_prediction: segmentation label image (predicted), table with cell_id, region, counts/normalized/normalized_log/normalized_log_scaled layers

Disclaimer: This overview was generated with the assistance of GitHub Copilot and may contain inaccurate or incomplete information.


Overview

| Priority | Metric | Rationale |
| --- | --- | --- |
| 1 | Panoptic Quality (PQ) | Accepted standard in spatial biology; captures detection and boundary quality in one score |
| 2 | ARI on transcript assignments | Directly measures transcript-level assignment quality; complements PQ |
| 3 | F1 / Precision / Recall | Simple and interpretable; exposes over- vs under-segmentation |
| 4 | Cell-type purity | Measures biological relevance of the segmentation |
| 5 | AP@[0.5:0.95] | More thorough picture at multiple IoU thresholds |
| 6 | Cell area distribution | Diagnostic / sanity check only |
| 7 | Silhouette score | Too indirect; confounded by normalization choices |

Suggestions and corrections are very welcome, particularly around literature references and whether any important metrics have been missed.

1. Panoptic Quality (PQ)

| Property | Value |
| --- | --- |
| Complexity | Medium |
| Literature acceptance | High |
| Evaluated on | Label images (cell_labels vs segmentation) |
| Type | Image analysis |
| Fields used | cell_labels (solution), segmentation (prediction) |

PQ = Detection Quality × Segmentation Quality. A predicted cell is matched to a ground truth cell if their IoU exceeds 0.5. DQ is the F1 of matched pairs; SQ is the mean IoU of matched pairs. The combined score ranges from 0 to 1.

Used as a standard in instance segmentation (COCO benchmark) and increasingly in spatial biology (e.g. Greenwald et al. 2022 Nature Biotechnology, Pachitariu & Stringer 2022 Nature Methods).

Pros: Penalises both false positives and over/under-segmentation in a single interpretable score.
Cons: Requires pixel-level coordinate alignment between prediction and ground truth; sensitive to coordinate transformations.
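A minimal sketch of the PQ computation, assuming two integer label images with 0 as background (function name is illustrative; an IoU > 0.5 match is guaranteed to be one-to-one):

```python
import numpy as np

def panoptic_quality(gt, pred, iou_thresh=0.5):
    """Panoptic Quality = DQ x SQ for two integer label images (0 = background)."""
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pred_ids = [i for i in np.unique(pred) if i != 0]
    matched_ious = []
    used_pred = set()
    for g in gt_ids:
        g_mask = gt == g
        # only predicted labels overlapping this ground-truth cell can match it
        for p in np.unique(pred[g_mask]):
            if p == 0 or p in used_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union
            if iou > iou_thresh:  # IoU > 0.5 makes the match unique
                matched_ious.append(iou)
                used_pred.add(p)
                break
    tp = len(matched_ious)
    fp = len(pred_ids) - tp
    fn = len(gt_ids) - tp
    dq = tp / (tp + 0.5 * fp + 0.5 * fn) if tp else 0.0  # F1 of matched pairs
    sq = float(np.mean(matched_ious)) if tp else 0.0     # mean IoU of matched pairs
    return dq * sq
```

In production one would compute the full IoU matrix via a sparse intersection histogram rather than per-cell masks, but the definition is the same.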


2. Average Precision (AP) at multiple IoU thresholds

| Property | Value |
| --- | --- |
| Complexity | Medium |
| Literature acceptance | High |
| Evaluated on | Label images (cell_labels vs segmentation) |
| Type | Image analysis |
| Fields used | cell_labels (solution), segmentation (prediction) |

Computes precision–recall across a range of IoU thresholds (e.g. 0.5–0.95 in steps of 0.05) and averages the area under each PR curve. AP@0.5 (mAP@0.5 in the multi-class setting) is the most commonly reported single number.

De-facto standard in instance segmentation and used in cell segmentation benchmarks such as CellSeg (Ma et al. 2024).

Pros: More complete picture than a single IoU cutoff; separates performance at strict and lenient thresholds.
Cons: Computationally heavier; less intuitive than PQ as a single summary number.
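A sketch of the simplified variant common in cell segmentation benchmarks (the Cellpose-style AP = TP / (TP + FP + FN) averaged over thresholds, rather than the full confidence-ranked PR-curve AP; function name is illustrative):

```python
import numpy as np

def average_precision(gt, pred, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Simplified AP = TP / (TP + FP + FN), averaged over IoU thresholds."""
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pred_ids = [i for i in np.unique(pred) if i != 0]
    # precompute IoU for every overlapping (gt, pred) pair
    ious = {}
    for g in gt_ids:
        g_mask = gt == g
        for p in np.unique(pred[g_mask]):
            if p == 0:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            ious[(g, p)] = inter / union
    aps = []
    for t in thresholds:
        # greedy one-to-one matching, highest IoU first
        matched_g, matched_p = set(), set()
        for (g, p), iou in sorted(ious.items(), key=lambda kv: -kv[1]):
            if iou >= t and g not in matched_g and p not in matched_p:
                matched_g.add(g)
                matched_p.add(p)
        tp = len(matched_g)
        fp = len(pred_ids) - tp
        fn = len(gt_ids) - tp
        aps.append(tp / (tp + fp + fn) if (tp + fp + fn) else 1.0)
    return float(np.mean(aps))
```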


3. Adjusted Rand Index (ARI) on Transcript Assignments

| Property | Value |
| --- | --- |
| Complexity | Low |
| Literature acceptance | Medium |
| Evaluated on | Per-transcript cell assignments derived from label images |
| Type | Clustering / transcript assignment |
| Fields used | cell_labels + transcripts point coordinates (solution) vs segmentation + transcripts (prediction) |

Treats transcript-to-cell assignment as a clustering problem. For each transcript, its cell ID is looked up in both the ground truth and predicted label image, and ARI is computed between the two resulting assignment vectors. ARI = 1 indicates perfect agreement; 0 is the expected value under random assignment.

Used in transcript-assignment benchmarks (e.g. Petukhov et al. 2022 Nature Biotechnology).

Pros: Directly measures what matters for downstream analysis (which transcripts belong to which cell); no need for pixel-level IoU.
Cons: Sensitive to how background transcripts (cell_id = 0) are handled; ignores cell shape beyond assignment.
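A minimal sketch, assuming transcript coordinates are already aligned to the pixel grid of both label images (function name and the drop_background choice are illustrative):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def transcript_ari(gt_labels, pred_labels, xs, ys, drop_background=True):
    """ARI between ground-truth and predicted cell assignments of transcripts.

    gt_labels / pred_labels: integer label images (0 = background);
    xs, ys: transcript pixel coordinates.
    """
    xs, ys = np.asarray(xs), np.asarray(ys)
    gt_assign = gt_labels[ys, xs]
    pred_assign = pred_labels[ys, xs]
    if drop_background:
        # only compare transcripts inside a ground-truth cell; how background
        # transcripts are treated strongly affects the score (see cons above)
        keep = gt_assign != 0
        gt_assign, pred_assign = gt_assign[keep], pred_assign[keep]
    return adjusted_rand_score(gt_assign, pred_assign)
```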


4. F1 / Precision / Recall on Cell Detection

| Property | Value |
| --- | --- |
| Complexity | Low |
| Literature acceptance | Medium |
| Evaluated on | Set of detected cell instances |
| Type | Detection |
| Fields used | cell_labels (solution), segmentation (prediction), IoU matching at fixed threshold |

At a fixed IoU threshold (typically 0.5), precision is the fraction of predicted cells matched to a ground truth cell; recall is the fraction of ground truth cells that were detected. F1 is their harmonic mean.

Pros: Simple and interpretable; exposes whether a method over- or under-segments.
Cons: Single threshold; does not capture boundary quality beyond the threshold. Largely subsumed by PQ (whose DQ component is equivalent).
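A minimal sketch of the detection statistics, assuming integer label images with 0 as background (function name is illustrative):

```python
import numpy as np

def detection_f1(gt, pred, iou_thresh=0.5):
    """Precision, recall and F1 for cell detection at a fixed IoU threshold."""
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pred_ids = [i for i in np.unique(pred) if i != 0]
    tp = 0
    for g in gt_ids:
        g_mask = gt == g
        for p in np.unique(pred[g_mask]):
            if p == 0:
                continue
            p_mask = pred == p
            iou = (np.logical_and(g_mask, p_mask).sum()
                   / np.logical_or(g_mask, p_mask).sum())
            if iou > iou_thresh:  # IoU > 0.5 makes the match unique
                tp += 1
                break
    precision = tp / len(pred_ids) if pred_ids else 0.0
    recall = tp / len(gt_ids) if gt_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```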


5. Cell-type Purity (Downstream Biological Quality)

| Property | Value |
| --- | --- |
| Complexity | High |
| Literature acceptance | Medium–High |
| Evaluated on | Expression matrix + scRNA-seq reference for label transfer |
| Type | Biological relevance |
| Fields used | counts / normalized_log layers (prediction), scRNA-seq reference for cell-type annotation |

After assigning cell types to predicted cells (e.g. by label transfer from the scRNA-seq reference), measure purity of the resulting clusters relative to ground truth cell-type annotations. Possible statistics include mean cell-type entropy per cluster or adjusted mutual information.

Used in txsim (Kleshchevnikov et al.) and Squidpy-based evaluations.

Pros: Captures biological meaningfulness: perfect pixel overlap can still yield mixed transcriptome profiles if boundaries are slightly off.
Cons: Requires an accurate scRNA-seq reference and a label-transfer step, both of which introduce additional noise.
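The label-transfer step aside, the purity statistic itself is simple; here is a sketch of the mean per-cluster entropy variant mentioned above (function name is illustrative; 0 means every cluster contains a single cell type):

```python
import numpy as np
from collections import Counter
from scipy.stats import entropy

def mean_celltype_entropy(cluster_labels, celltype_labels):
    """Mean Shannon entropy (bits) of ground-truth cell types within each
    predicted cluster; lower is purer."""
    cluster_labels = np.asarray(cluster_labels)
    celltype_labels = np.asarray(celltype_labels)
    ents = []
    for c in np.unique(cluster_labels):
        types = celltype_labels[cluster_labels == c]
        counts = np.array(list(Counter(types).values()), dtype=float)
        ents.append(entropy(counts / counts.sum(), base=2))
    return float(np.mean(ents))
```

Adjusted mutual information (sklearn.metrics.adjusted_mutual_info_score) is a drop-in alternative that corrects for chance agreement.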


6. Silhouette Score of Expression Profiles

| Property | Value |
| --- | --- |
| Complexity | Low–Medium |
| Literature acceptance | Low–Medium |
| Evaluated on | Normalized expression matrix |
| Type | Data quality |
| Fields used | normalized_log or normalized_log_scaled layer, cell-type labels |

Computes the silhouette coefficient of cells in PCA space, grouped by ground truth cell type. A higher score means same-type cells cluster together, indicating clean type-specific expression profiles.

Pros: Does not require pixel-level comparison; captures transcriptomic cohesion.
Cons: Confounded by normalization and HVG selection choices; hard to attribute changes to segmentation quality alone.
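A minimal sketch, assuming a cells × genes normalized_log matrix and ground-truth cell-type labels (function name and the 50-PC default are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def expression_silhouette(X, celltypes, n_pcs=50, seed=0):
    """Silhouette of cells in PCA space, grouped by ground-truth cell type."""
    n_pcs = min(n_pcs, min(X.shape) - 1)  # PCA cannot exceed data rank
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(X)
    return silhouette_score(pcs, celltypes)
```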


7. Cell Area Distribution Similarity

Property Value
Complexity Low
Literature acceptance Low
Evaluated on Per-cell area statistics
Type Diagnostic
Fields used** cell_area (solution table), cell area from predicted segmentation label image

Compares the distribution of predicted cell sizes to the ground truth distribution using e.g. Wasserstein distance or Jensen–Shannon divergence. A method that produces systematically larger or smaller cells will show a distributional mismatch.

Pros: Very fast; useful for diagnosing systematic over/under-segmentation bias.
Cons: A method can match the distribution while still misidentifying individual cells; not sufficient as a primary metric.
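A minimal sketch using the Wasserstein variant, taking ground-truth areas from the solution table's cell_area column and counting predicted areas from the label image (function name is illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def area_distribution_distance(gt_areas, pred_labels):
    """Wasserstein distance between ground-truth and predicted cell-area
    distributions; 0 means identical distributions."""
    ids, counts = np.unique(pred_labels, return_counts=True)
    pred_areas = counts[ids != 0]  # drop background (label 0)
    return wasserstein_distance(gt_areas, pred_areas)
```

Note the distance is in pixel units, so it should be computed on a common resolution (or on physical areas) when comparing methods.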
