From 6700478177f6322a17c6faa10ba8eb01882a4417 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Thu, 11 Jun 2026 11:06:53 -0400 Subject: [PATCH] Add GrandQC slide quality control notebook for IDC DICOM WSI Demonstrates end-to-end use of GrandQC with NCI Imaging Data Commons: - Discover H&E slides via idc-index sm_index metadata - Patch GrandQC scripts for native DICOM directory support via OpenSlide 4.0 - Download TCGA-BRCA DICOM series and run tissue detection + artifact segmentation - Visualize results using GrandQC's own wsi_colors.py palette - Validate the DICOM-based pipeline against GrandQC's pre-computed TCGA masks (zenodo.org/records/14041578) by comparing per-class artifact fractions Co-Authored-By: Claude Sonnet 4.6 --- .../grandqc_slide_quality_with_idc.ipynb | 1135 +++++++++++++++++ 1 file changed, 1135 insertions(+) create mode 100644 notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb new file mode 100644 index 0000000..d505c19 --- /dev/null +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -0,0 +1,1135 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n", + "\n", + "Automated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n", + "\n", + "[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n", + "\n", + "> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n", + "\n", + "[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n", + "\n", + "This notebook demonstrates how to:\n", + "1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n", + "2. **Set up** GrandQC and download its pre-trained models\n", + "3. **Download** slides from IDC and run GrandQC directly on DICOM files\n", + "4. **Visualize** tissue detection and artifact segmentation results\n", + "5. **Leverage** GrandQC's pre-computed QC masks for all 32 TCGA cohorts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "The code and data of this repository are provided to promote reproducible research. They are not intended for clinical care or commercial use.\n", + "\n", + "The software is provided \"as is\", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.\n", + "\n", + "**GrandQC license**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0). Non-commercial research use only; citation of the GrandQC paper is required." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 1: Discover H&E Slides in IDC\n", + "\n", + "IDC provides the `idc-index` Python package for querying slide metadata and downloading DICOM files. We start by installing it and exploring the available whole slide image collections." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install --upgrade idc-index openslide-python openslide-bin" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import matplotlib.patches as mpatches\n", + "import numpy as np\n", + "from IPython.display import IFrame, display\n", + "from idc_index import IDCClient\n", + "\n", + "idc_client = IDCClient()\n", + "print(f\"IDC data version: {idc_client.get_idc_version()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.1 Overview of Slide Microscopy Collections\n", + "\n", + "IDC stores whole slide images using the DICOM Slide Microscopy (SM) standard. The `sm_index` table extends the main metadata index with pathology-specific attributes including staining protocol, tissue type, pixel spacing, and image dimensions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load the slide microscopy index\n", + "idc_client.fetch_index(\"sm_index\")\n", + "\n", + "# Summary of all SM collections\n", + "sm_collections = idc_client.sql_query(\"\"\"\n", + " SELECT\n", + " i.collection_id,\n", + " COUNT(DISTINCT i.PatientID) AS patients,\n", + " COUNT(DISTINCT i.SeriesInstanceUID) AS series,\n", + " ROUND(SUM(i.series_size_MB) / 1024.0, 1) AS size_GB,\n", + " i.license_short_name\n", + " FROM index i\n", + " WHERE i.Modality = 'SM'\n", + " GROUP BY i.collection_id, i.license_short_name\n", + " ORDER BY patients DESC\n", + "\"\"\")\n", + "\n", + "print(f\"Total SM collections: {len(sm_collections)}\")\n", + "print(f\"Total SM series: {sm_collections['series'].sum():,}\")\n", + "print(f\"Total size: {sm_collections['size_GB'].sum():.0f} GB\")\n", + "print()\n", + "sm_collections.head(15)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.2 Filter for H&E-Stained Slides\n", + "\n", + "The `sm_index` table includes a `staining_usingSubstance_CodeMeaning` column (stored as an array) that records the staining protocol for each slide. We filter for slides stained with hematoxylin, which identifies H&E preparations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Count H&E slides per collection\n", + "he_by_collection = idc_client.sql_query(\"\"\"\n", + " SELECT\n", + " i.collection_id,\n", + " COUNT(DISTINCT i.PatientID) AS patients,\n", + " COUNT(DISTINCT i.SeriesInstanceUID) AS he_series,\n", + " ROUND(SUM(i.series_size_MB) / 1024.0, 1) AS size_GB\n", + " FROM index i\n", + " JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID\n", + " WHERE array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'\n", + " GROUP BY i.collection_id\n", + " ORDER BY he_series DESC\n", + "\"\"\")\n", + "\n", + "print(f\"Collections with H&E slides: {len(he_by_collection)}\")\n", + "print(f\"Total H&E series: {he_by_collection['he_series'].sum():,}\")\n", + "he_by_collection.head(15)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.3 Select Slides for Quality Control\n", + "\n", + "For this tutorial we work with slides from the [TCGA-BRCA](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=tcga_brca) collection (The Cancer Genome Atlas — Breast Cancer). This collection is a good choice because:\n", + "\n", + "- It contains over 3,000 H&E-stained slides at 40× magnification\n", + "- GrandQC provides **pre-computed QC masks for all TCGA cohorts** (demonstrated in Part 6)\n", + "- Slides span both tumor and adjacent normal tissue, providing variety in tissue composition\n", + "\n", + "We query the `sm_index` to retrieve per-slide metadata including the `ContainerIdentifier` (the original TCGA slide barcode), which we will later use to match IDC series against the pre-computed GrandQC masks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Query TCGA-BRCA H&E slides with key metadata\n", + "tcga_brca_he = idc_client.sql_query(\"\"\"\n", + " SELECT\n", + " i.SeriesInstanceUID,\n", + " i.PatientID,\n", + " s.ContainerIdentifier,\n", + " s.primaryAnatomicStructureModifier_CodeMeaning AS tissue_type,\n", + " s.max_TotalPixelMatrixColumns AS width_px,\n", + " s.max_TotalPixelMatrixRows AS height_px,\n", + " s.min_PixelSpacing_2sf AS pixel_spacing_mm,\n", + " s.ObjectiveLensPower AS objective_power,\n", + " ROUND(i.series_size_MB, 1) AS size_MB\n", + " FROM index i\n", + " JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID\n", + " WHERE i.collection_id = 'tcga_brca'\n", + " AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'\n", + " ORDER BY i.series_size_MB ASC\n", + "\"\"\")\n", + "\n", + "print(f\"TCGA-BRCA H&E slides: {len(tcga_brca_he)}\")\n", + "print(f\"Tissue types: {tcga_brca_he['tissue_type'].value_counts().to_dict()}\")\n", + "print(f\"\\nSize range: {tcga_brca_he['size_MB'].min()} – {tcga_brca_he['size_MB'].max()} MB\")\n", + "tcga_brca_he.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.4 Select a Representative Set of Slides\n", + "\n", + "For the hands-on GrandQC run we select a small set of slides that covers:\n", + "- Different tissue types (tumor vs. normal)\n", + "- Different magnifications (20× and 40×)\n", + "- Manageable file sizes for a notebook environment\n", + "\n", + "We cap the selection at slides under 200 MB to keep download and processing times reasonable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Pick up to 5 small slides with variety in tissue type and magnification\n", + "demo_slides = (\n", + " tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200]\n", + " .drop_duplicates(subset=\"tissue_type\")\n", + " .head(5)\n", + " .reset_index(drop=True)\n", + ")\n", + "\n", + "# Fall back to the 5 smallest slides if tissue_type is missing\n", + "if len(demo_slides) < 3:\n", + " demo_slides = tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200].head(5).reset_index(drop=True)\n", + "\n", + "print(f\"Selected {len(demo_slides)} slides for GrandQC processing:\")\n", + "display(\n", + " demo_slides[[\n", + " \"PatientID\", \"ContainerIdentifier\", \"tissue_type\",\n", + " \"objective_power\", \"width_px\", \"height_px\", \"size_MB\"\n", + " ]]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Preview slides in the IDC SLIM viewer (pathology viewer)\n", + "for _, row in demo_slides.iterrows():\n", + " url = idc_client.get_viewer_URL(seriesInstanceUID=row[\"SeriesInstanceUID\"])\n", + " print(f\"{row['ContainerIdentifier']} ({row['tissue_type']}, {row['size_MB']} MB)\")\n", + " print(f\" {url}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 2: GrandQC Setup\n", + "\n", + "Before running GrandQC we need to:\n", + "1. Clone the repository from GitHub\n", + "2. Install its Python dependencies\n", + "3. Download the pre-trained models from Zenodo\n", + "4. Apply a small patch so both scripts accept IDC's DICOM series directories alongside conventional WSI files\n", + "\n", + "> **Note**: GrandQC's scripts currently iterate over files in a flat folder. IDC downloads each DICOM series into its own subdirectory. The patch below (≤ 10 lines changed across 2 files) teaches GrandQC to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. No model weights or algorithm logic are changed." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.1 Clone and Install Dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "# Clone GrandQC (shallow clone — we only need the latest commit)\n", + "!git clone --depth 1 https://github.com/cpath-ukk/grandqc.git\n", + "\n", + "# Install GrandQC Python dependencies.\n", + "# PyTorch is already present in Colab; we install the remaining packages.\n", + "# segmentation-models-pytorch 0.3.1 is pinned because its model.predict() API\n", + "# was removed in 0.4.0 and GrandQC relies on it.\n", + "!pip install -q \\\n", + " \"segmentation-models-pytorch==0.3.1\" \\\n", + " \"rasterio>=1.3\" \\\n", + " \"imagecodecs\" \\\n", + " \"zarr<3\" \\\n", + " \"tifffile\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.2 Download Pre-trained Models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "GRANDQC_SCRIPTS = \"grandqc/01_WSI_inference_OPENSLIDE_QC\"\n", + "MODELS_TD = os.path.join(GRANDQC_SCRIPTS, \"models\", \"td\")\n", + "MODELS_QC = os.path.join(GRANDQC_SCRIPTS, \"models\", \"qc\")\n", + "os.makedirs(MODELS_TD, exist_ok=True)\n", + "os.makedirs(MODELS_QC, exist_ok=True)\n", + "\n", + "# Tissue detection model (~27 MB) — Zenodo record 14507273\n", + "!wget -q --show-progress -O {MODELS_TD}/Tissue_Detection_MPP10.pth \\\n", + " \"https://zenodo.org/records/14507273/files/Tissue_Detection_MPP10.pth?download=1\"\n", + "\n", + "# Artifact segmentation model at MPP=1.5 (~25 MB) — Zenodo record 14041538\n", + "# MPP=1.5 (≈7× objective) is the recommended default for most scanners.\n", + "!wget -q --show-progress -O {MODELS_QC}/GrandQC_MPP15.pth \\\n", + " \"https://zenodo.org/records/14041538/files/GrandQC_MPP15.pth?download=1\"\n", + "\n", + "print(\"Models downloaded:\")\n", + "for f in [f\"{MODELS_TD}/Tissue_Detection_MPP10.pth\", f\"{MODELS_QC}/GrandQC_MPP15.pth\"]:\n", + " size_mb = os.path.getsize(f) / 1024**2\n", + " print(f\" {f} ({size_mb:.1f} MB)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.3 Patch Scripts for DICOM Directory Support\n", + "\n", + "IDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n", + "\n", + "The two changes per script are:\n", + "\n", + "| Script | Change |\n", + "|--------|--------|\n", + "| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n", + "| `main.py` | Same two changes using `open_slide()` |\n", + "\n", + "Both scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "def patch_grandqc_for_dicom(script_path: str) -> None:\n", + " text = Path(script_path).read_text()\n", + " original = text\n", + "\n", + " # ── 1. Auto-detect compute device ──────────────────────────────────────\n", + " text = text.replace(\n", + " \"DEVICE = 'cuda'\",\n", + " \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n", + " \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n", + " )\n", + "\n", + " # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n", + " DICOM_ITER = (\n", + " \"def _iter_slides(d):\\n\"\n", + " \" for n in sorted(os.listdir(d)):\\n\"\n", + " \" p = os.path.join(d, n)\\n\"\n", + " \" if os.path.isfile(p) or (\\n\"\n", + " \" os.path.isdir(p) and\\n\"\n", + " \" any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n", + " \" yield n\\n\"\n", + " )\n", + "\n", + " # wsi_tis_detect.py pattern\n", + " OLD_LIST_TIS = (\n", + " \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n", + " \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n", + " )\n", + " NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n", + "\n", + " # main.py pattern\n", + " OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n", + " NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n", + "\n", + " text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n", + " text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n", + "\n", + " # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n", + " RESOLVE = (\n", + " \" if os.path.isdir(path_slide):\\n\"\n", + " \" _dcms = sorted(x for x in os.listdir(path_slide)\"\n", + " \" if x.lower().endswith('.dcm'))\\n\"\n", + " \" path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n", + " )\n", + "\n", + " for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n", + " old = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n {open_call}\"\n", + " new = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE} {open_call}\"\n", + " text = text.replace(old, new)\n", + "\n", + " if text == original:\n", + " print(f\" WARNING: no changes applied to {script_path}\")\n", + " else:\n", + " Path(script_path).write_text(text)\n", + " n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n", + " n += abs(len(text.splitlines()) - len(original.splitlines()))\n", + " print(f\" Patched {script_path} (+{n} lines changed)\")\n", + "\n", + "\n", + "for script in [\n", + " f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n", + " f\"{GRANDQC_SCRIPTS}/main.py\",\n", + "]:\n", + " patch_grandqc_for_dicom(script)\n", + "print(\"Done.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sanity-check: confirm the DICOM resolver was inserted in both scripts\n", + "for script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n", + " text = Path(script).read_text()\n", + " has_dcm = \"_iter_slides\" in text\n", + " has_res = \"_dcms\" in text\n", + " has_dev = \"cuda.is_available\" in text\n", + " print(f\"{script.split('/')[-1]:30s} dicom_iter={has_dcm} dcm_resolve={has_res} auto_device={has_dev}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 3: Download Slides from IDC\n", + "\n", + "`idc_client.download_from_selection()` fetches DICOM files for the requested series and places each series in its own subdirectory named by `SeriesInstanceUID`.\n", + "\n", + "After downloading we rename each subdirectory to its `ContainerIdentifier` (the original TCGA slide barcode). This ensures that GrandQC's output files and the pre-computed TCGA mask filenames are consistently keyed by the human-readable barcode rather than an opaque UID." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.1 Download DICOM Series" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "SLIDE_DIR = \"slides\"\n", + "os.makedirs(SLIDE_DIR, exist_ok=True)\n", + "\n", + "print(f\"Downloading {len(demo_slides)} slides → {SLIDE_DIR}/\")\n", + "print(\"(This may take a few minutes depending on file sizes and network speed)\\n\")\n", + "\n", + "idc_client.download_from_selection(\n", + " downloadDir=SLIDE_DIR,\n", + " seriesInstanceUID=demo_slides[\"SeriesInstanceUID\"].tolist(),\n", + ")\n", + "print(\"\\nDownload complete.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.2 Rename Series Directories to TCGA Barcodes\n", + "\n", + "GrandQC uses the directory/file name as the key for all output files (tissue mask, QC map, overlay, TSV report row). By renaming each downloaded series directory from its `SeriesInstanceUID` to its `ContainerIdentifier` (TCGA barcode), we get:\n", + "\n", + "- Human-readable output filenames (`TCGA-A7-A26J-01B-02-BS2_MASK.png` instead of a UID)\n", + "- Automatic matching with GrandQC's pre-computed TCGA masks (Part 6), which are keyed by the same barcode" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Build SeriesInstanceUID → ContainerIdentifier map\n", + "uid_to_barcode = dict(zip(demo_slides[\"SeriesInstanceUID\"], demo_slides[\"ContainerIdentifier\"]))\n", + "\n", + "for uid, barcode in uid_to_barcode.items():\n", + " src = os.path.join(SLIDE_DIR, uid)\n", + " dst = os.path.join(SLIDE_DIR, barcode)\n", + " if os.path.isdir(src):\n", + " os.rename(src, dst)\n", + " print(f\" {uid} → {barcode}\")\n", + " elif os.path.isdir(dst):\n", + " print(f\" Already renamed: {barcode}\")\n", + " else:\n", + " print(f\" WARNING: directory not found: {uid}\")\n", + "\n", + "print(\"\\nSlide directories after renaming:\")\n", + "for d in sorted(os.listdir(SLIDE_DIR)):\n", + " dpath = os.path.join(SLIDE_DIR, d)\n", + " if os.path.isdir(dpath):\n", + " n = sum(1 for f in os.listdir(dpath) if f.lower().endswith(\".dcm\"))\n", + " print(f\" {d}/ ({n} .dcm files)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.3 Verify OpenSlide Can Read the DICOM Files\n", + "\n", + "OpenSlide 4.0 (bundled in `openslide-bin`) reads DICOM WSI files natively. Opening any one `.dcm` file from a series directory is sufficient — OpenSlide discovers the remaining pyramid levels from the other files in the same directory that share the same `SeriesInstanceUID`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import openslide\n", + "\n", + "print(f\"OpenSlide version: {openslide.__library_version__}\\n\")\n", + "\n", + "for barcode in uid_to_barcode.values():\n", + " slide_dir = os.path.join(SLIDE_DIR, barcode)\n", + " dcm_files = sorted(f for f in os.listdir(slide_dir) if f.lower().endswith(\".dcm\"))\n", + " if not dcm_files:\n", + " print(f\" {barcode}: no .dcm files found\")\n", + " continue\n", + " entry = os.path.join(slide_dir, dcm_files[0])\n", + " try:\n", + " slide = openslide.OpenSlide(entry)\n", + " w, h = slide.level_dimensions[0]\n", + " mpp = slide.properties.get(\"openslide.mpp-x\", \"N/A\")\n", + " vendor = slide.properties.get(\"openslide.vendor\", \"N/A\")\n", + " print(f\" {barcode}\")\n", + " print(f\" {w:,} × {h:,} px | MPP={mpp} | vendor={vendor} | levels={slide.level_count}\")\n", + " slide.close()\n", + " except Exception as exc:\n", + " print(f\" {barcode}: ERROR — {exc}\")\n", + "\n", + "print(\"\\nAll slides verified.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 4: Run GrandQC Quality Control\n", + "\n", + "GrandQC runs as two sequential scripts, both inside `grandqc/01_WSI_inference_OPENSLIDE_QC/`:\n", + "\n", + "| Step | Script | MPP | Time (GPU) | Output |\n", + "|------|--------|-----|------------|--------|\n", + "| 1 | `wsi_tis_detect.py` | 10 | ~0.4 s/slide | Binary tissue masks |\n", + "| 2 | `main.py` | 1.5 | ~30–45 s/slide | 7-class artifact maps, GeoJSON, TSV report |\n", + "\n", + "The 7 artifact classes are: **1** clean tissue · **2** folds · **3** dark spots · **4** pen marks · **5** air bubbles/edges · **6** out-of-focus · **7** background.\n", + "\n", + "Both scripts load models via relative paths (`./models/…`) and must therefore be invoked with their directory as the working directory. We use `subprocess` with `cwd=GRANDQC_SCRIPTS` to achieve this while keeping output streaming to the cell." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.1 Step 1 — Tissue Detection\n", + "\n", + "Produces a low-resolution binary mask (tissue vs. background) for each slide. The mask is saved as a PNG and re-used by Step 2 to skip background patches and cut inference time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess, sys\n", + "\n", + "OUTPUT_DIR = \"qc_output\"\n", + "os.makedirs(OUTPUT_DIR, exist_ok=True)\n", + "\n", + "slide_dir_abs = os.path.abspath(SLIDE_DIR)\n", + "output_dir_abs = os.path.abspath(OUTPUT_DIR)\n", + "scripts_abs = os.path.abspath(GRANDQC_SCRIPTS)\n", + "\n", + "print(f\"Slides: {slide_dir_abs}\")\n", + "print(f\"Output: {output_dir_abs}\")\n", + "print(f\"Scripts dir: {scripts_abs}\")\n", + "print()\n", + "print(\"Running wsi_tis_detect.py ...\")\n", + "\n", + "result = subprocess.run(\n", + " [sys.executable, \"wsi_tis_detect.py\",\n", + " \"--slide_folder\", slide_dir_abs,\n", + " \"--output_dir\", output_dir_abs],\n", + " cwd=scripts_abs,\n", + ")\n", + "\n", + "if result.returncode != 0:\n", + " print(f\"\\nERROR: wsi_tis_detect.py exited with code {result.returncode}\")\n", + "else:\n", + " mask_dir = os.path.join(output_dir_abs, \"tis_det_mask\")\n", + " masks = sorted(os.listdir(mask_dir))\n", + " print(f\"\\nTissue detection complete — {len(masks)} mask(s) written to {mask_dir}/\")\n", + " for m in masks:\n", + " print(f\" {m}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.2 Step 2 — Artifact Segmentation\n", + "\n", + "Runs the 7-class artifact model at MPP=1.5. For each slide it produces:\n", + "\n", + "- `maps_qc/_map_QC.png` — color-coded artifact map (same resolution as tissue mask)\n", + "- `overlays_qc/_overlay_QC.jpg` — artifact map overlaid on a downscaled slide thumbnail\n", + "- `mask_qc/_mask.png` — raw integer mask (class 1–7 per pixel)\n", + "- `geojson_qc/.geojson` — polygon annotations compatible with QuPath and SLIM\n", + "- `report_qc_output_0_N_stats_per_slide.txt` — TSV with per-slide dimensions and processing time\n", + "\n", + "> **GPU note**: On a T4 GPU (Colab free tier), expect ~30–45 s per slide. On CPU only, plan for 10–30 min per slide depending on size." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Running main.py (artifact segmentation) ...\")\n", + "\n", + "result = subprocess.run(\n", + " [sys.executable, \"main.py\",\n", + " \"--slide_folder\", slide_dir_abs,\n", + " \"--output_dir\", output_dir_abs,\n", + " \"--mpp_model\", \"1.5\",\n", + " \"--create_geojson\", \"Y\"],\n", + " cwd=scripts_abs,\n", + ")\n", + "\n", + "if result.returncode != 0:\n", + " print(f\"\\nERROR: main.py exited with code {result.returncode}\")\n", + "else:\n", + " maps_dir = os.path.join(output_dir_abs, \"maps_qc\")\n", + " maps = sorted(os.listdir(maps_dir))\n", + " print(f\"\\nArtifact segmentation complete — {len(maps)} map(s) in {maps_dir}/\")\n", + " for m in maps:\n", + " print(f\" {m}\")\n", + "\n", + " # Locate the TSV report (name includes the slide count)\n", + " reports = [f for f in os.listdir(output_dir_abs) if f.endswith(\"_stats_per_slide.txt\")]\n", + " if reports:\n", + " import pandas as pd\n", + " report_path = os.path.join(output_dir_abs, reports[0])\n", + " df = pd.read_csv(report_path, sep=\"\\t\")\n", + " print(f\"\\nPer-slide report ({reports[0]}):\")\n", + " display(df[[\"slide_name\", \"mpp\", \"height\", \"width\", \"time\"]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 5: Visualize and Analyze Results\n", + "\n", + "GrandQC writes several output types per slide. We look at three of them:\n", + "\n", + "1. **Tissue detection overlays** — confirms that tissue vs. background separation was correct\n", + "2. **Artifact segmentation maps** — color-coded composite of the 7-class artifact model overlaid on a slide thumbnail\n", + "3. **Raw integer masks** (`mask_qc/`) — pixel-level class labels (1–7) used to compute per-slide artifact fractions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.1 Tissue Detection Overlays\n", + "\n", + "Each overlay blends the original thumbnail (70 %) with a two-class color mask (30 %): blue = tissue, gray = background." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import matplotlib.image as mpimg\n", + "\n", + "overlay_dir = os.path.join(OUTPUT_DIR, \"tis_det_overlay\")\n", + "fnames = sorted(os.listdir(overlay_dir))\n", + "\n", + "fig, axes = plt.subplots(1, len(fnames), figsize=(6 * len(fnames), 5))\n", + "if len(fnames) == 1:\n", + " axes = [axes]\n", + "for ax, fname in zip(axes, fnames):\n", + " ax.imshow(mpimg.imread(os.path.join(overlay_dir, fname)))\n", + " ax.set_title(fname.replace(\"_OVERLAY.jpg\", \"\"), fontsize=7)\n", + " ax.axis(\"off\")\n", + "plt.suptitle(\"Step 1 — Tissue Detection\", fontsize=11, fontweight=\"bold\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.2 Artifact Segmentation Maps\n", + "\n", + "The QC overlay composites the artifact color map onto a downscaled thumbnail of the original slide. Colors match the `wsi_colors.py` palette used by GrandQC:\n", + "\n", + "| Class | Color | Artifact |\n", + "|-------|-------|---------|\n", + "| 1 | ■ Gray `#808080` | Clean tissue |\n", + "| 2 | ■ Tomato `#FF6347` | Folds |\n", + "| 3 | ■ Lime `#00FF00` | Dark spots |\n", + "| 4 | ■ Red `#FF0000` | Pen marks |\n", + "| 5 | ■ Magenta `#FF00FF` | Air bubbles / slide edges |\n", + "| 6 | ■ Indigo `#4B0082` | Out-of-focus |\n", + "| 7 | □ White `#FFFFFF` | Background |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "overlay_qc_dir = os.path.join(OUTPUT_DIR, \"overlays_qc\")\n", + "fnames_qc = sorted(os.listdir(overlay_qc_dir))\n", + "\n", + "fig, axes = plt.subplots(1, len(fnames_qc), figsize=(6 * len(fnames_qc), 5))\n", + "if len(fnames_qc) == 1:\n", + " axes = [axes]\n", + "for ax, fname in zip(axes, fnames_qc):\n", + " ax.imshow(mpimg.imread(os.path.join(overlay_qc_dir, fname)))\n", + " ax.set_title(fname.replace(\"_overlay_QC.jpg\", \"\"), fontsize=7)\n", + " ax.axis(\"off\")\n", + "plt.suptitle(\"Step 2 — Artifact Segmentation Overlays\", fontsize=11, fontweight=\"bold\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.3 Artifact Fractions per Slide\n", + "\n", + "We read the raw integer masks from `mask_qc/` (pixel values 1–7) and compute what fraction of the **tissue area** (excluding background, class 7) belongs to each artifact class. This is the standard metric reported in the GrandQC paper." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from PIL import Image\n", + "\n", + "# GrandQC wsi_colors.py palette — RGB, indices 0-5 correspond to classes 1-6\n", + "GRANDQC_COLORS_RGB = [\n", + " [128, 128, 128], # 1 Clean tissue\n", + " [255, 99, 71], # 2 Folds\n", + " [ 0, 255, 0], # 3 Dark spots\n", + " [255, 0, 0], # 4 Pen marks\n", + " [255, 0, 255], # 5 Bubbles/edges\n", + " [ 75, 0, 130], # 6 Out-of-focus\n", + "]\n", + "CLASS_NAMES_SHORT = [\n", + " \"Clean tissue\", \"Folds\", \"Dark spots\",\n", + " \"Pen marks\", \"Bubbles/edges\", \"Out-of-focus\",\n", + "]\n", + "# Normalize to [0,1] for matplotlib\n", + "mpl_colors = [[r/255, g/255, b/255] for r, g, b in GRANDQC_COLORS_RGB]\n", + "\n", + "mask_dir = os.path.join(OUTPUT_DIR, \"mask_qc\")\n", + "rows = []\n", + "for fname in sorted(os.listdir(mask_dir)):\n", + " barcode = fname.replace(\"_mask.png\", \"\")\n", + " mask = np.array(Image.open(os.path.join(mask_dir, fname)))\n", + " tissue_px = np.sum(mask != 7) # exclude background\n", + " if tissue_px == 0:\n", + " continue\n", + " row = {\"slide\": barcode}\n", + " for cls_idx, name in enumerate(CLASS_NAMES_SHORT, start=1):\n", + " row[name] = np.sum(mask == cls_idx) / tissue_px * 100\n", + " rows.append(row)\n", + "\n", + "stats_df = pd.DataFrame(rows).set_index(\"slide\")\n", + "print(\"Artifact class fractions (% of tissue area):\")\n", + "display(stats_df.round(1))\n", + "\n", + "# ── Stacked bar chart ────────────────────────────────────────────────────────\n", + "ax = stats_df[CLASS_NAMES_SHORT].plot(\n", + " kind=\"bar\",\n", + " stacked=True,\n", + " color=mpl_colors,\n", + " figsize=(max(6, 2.5 * len(stats_df)), 5),\n", + " edgecolor=\"none\",\n", + ")\n", + "ax.set_ylabel(\"% of tissue area\")\n", + "ax.set_xlabel(\"\")\n", + "ax.set_title(\"GrandQC Artifact Fractions per Slide\", fontweight=\"bold\")\n", + "ax.set_xticklabels(ax.get_xticklabels(), rotation=25, ha=\"right\", fontsize=8)\n", + "ax.legend(loc=\"upper right\", fontsize=8, framealpha=0.8)\n", + "ax.set_ylim(0, 110)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.4 View Original Slides in IDC SLIM Viewer\n", + "\n", + "The IDC SLIM viewer is a web-based pathology viewer for DICOM whole slide images. The links below open each slide at full resolution. The GeoJSON annotations produced by GrandQC (`geojson_qc/`) can be loaded into QuPath or other tools that accept GeoJSON overlays for interactive artifact review." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame, display\n", + "\n", + "print(\"IDC SLIM viewer links (open in browser for full-resolution viewing):\\n\")\n", + "for _, row in demo_slides.iterrows():\n", + " url = idc_client.get_viewer_URL(seriesInstanceUID=row[\"SeriesInstanceUID\"])\n", + " print(f\" {row['ContainerIdentifier']}\")\n", + " print(f\" {url}\\n\")\n", + "\n", + "# Embed the first slide inline (750 px height)\n", + "first_url = idc_client.get_viewer_URL(seriesInstanceUID=demo_slides.iloc[0][\"SeriesInstanceUID\"])\n", + "print(f\"Embedding: {demo_slides.iloc[0]['ContainerIdentifier']}\")\n", + "display(IFrame(first_url, width=\"100%\", height=750))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 6: Validating the Pipeline Against Pre-computed TCGA Masks\n", + "\n", + "In Parts 3–5 we ran GrandQC locally on IDC DICOM slides using OpenSlide's native DICOM reader — a path the GrandQC authors did not test in their original publication. Before using the pipeline on new data, it is good practice to verify that our results are consistent with the authors' reference outputs.\n", + "\n", + "The GrandQC authors published pre-computed QC masks for all 32 TCGA cohorts on Zenodo:\n", + "\n", + "> **GrandQC pre-computed masks**: [zenodo.org/records/14041578](https://zenodo.org/records/14041578) \n", + "> 36 tar archives — one per TCGA cohort (some large cohorts split into two parts) — totalling ~18.3 GB.\n", + "\n", + "Each archive unpacks to PNG files named `{ContainerIdentifier}_mask.png` with integer labels 1–7 — the same format and naming convention produced by `main.py`. By downloading the TCGA-BRCA archive and comparing artifact fractions for our demo slides against the reference, we can confirm that the DICOM-based pipeline is producing equivalent results." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.1 Available Cohort Archives\n", + "\n", + "The Zenodo record contains one tar file per TCGA cohort. Large cohorts (BRCA, THCA, GBM, LUAD, UCEC, LGG, TGCT) are split into two parts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "# Fetch the Zenodo record metadata to get the authoritative file list\n", + "rec = requests.get(\"https://zenodo.org/api/records/14041578\").json()\n", + "files = rec[\"files\"]\n", + "\n", + "rows_z = []\n", + "for f in sorted(files, key=lambda x: x[\"key\"]):\n", + " rows_z.append({\n", + " \"archive\": f[\"key\"],\n", + " \"size_MB\": round(f[\"size\"] / 1024**2, 0),\n", + " })\n", + "\n", + "zenodo_df = pd.DataFrame(rows_z)\n", + "total_gb = zenodo_df[\"size_MB\"].sum() / 1024\n", + "print(f\"{len(zenodo_df)} archives | total: {total_gb:.1f} GB\\n\")\n", + "display(zenodo_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.2 Download the TCGA-BRCA Reference Masks\n", + "\n", + "`BRCA.tar` (~2 GB) contains masks for all TCGA-BRCA slides. After extraction, each file is named `{ContainerIdentifier}_mask.png` — identical to what GrandQC writes into `mask_qc/` locally." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "REF_MASKS_DIR = \"brca_ref_masks\"\n", + "os.makedirs(REF_MASKS_DIR, exist_ok=True)\n", + "\n", + "# Download BRCA.tar (~2 GB) — takes 5-10 min on Colab\n", + "!wget -q --show-progress \\\n", + " \"https://zenodo.org/records/14041578/files/BRCA.tar?download=1\" \\\n", + " -O BRCA.tar\n", + "\n", + "# Extract — PNG files land directly in the archive root\n", + "!tar xf BRCA.tar -C {REF_MASKS_DIR}\n", + "\n", + "# Report what we got\n", + "ref_masks = sorted(f for f in os.listdir(REF_MASKS_DIR) if f.endswith(\"_mask.png\"))\n", + "print(f\"\\nExtracted {len(ref_masks)} reference masks to {REF_MASKS_DIR}/\")\n", + "print(\"Sample filenames:\")\n", + "for f in ref_masks[:5]:\n", + " print(f\" {f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.3 Pipeline Validation: Compare Local vs. Reference Masks\n", + "\n", + "The reference masks in `brca_ref_masks/` were produced by the GrandQC authors running on TIFF/SVS files. Our local masks in `qc_output/mask_qc/` were produced by the same model running on IDC DICOM files via OpenSlide's native DICOM reader.\n", + "\n", + "We compare per-class artifact fractions for each demo slide. Close agreement (< 2 percentage points per class) confirms that the DICOM-based path does not introduce bias relative to the authors' original pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def mask_fractions(mask_path):\n", + " mask = np.array(Image.open(mask_path))\n", + " tissue_px = np.sum(mask != 7)\n", + " if tissue_px == 0:\n", + " return None\n", + " return {name: np.sum(mask == cls) / tissue_px * 100\n", + " for cls, name in enumerate(CLASS_NAMES_SHORT, start=1)}\n", + "\n", + "\n", + "# Build comparison table for each demo slide\n", + "compare_rows = []\n", + "for barcode in uid_to_barcode.values():\n", + " local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n", + " ref_path = os.path.join(REF_MASKS_DIR, f\"{barcode}_mask.png\")\n", + "\n", + " if not os.path.exists(local_path):\n", + " print(f\" MISSING local mask: {barcode}\")\n", + " continue\n", + " if not os.path.exists(ref_path):\n", + " print(f\" MISSING reference mask: {barcode} — slide may not be in BRCA.tar\")\n", + " continue\n", + "\n", + " local_frac = mask_fractions(local_path)\n", + " ref_frac = mask_fractions(ref_path)\n", + " if local_frac is None or ref_frac is None:\n", + " continue\n", + "\n", + " for name in CLASS_NAMES_SHORT:\n", + " compare_rows.append({\n", + " \"slide\": barcode,\n", + " \"class\": name,\n", + " \"local_%\": round(local_frac[name], 2),\n", + " \"ref_%\": round(ref_frac[name], 2),\n", + " \"delta_pp\": round(local_frac[name] - ref_frac[name], 2),\n", + " })\n", + "\n", + "compare_df = pd.DataFrame(compare_rows)\n", + "print(\"Per-class fraction comparison (local DICOM run vs. Zenodo reference):\\n\")\n", + "display(compare_df.pivot_table(index=\"slide\", columns=\"class\", values=\"delta_pp\").round(2))\n", + "print(\"\\nδ = local − reference (percentage points). Values near 0 confirm agreement.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scatter plot: local fraction vs. reference fraction for all classes × slides\n", + "fig, ax = plt.subplots(figsize=(5, 5))\n", + "\n", + "class_colors = {name: mpl_colors[i] for i, name in enumerate(CLASS_NAMES_SHORT)}\n", + "\n", + "for cls_name, grp in compare_df.groupby(\"class\"):\n", + " ax.scatter(grp[\"ref_%\"], grp[\"local_%\"],\n", + " label=cls_name, color=class_colors[cls_name],\n", + " s=60, edgecolors=\"black\", linewidths=0.4, zorder=3)\n", + "\n", + "# Identity line\n", + "lim_max = max(compare_df[[\"local_%\", \"ref_%\"]].max()) * 1.05\n", + "ax.plot([0, lim_max], [0, lim_max], \"k--\", linewidth=0.8, label=\"y = x\")\n", + "ax.set_xlabel(\"Reference fraction (%) — Zenodo\")\n", + "ax.set_ylabel(\"Local fraction (%) — DICOM run\")\n", + "ax.set_title(\"Pipeline Validation: Local vs. Reference Artifact Fractions\",\n", + " fontweight=\"bold\", fontsize=9)\n", + "ax.legend(fontsize=7, framealpha=0.8)\n", + "ax.set_xlim(0, lim_max)\n", + "ax.set_ylim(0, lim_max)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.4 Cohort-level QC Statistics\n", + "\n", + "With validation confirmed, we compute per-slide artifact fractions across all TCGA-BRCA slides in the reference archive. This gives a population-level view of slide quality without running any inference locally — each mask is read once and discarded, keeping memory usage constant." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm.notebook import tqdm\n", + "\n", + "cohort_rows = []\n", + "for fname in tqdm(ref_masks, desc=\"Computing stats\"):\n", + " barcode = fname.replace(\"_mask.png\", \"\")\n", + " fracs = mask_fractions(os.path.join(REF_MASKS_DIR, fname))\n", + " if fracs is not None:\n", + " fracs[\"slide\"] = barcode\n", + " cohort_rows.append(fracs)\n", + "\n", + "cohort_df = pd.DataFrame(cohort_rows).set_index(\"slide\")\n", + "\n", + "print(f\"Cohort: TCGA-BRCA | {len(cohort_df)} slides\\n\")\n", + "print(\"Median artifact fractions (% of tissue area):\")\n", + "display(cohort_df[CLASS_NAMES_SHORT].median().rename(\"median_%\").to_frame().T.round(1))\n", + "print()\n", + "print(\"90th-percentile artifact fractions:\")\n", + "display(cohort_df[CLASS_NAMES_SHORT].quantile(0.9).rename(\"p90_%\").to_frame().T.round(1))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Violin plot of per-slide artifact fractions across TCGA-BRCA\n", + "# Exclude Clean tissue (class 1) — typically >80%, dominates the y-axis\n", + "artifact_classes = [c for c in CLASS_NAMES_SHORT if c != \"Clean tissue\"]\n", + "artifact_colors = [mpl_colors[i] for i, c in enumerate(CLASS_NAMES_SHORT) if c != \"Clean tissue\"]\n", + "\n", + "fig, ax = plt.subplots(figsize=(9, 4))\n", + "data = [cohort_df[cls].dropna().values for cls in artifact_classes]\n", + "parts = ax.violinplot(data, positions=range(len(artifact_classes)),\n", + " showmedians=True, showextrema=False)\n", + "\n", + "for pc, color in zip(parts[\"bodies\"], artifact_colors):\n", + " pc.set_facecolor(color)\n", + " pc.set_alpha(0.75)\n", + "parts[\"cmedians\"].set_color(\"black\")\n", + "parts[\"cmedians\"].set_linewidth(1.5)\n", + "\n", + "ax.set_xticks(range(len(artifact_classes)))\n", + "ax.set_xticklabels(artifact_classes, rotation=20, ha=\"right\")\n", + "ax.set_ylabel(\"% of tissue area\")\n", + "ax.set_title(f\"Artifact fraction distribution — TCGA-BRCA ({len(cohort_df)} slides)\",\n", + " fontweight=\"bold\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Join QC stats with IDC sm_index metadata via ContainerIdentifier\n", + "cohort_meta = cohort_df.reset_index().rename(columns={\"slide\": \"ContainerIdentifier\"})\n", + "brca_meta = tcga_brca_he[[\"ContainerIdentifier\", \"tissue_type\"]].dropna(subset=[\"tissue_type\"])\n", + "\n", + "merged = cohort_meta.merge(brca_meta, on=\"ContainerIdentifier\", how=\"inner\")\n", + "\n", + "print(f\"Slides with tissue-type annotation: {len(merged)} / {len(cohort_df)}\\n\")\n", + "\n", + "# Median artifact fractions by tissue type\n", + "by_tissue = (\n", + " merged.groupby(\"tissue_type\")[CLASS_NAMES_SHORT]\n", + " .median()\n", + " .round(1)\n", + ")\n", + "print(\"Median artifact fractions (%) by tissue type:\")\n", + "display(by_tissue)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}