Integration with slide-level labels#
In this tutorial we will demonstrate how to integrate whole slide images (WSIs) with slide-level labels and derive quantitative scores for each slide via top-K scoring.
We will also demonstrate how to run tasks in a distributed fashion using Dask.
For this, we will be using a pre-processed dataset of artery tissue from GTEx, which contains healthy and calcified samples.
from huggingface_hub import hf_hub_download
import pandas as pd
table = hf_hub_download(
    "rendeirolab/lazyslide-data", "GTEx_artery_dataset.csv.gz", repo_type="dataset"
)
dataset = pd.read_csv(table)
dataset.head()
| | Tissue Sample Id | Sex | Age Bracket | Pathology Categories |
|---|---|---|---|---|
| 0 | GTEX-111YS-2226 | male | 60-69 | calcification |
| 1 | GTEX-11GSP-2926 | female | 60-69 | calcification |
| 2 | GTEX-11LCK-1426 | male | 30-39 | clean_specimens |
| 3 | GTEX-11ONC-2726 | male | 60-69 | calcification |
| 4 | GTEX-12126-0726 | male | 20-29 | clean_specimens |
Here we prepare a curated set of terms for semantic analysis of the WSIs. Using a curated set of semantic terms helps make the output of vision-text models more consistent and easier to interpret.
terms = [
"BMP-2",
"Monckeberg sclerosis",
"Runx2",
"adventitia",
"apoptosis",
"arterial hardening",
"arterial narrowing",
"arterial remodeling",
"arterial stiffness",
"arteriole",
"artery",
"atherosclerosis",
"basement membrane",
"blood flow",
"bone morphogenetic protein",
"calcification",
"calcified nodule",
"calcium deposition",
"calcium phosphate",
"chronic kidney disease",
"collagen",
"compliance",
"connective tissue",
"elastic fibers",
"elasticity",
"endothelial dysfunction",
"endothelium",
"epithelium",
"external elastic lamina",
"extracellular matrix",
"fibroblast",
"fibrosis",
"fibrous cap",
"gap junction",
"hemodynamics",
"hydroxyapatite",
"hyperphosphatemia",
"inflammation",
"internal elastic lamina",
"interstitial space",
"intima",
"intimal calcification",
"intimal thickening",
"ischemia",
"lamina propria",
"lumen",
"macrocalcification",
"macrophage",
"matrix vesicle",
"mechanotransduction",
"media",
"medial calcification",
"microcalcification",
"mineralization",
"myofibroblast",
"necrotic core",
"osteoblast-like cell",
"osteocalcin",
"osteogenic",
"osteopontin",
"oxidative stress",
"pericyte",
"phosphate transporter",
"plaque",
"shear stress",
"smooth muscle",
"tight junction",
"tunica",
"vasa vasorum",
"vascular basement membrane",
"vascular compliance",
"vascular integrity",
"vascular niche",
"vascular ossification",
"vascular remodeling",
"vascular smooth muscle cell",
"vascular stiffness",
"vascular tone",
"vascular wall",
]
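For intuition about how these terms are used later: vision-text models such as CONCH embed text and image tiles in a shared space, and the relevance of a term to a tile is typically measured by cosine similarity. The sketch below illustrates this with random vectors standing in for real embeddings. The function name `cosine_similarity_matrix` and the array shapes are illustrative assumptions, and the exact computation inside `zs.tl.text_image_similarity` may differ.

```python
import numpy as np


def cosine_similarity_matrix(tile_features, text_embeddings):
    """Return a (n_tiles, n_terms) matrix of cosine similarities."""
    tiles = np.asarray(tile_features, dtype=float)
    texts = np.asarray(text_embeddings, dtype=float)
    # L2-normalize each row so the dot product equals the cosine similarity.
    tiles = tiles / np.linalg.norm(tiles, axis=1, keepdims=True)
    texts = texts / np.linalg.norm(texts, axis=1, keepdims=True)
    return tiles @ texts.T


# Random stand-ins: 5 tiles and 3 terms in an 8-dimensional embedding space.
rng = np.random.default_rng(0)
tile_features = rng.normal(size=(5, 8))
text_embeddings = rng.normal(size=(3, 8))

similarity = cosine_similarity_matrix(tile_features, text_embeddings)
print(similarity.shape)
```

Each column of the resulting matrix corresponds to one term, which is why a curated term list directly determines the set of scores available per slide.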
Since we need to process many slides, let's first define a reusable function that processes a single slide.
from wsidata import open_wsi
import lazyslide as zs
def wsi_feature_extraction(slide):
    s = hf_hub_download(
        "rendeirolab/lazyslide-data",
        f"gtex_artery_data/{slide}.svs",
        repo_type="dataset",
    )
    wsi = open_wsi(s, attach_thumbnail=False, store="data")
    zs.pp.find_tissues(wsi)
    zs.pp.tile_tissues(wsi, 256, mpp=0.5, background_fraction=0.5)
    # conch feature
    zs.tl.feature_extraction(wsi, "conch", pbar=False)
    zs.tl.feature_aggregation(wsi, "conch")
    embed = zs.tl.text_embedding(terms, "conch")
    zs.tl.text_image_similarity(wsi, embed, "conch")
    wsi.write()
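As a worked example of the tiling parameters above, a 256-pixel tile requested at 0.5 microns per pixel (mpp) covers 128 µm of tissue per side. The sketch below computes this, along with the number of pixels the same tile spans at the slide's native resolution. The native mpp of 0.25 is an assumed example value, and `tile_geometry` is a hypothetical helper; lazyslide handles this conversion internally.

```python
def tile_geometry(tile_px, target_mpp, native_mpp):
    """Return (physical side length in microns, side length in native pixels)."""
    physical_um = tile_px * target_mpp
    native_px = round(physical_um / native_mpp)
    return physical_um, native_px


# Assumed example: a slide scanned at 0.25 mpp (not read from the dataset).
physical_um, native_px = tile_geometry(tile_px=256, target_mpp=0.5, native_mpp=0.25)
print(physical_um, native_px)
```

Under these assumptions each tile corresponds to a 512 × 512 pixel region at native resolution, downsampled to 256 × 256 before feature extraction.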
Run for all slides#
The easiest way is to run a for-loop:
for slide in dataset["Tissue Sample Id"]:
    wsi_feature_extraction(slide)
However, this will take a long time and doesn’t fully use the power of parallelization.
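Because a long sequential run can be interrupted, it may help to skip slides that already have an output store. The sketch below assumes that each processed slide ends up at `data/<slide>.zarr` (the layout used later in this tutorial) and that an existing store means processing finished, which may not hold if a job crashed midway. `pending_slides` is a hypothetical helper, not part of lazyslide.

```python
import tempfile
from pathlib import Path


def pending_slides(slide_ids, store_dir="data"):
    """Return the slide IDs that have no `<store_dir>/<slide>.zarr` store yet."""
    store_dir = Path(store_dir)
    return [s for s in slide_ids if not (store_dir / f"{s}.zarr").exists()]


# Demonstration in a temporary directory containing one pre-existing store.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "GTEX-111YS-2226.zarr").mkdir()
    todo = pending_slides(["GTEX-111YS-2226", "GTEX-11GSP-2926"], store_dir=tmp)

print(todo)
```

In the loop above, iterating over `pending_slides(dataset["Tissue Sample Id"])` instead of the full column would then resume where a previous run stopped.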
Distributed processing with dask#
Dask is a good option for parallelization on a local machine or across multiple machines.
Depending on the hardware available, the options are:
dask-jobqueue: For PBS, Slurm, MOAB, SGE, LSF, and HTCondor.
coiled: AWS, GCP, Azure etc.
dask-cuda: If you have multiple GPU cards locally.
Here, we showcase how to parallelize the jobs with Dask on a SLURM cluster. The configuration may not work on your SLURM system as is; please adjust it accordingly.
When running GPU-intensive work like feature extraction for multiple WSIs, we recommend running only one task per GPU at a time. To speed up processing, distribute the tasks across multiple GPU cards or multiple machines.
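To illustrate the one-task-per-GPU constraint independently of Dask, here is a minimal sketch using only the Python standard library. It is not how Dask schedules work: it simply hands out device IDs from a queue so that no two tasks hold the same GPU at once. The names `run_one_task_per_gpu` and `fake_work` are hypothetical, and the "work" is a short sleep rather than real feature extraction.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_one_task_per_gpu(tasks, gpu_ids, work):
    """Run work(task, gpu) for every task, never sharing a GPU between tasks."""
    free_gpus = queue.Queue()
    for gpu in gpu_ids:
        free_gpus.put(gpu)

    def run(task):
        gpu = free_gpus.get()  # wait until a GPU is free
        try:
            return work(task, gpu)
        finally:
            free_gpus.put(gpu)  # hand the GPU back for the next task

    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return list(pool.map(run, tasks))


in_use = set()
overlaps = []
lock = threading.Lock()


def fake_work(task, gpu):
    # Record any case where two tasks hold the same GPU simultaneously.
    with lock:
        if gpu in in_use:
            overlaps.append((task, gpu))
        in_use.add(gpu)
    time.sleep(0.01)  # stand-in for GPU work
    with lock:
        in_use.discard(gpu)
    return task, gpu


results = run_one_task_per_gpu(["s1", "s2", "s3", "s4"], [0, 1], fake_work)
print(results, overlaps)
```

With Dask, the same constraint is expressed declaratively through worker resources, as in the SLURM configuration and the `resources={"GPU": 1}` argument used in this tutorial.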
Here are code snippets for different setups.
Run local with CPUs:
from dask.distributed import LocalCluster
cluster = LocalCluster()
Run local with many GPUs:
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()
Run on a SLURM cluster with GPUs (example script; it may need changes to work on your cluster):
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
    queue="gpu",
    cores=8,
    processes=1,
    memory="20 GB",
    # For SLURM, use the --gres flag to request a GPU
    job_extra_directives=["--gres=gpu:h100pcie:1"],
    # Each worker must have one GPU
    worker_extra_args=["--resources GPU=1"],
)
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
    queue="gpu",
    cores=8,
    processes=1,
    memory="20 GB",
    interface="ib1",
    job_extra_directives=["-q gpu", "--gres=gpu:l4_gpu:1", "--time=2:00:00"],
    worker_extra_args=["--resources GPU=1"],
    log_directory="./dask-logs",
)
from dask.distributed import Client
client = Client(cluster)
cluster.adapt(minimum=1, maximum=10)
<distributed.deploy.adaptive.Adaptive at 0x1555114dd700>
client
Client
Client-9365270a-6d51-11f0-bf27-4f153bb806f6
| Connection method: Cluster object | Cluster type: dask_jobqueue.SLURMCluster |
| Dashboard: http://10.110.89.41:8787/status |
Cluster Info
SLURMCluster
03026b50
| Dashboard: http://10.110.89.41:8787/status | Workers: 0 |
| Total threads: 0 | Total memory: 0 B |
Scheduler Info
Scheduler
Scheduler-a7c386ef-0280-494d-b545-dba1564844c8
| Comm: tcp://10.110.89.41:36261 | Workers: 0 |
| Dashboard: http://10.110.89.41:8787/status | Total threads: 0 |
| Started: Just now | Total memory: 0 B |
Workers
Now let's submit one task per slide, each requesting one GPU resource:
futures = [
    client.submit(wsi_feature_extraction, slide, resources={"GPU": 1})
    for slide in dataset["Tissue Sample Id"]
]
If you want to monitor progress, you can either open the Dask dashboard or use a simple progress bar:
from dask.distributed import as_completed
from tqdm.auto import tqdm
for _ in tqdm(as_completed(futures), total=len(futures)):
    pass
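A progress bar shows that tasks have finished but not whether they succeeded, so it can be worth checking for failures before shutting the client down. The sketch below demonstrates the pattern with the standard library's `concurrent.futures`; Dask futures expose a similar `exception()` method, so the same helper should carry over, although it has not been tested against the cluster above. `collect_failures` and `fake_task` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_failures(futures_by_slide):
    """Map each failed slide ID to the exception raised by its task."""
    failures = {}
    for slide, future in futures_by_slide.items():
        error = future.exception()  # blocks until the task has finished
        if error is not None:
            failures[slide] = error
    return failures


def fake_task(slide):
    # Stand-in for wsi_feature_extraction; fails for one made-up slide ID.
    if slide == "bad-slide":
        raise ValueError("could not open slide")
    return slide


with ThreadPoolExecutor(max_workers=2) as pool:
    futures_by_slide = {
        slide: pool.submit(fake_task, slide)
        for slide in ["good-slide", "bad-slide"]
    }
    failures = collect_failures(futures_by_slide)

print(failures)
```

Keeping the futures in a dictionary keyed by slide ID, rather than a plain list, makes it straightforward to resubmit only the slides that failed.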
client.shutdown()
We can calculate the scores for all pathological terms that we defined and save them for further analysis.
from pathlib import Path
from anndata import read_zarr
slide_scores = {}
for store in Path("data").glob("*.zarr"):
    adata = read_zarr(store / "tables" / "conch_tiles_text_similarity")
    scores = zs.metrics.topk_score(adata, k=100)
    slide_scores[store.stem] = dict(zip(adata.var.index, scores))
slide_scores = pd.DataFrame(slide_scores).T
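To make the idea of top-K scoring concrete, here is a minimal NumPy sketch. It assumes the slide-level score for a term is the mean of the K highest tile-level similarities for that term; this is a common definition, but the exact formula used by `zs.metrics.topk_score` may differ. The function name `topk_mean_score` and the toy matrix are illustrative only.

```python
import numpy as np


def topk_mean_score(similarity, k=100):
    """Mean of the k largest values in each column of a (n_tiles, n_terms) array."""
    sim = np.asarray(similarity, dtype=float)
    n_tiles = sim.shape[0]
    k = min(k, n_tiles)  # a slide may have fewer than k tiles
    # After partitioning, the last k rows hold the k largest values per column.
    top = np.partition(sim, n_tiles - k, axis=0)[n_tiles - k:]
    return top.mean(axis=0)


# Toy matrix: 4 tiles (rows) scored against 2 terms (columns).
toy = np.array(
    [
        [0.1, 0.9],
        [0.5, 0.2],
        [0.3, 0.4],
        [0.7, 0.6],
    ]
)
scores = topk_mean_score(toy, k=2)
print(scores)
```

For the toy matrix, the two highest values are 0.5 and 0.7 for the first term and 0.9 and 0.6 for the second, giving scores of about 0.6 and 0.75. Averaging only the top tiles keeps a focal finding, such as a small calcified region, from being diluted by the many unaffected tiles on the slide.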
Slide aggregation#
Once each slide has slide-level features and scores, we aggregate them into a single AnnData object.
from wsidata import agg_wsi
dataset["store"] = [f"data/{s}.zarr" for s in dataset["Tissue Sample Id"]]
agg_data = agg_wsi(dataset, "conch", store_col="store", agg_key="agg_slide")
agg_data.obs = agg_data.obs.join(slide_scores, on="Tissue Sample Id")
agg_data
AnnData object with n_obs × n_vars = 45 × 512
obs: 'Tissue Sample Id', 'Sex', 'Age Bracket', 'Pathology Categories', 'store', 'BMP-2', 'Monckeberg sclerosis', 'Runx2', 'adventitia', 'apoptosis', 'arterial hardening', 'arterial narrowing', 'arterial remodeling', 'arterial stiffness', 'arteriole', 'artery', 'atherosclerosis', 'basement membrane', 'blood flow', 'bone morphogenetic protein', 'calcification', 'calcified nodule', 'calcium deposition', 'calcium phosphate', 'chronic kidney disease', 'collagen', 'compliance', 'connective tissue', 'elastic fibers', 'elasticity', 'endothelial dysfunction', 'endothelium', 'epithelium', 'external elastic lamina', 'extracellular matrix', 'fibroblast', 'fibrosis', 'fibrous cap', 'gap junction', 'hemodynamics', 'hydroxyapatite', 'hyperphosphatemia', 'inflammation', 'internal elastic lamina', 'interstitial space', 'intima', 'intimal calcification', 'intimal thickening', 'ischemia', 'lamina propria', 'lumen', 'macrocalcification', 'macrophage', 'matrix vesicle', 'mechanotransduction', 'media', 'medial calcification', 'microcalcification', 'mineralization', 'myofibroblast', 'necrotic core', 'osteoblast-like cell', 'osteocalcin', 'osteogenic', 'osteopontin', 'oxidative stress', 'pericyte', 'phosphate transporter', 'plaque', 'shear stress', 'smooth muscle', 'tight junction', 'tunica', 'vasa vasorum', 'vascular basement membrane', 'vascular compliance', 'vascular integrity', 'vascular niche', 'vascular ossification', 'vascular remodeling', 'vascular smooth muscle cell', 'vascular stiffness', 'vascular tone', 'vascular wall'
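The `join(..., on="Tissue Sample Id")` call above matches values in that column of `obs` against the index of `slide_scores`. The toy example below shows this behavior with made-up sample IDs and scores; a slide without a matching score row (for example, one whose processing failed) keeps its row and receives `NaN`.

```python
import pandas as pd

# Made-up metadata and scores, for illustration only.
obs = pd.DataFrame(
    {"Tissue Sample Id": ["A", "B", "C"], "Sex": ["male", "female", "male"]}
)
scores = pd.DataFrame({"calcification": [0.31, 0.12]}, index=["A", "B"])

# Left join: the column values of `obs` are looked up in the index of `scores`.
joined = obs.join(scores, on="Tissue Sample Id")
print(joined)
```

Checking the joined table for missing values is a quick way to spot slides that lack scores before running downstream analysis.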
agg_data.write_h5ad("agg_conch_features.h5ad")
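With the scores stored alongside the slide-level labels, a natural first check is whether a term's score separates the label groups. The sketch below uses a small table of made-up values that mimics two columns of `agg_data.obs`; the numbers are not results from this dataset.

```python
import pandas as pd

# Made-up scores for illustration; real values would come from agg_data.obs.
obs = pd.DataFrame(
    {
        "Pathology Categories": [
            "calcification",
            "clean_specimens",
            "calcification",
            "clean_specimens",
        ],
        "calcification": [0.30, 0.10, 0.26, 0.12],
    }
)

# Mean top-K score of the "calcification" term within each label group.
group_means = obs.groupby("Pathology Categories")["calcification"].mean()
print(group_means)
```

On real data, the same `groupby` can be applied to any of the term columns to see which terms best distinguish calcified from clean specimens.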