Integration with slide-level labels

In this tutorial we will demonstrate how to integrate whole slide images (WSIs) with slide-level labels and derive quantitative scores for each slide via top-K scoring.

We will also demonstrate how to run tasks in a distributed fashion using dask.

For this, we will be using a pre-processed dataset of artery tissue from GTEx, which contains healthy and calcified samples.

from huggingface_hub import hf_hub_download
import pandas as pd

table = hf_hub_download(
    "rendeirolab/lazyslide-data", "GTEx_artery_dataset.csv.gz", repo_type="dataset"
)

dataset = pd.read_csv(table)
dataset.head()

	Tissue Sample Id	Sex	Age Bracket	Pathology Categories
0	GTEX-111YS-2226	male	60-69	calcification
1	GTEX-11GSP-2926	female	60-69	calcification
2	GTEX-11LCK-1426	male	30-39	clean_specimens
3	GTEX-11ONC-2726	male	60-69	calcification
4	GTEX-12126-0726	male	20-29	clean_specimens
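
Before processing, it can help to check how the slide-level labels are balanced. The sketch below rebuilds only the five rows printed above (the full table contains more slides) and counts slides per pathology category; on the real data the same call would be applied to `dataset`.

```python
import pandas as pd

# Only the five rows shown in the preview above; the real `dataset` is larger.
preview = pd.DataFrame(
    {
        "Tissue Sample Id": [
            "GTEX-111YS-2226",
            "GTEX-11GSP-2926",
            "GTEX-11LCK-1426",
            "GTEX-11ONC-2726",
            "GTEX-12126-0726",
        ],
        "Pathology Categories": [
            "calcification",
            "calcification",
            "clean_specimens",
            "calcification",
            "clean_specimens",
        ],
    }
)

# Number of slides per slide-level label
label_counts = preview["Pathology Categories"].value_counts()
print(label_counts)
```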

Here we prepare a curated set of terms for semantic analysis of the WSIs. A fixed vocabulary makes the output of vision-text models more consistent and easier to interpret.

terms = [
    "BMP-2",
    "Monckeberg sclerosis",
    "Runx2",
    "adventitia",
    "apoptosis",
    "arterial hardening",
    "arterial narrowing",
    "arterial remodeling",
    "arterial stiffness",
    "arteriole",
    "artery",
    "atherosclerosis",
    "basement membrane",
    "blood flow",
    "bone morphogenetic protein",
    "calcification",
    "calcified nodule",
    "calcium deposition",
    "calcium phosphate",
    "chronic kidney disease",
    "collagen",
    "compliance",
    "connective tissue",
    "elastic fibers",
    "elasticity",
    "endothelial dysfunction",
    "endothelium",
    "epithelium",
    "external elastic lamina",
    "extracellular matrix",
    "fibroblast",
    "fibrosis",
    "fibrous cap",
    "gap junction",
    "hemodynamics",
    "hydroxyapatite",
    "hyperphosphatemia",
    "inflammation",
    "internal elastic lamina",
    "interstitial space",
    "intima",
    "intimal calcification",
    "intimal thickening",
    "ischemia",
    "lamina propria",
    "lumen",
    "macrocalcification",
    "macrophage",
    "matrix vesicle",
    "mechanotransduction",
    "media",
    "medial calcification",
    "microcalcification",
    "mineralization",
    "myofibroblast",
    "necrotic core",
    "osteoblast-like cell",
    "osteocalcin",
    "osteogenic",
    "osteopontin",
    "oxidative stress",
    "pericyte",
    "phosphate transporter",
    "plaque",
    "shear stress",
    "smooth muscle",
    "tight junction",
    "tunica",
    "vasa vasorum",
    "vascular basement membrane",
    "vascular compliance",
    "vascular integrity",
    "vascular niche",
    "vascular ossification",
    "vascular remodeling",
    "vascular smooth muscle cell",
    "vascular stiffness",
    "vascular tone",
    "vascular wall",
]

Since we need to process many slides, let’s first define a reusable function that processes a single slide.

from wsidata import open_wsi
import lazyslide as zs


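def wsi_feature_extraction(slide):
    """Download one slide, extract CONCH features, and score it against `terms`."""
    # Download the slide from the Hugging Face dataset repository
    s = hf_hub_download(
        "rendeirolab/lazyslide-data",
        f"gtex_artery_data/{slide}.svs",
        repo_type="dataset",
    )
    # Open the slide; results are stored under the "data" directory
    wsi = open_wsi(s, attach_thumbnail=False, store="data")
    # Detect tissue regions and tile them (256 px tiles at 0.5 mpp)
    zs.pp.find_tissues(wsi)
    zs.pp.tile_tissues(wsi, 256, mpp=0.5, background_fraction=0.5)

    # Tile-level CONCH features, then aggregation to slide level
    zs.tl.feature_extraction(wsi, "conch", pbar=False)
    zs.tl.feature_aggregation(wsi, "conch")
    # Embed the terms and score every tile against every term
    embed = zs.tl.text_embedding(terms, "conch")
    zs.tl.text_image_similarity(wsi, embed, "conch")
    # Persist the results to the store
    wsi.write()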
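
The last two steps of the function score each tile against each term. As a rough illustration of the idea, the sketch below computes a cosine similarity matrix between tile and text embeddings. The embeddings are random placeholders, the helper `cosine_similarity` is hypothetical, and the exact computation inside `zs.tl.text_image_similarity` may differ (for example in scaling or normalization).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings (assumption): 6 tiles and 3 terms in a shared 512-d space
tile_emb = rng.normal(size=(6, 512))
text_emb = rng.normal(size=(3, 512))


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# One row per tile, one column per term
similarity = cosine_similarity(tile_emb, text_emb)
print(similarity.shape)
```

Each column of such a matrix can then be summarized into a single per-slide value for its term.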

Run for all slides

The easiest way is to run a for-loop:

for slide in dataset["Tissue Sample Id"]:
    wsi_feature_extraction(slide)

However, this processes one slide at a time, which is slow and leaves any additional GPUs or machines idle.
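
Long runs are also easier to resume if slides that already have a store are skipped. The sketch below assumes each processed slide ends up at `data/<slide id>.zarr`, as the aggregation step later in this tutorial expects; `pending_slides` is a hypothetical helper, not part of lazyslide. Note that an existing store does not prove processing finished, so a partially written store would need a stricter check.

```python
from pathlib import Path
import tempfile


def pending_slides(slide_ids, store_dir="data"):
    """Return the slide IDs that have no `<store_dir>/<id>.zarr` store yet."""
    store_dir = Path(store_dir)
    return [s for s in slide_ids if not (store_dir / f"{s}.zarr").exists()]


# Small demonstration in a temporary directory instead of the real "data" folder
with tempfile.TemporaryDirectory() as tmp:
    # Pretend the first slide was already processed
    (Path(tmp) / "GTEX-111YS-2226.zarr").mkdir()
    todo = pending_slides(["GTEX-111YS-2226", "GTEX-11GSP-2926"], store_dir=tmp)
    print(todo)
```

In the tutorial, the loop would then iterate over `pending_slides(dataset["Tissue Sample Id"])` instead of the full column.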

Distributed processing with dask

Dask is a good option for parallelization, either on a local machine or across multiple machines.

Depending on the hardware available, the options include:

dask-jobqueue: for HPC schedulers such as PBS, Slurm, MOAB, SGE, LSF, and HTCondor.
coiled: for cloud providers such as AWS, GCP, and Azure.
dask-cuda: for multiple GPU cards on a local machine.

Here, we show how to parallelize the jobs with dask on a SLURM cluster. The configuration may not work on your SLURM system as is, so please adjust it accordingly.

When running GPU-intensive work such as feature extraction on multiple WSIs, we recommend running only one task per GPU at a time. To speed up processing, distribute the tasks across multiple GPU cards or multiple machines.

Here are code snippets for different setups.

Run locally with CPUs:

from dask.distributed import LocalCluster
cluster = LocalCluster()

Run locally with multiple GPUs:

from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()

Run on a SLURM cluster with GPUs (example script; it may not work on your cluster):

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="gpu",
    cores=8,
    processes=1,
    memory="20 GB",
    # For SLURM, use --gres flag to get GPU
    job_extra_directives=["--gres=gpu:h100pcie:1"],
    # Each worker must have exactly one GPU
    worker_extra_args=["--resources GPU=1"],
)

A more complete variant, which also sets the network interface, a time limit, and a log directory:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="gpu",
    cores=8,
    processes=1,
    memory="20 GB",
    interface="ib1",
    job_extra_directives=["-q gpu", "--gres=gpu:l4_gpu:1", "--time=2:00:00"],
    worker_extra_args=["--resources GPU=1"],
    log_directory="./dask-logs",
)

from dask.distributed import Client

client = Client(cluster)
cluster.adapt(minimum=1, maximum=10)

<distributed.deploy.adaptive.Adaptive at 0x1555114dd700>

client

Client: Client-9365270a-6d51-11f0-bf27-4f153bb806f6
Connection method: Cluster object    Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.110.89.41:8787/status

Cluster Info: SLURMCluster 03026b50
Dashboard: http://10.110.89.41:8787/status    Workers: 0
Total threads: 0    Total memory: 0 B

Scheduler Info: Scheduler-a7c386ef-0280-494d-b545-dba1564844c8
Comm: tcp://10.110.89.41:36261    Workers: 0
Dashboard: http://10.110.89.41:8787/status    Total threads: 0
Started: Just now    Total memory: 0 B

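Let’s parallelize the jobs. Each task requests one GPU resource, which matches the --resources GPU=1 setting of the workers, so every worker processes one slide at a time. Workers that do not declare a GPU resource, such as those of a default LocalCluster, will not pick up these tasks.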

futures = [
    client.submit(wsi_feature_extraction, slide, resources={"GPU": 1})
    for slide in dataset["Tissue Sample Id"]
]

To monitor progress, you can either open the dask dashboard or use a simple progress bar:

from dask.distributed import as_completed
from tqdm.auto import tqdm

for _ in tqdm(as_completed(futures), total=len(futures)):
    pass
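
With many slides, a few tasks may fail (for example on a corrupted download), so it is worth recording which ones did before shutting the cluster down. The sketch below shows the pattern with Python's standard `concurrent.futures` as a stand-in for dask, so it runs without a cluster; dask futures also expose `.exception()`, so the same loop shape should carry over. The `process` function and the slide names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process(slide):
    """Hypothetical stand-in for wsi_feature_extraction."""
    if slide.endswith("bad"):
        raise ValueError(f"cannot process {slide}")
    return slide


slides = ["slide-a", "slide-b", "slide-bad"]  # made-up slide names
failed = {}

with ThreadPoolExecutor(max_workers=2) as pool:
    # Keep a mapping from each future back to its slide ID
    future_to_slide = {pool.submit(process, s): s for s in slides}
    for fut in as_completed(future_to_slide):
        err = fut.exception()  # None if the task succeeded
        if err is not None:
            failed[future_to_slide[fut]] = repr(err)

print(failed)
```

The failed slide IDs can then be resubmitted or inspected individually.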

client.shutdown()

Once all slides are processed, we compute a top-K score (here K=100) for every term on every slide and collect the results into a table for further analysis.

from pathlib import Path
from anndata import read_zarr

slide_scores = {}
for store in Path("data").glob("*.zarr"):
    adata = read_zarr(store / "tables" / "conch_tiles_text_similarity")
    scores = zs.metrics.topk_score(adata, k=100)
    slide_scores[store.stem] = dict(zip(adata.var.index, scores))

slide_scores = pd.DataFrame(slide_scores).T
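
To make the idea of top-K scoring concrete, here is a minimal sketch under the assumption that the score of a term is the mean of its K highest tile similarities. `zs.metrics.topk_score` may use a different aggregation; the `topk_mean` helper and the small matrix are illustrative only.

```python
import numpy as np


def topk_mean(similarity, k):
    """Mean of the k highest values in each column (one column per term)."""
    k = min(k, similarity.shape[0])  # a slide may have fewer than k tiles
    # np.partition places the k largest values of each column in the last k rows
    top = np.partition(similarity, -k, axis=0)[-k:]
    return top.mean(axis=0)


# Toy similarity matrix: 4 tiles (rows) x 2 terms (columns)
sim = np.array(
    [
        [0.1, 0.9],
        [0.4, 0.2],
        [0.3, 0.8],
        [0.2, 0.1],
    ]
)
scores = topk_mean(sim, k=2)
print(scores)
```

Averaging only the best-matching tiles keeps a small but strongly matching region from being diluted by the rest of the slide.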

Slide aggregation

Now that every slide has slide-level features and scores, we aggregate them into a single AnnData object with one row per slide.

from wsidata import agg_wsi

dataset["store"] = [f"data/{s}.zarr" for s in dataset["Tissue Sample Id"]]
agg_data = agg_wsi(dataset, "conch", store_col="store", agg_key="agg_slide")
agg_data.obs = agg_data.obs.join(slide_scores, on="Tissue Sample Id")
agg_data

AnnData object with n_obs × n_vars = 45 × 512
    obs: 'Tissue Sample Id', 'Sex', 'Age Bracket', 'Pathology Categories', 'store', 'BMP-2', 'Monckeberg sclerosis', 'Runx2', 'adventitia', 'apoptosis', 'arterial hardening', 'arterial narrowing', 'arterial remodeling', 'arterial stiffness', 'arteriole', 'artery', 'atherosclerosis', 'basement membrane', 'blood flow', 'bone morphogenetic protein', 'calcification', 'calcified nodule', 'calcium deposition', 'calcium phosphate', 'chronic kidney disease', 'collagen', 'compliance', 'connective tissue', 'elastic fibers', 'elasticity', 'endothelial dysfunction', 'endothelium', 'epithelium', 'external elastic lamina', 'extracellular matrix', 'fibroblast', 'fibrosis', 'fibrous cap', 'gap junction', 'hemodynamics', 'hydroxyapatite', 'hyperphosphatemia', 'inflammation', 'internal elastic lamina', 'interstitial space', 'intima', 'intimal calcification', 'intimal thickening', 'ischemia', 'lamina propria', 'lumen', 'macrocalcification', 'macrophage', 'matrix vesicle', 'mechanotransduction', 'media', 'medial calcification', 'microcalcification', 'mineralization', 'myofibroblast', 'necrotic core', 'osteoblast-like cell', 'osteocalcin', 'osteogenic', 'osteopontin', 'oxidative stress', 'pericyte', 'phosphate transporter', 'plaque', 'shear stress', 'smooth muscle', 'tight junction', 'tunica', 'vasa vasorum', 'vascular basement membrane', 'vascular compliance', 'vascular integrity', 'vascular niche', 'vascular ossification', 'vascular remodeling', 'vascular smooth muscle cell', 'vascular stiffness', 'vascular tone', 'vascular wall'

agg_data.write_h5ad("agg_conch_features.h5ad")
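
With the term scores stored next to the slide-level labels in `obs`, a simple first analysis is to compare mean scores between label groups. The sketch below uses a small hand-made table with invented score values (not results from this dataset) in place of `agg_data.obs`.

```python
import pandas as pd

# Invented values for illustration only; in practice use agg_data.obs
obs = pd.DataFrame(
    {
        "Pathology Categories": [
            "calcification",
            "calcification",
            "clean_specimens",
            "clean_specimens",
        ],
        "calcification": [0.42, 0.38, 0.21, 0.25],
        "collagen": [0.30, 0.28, 0.31, 0.33],
    }
)

# Mean score of every term within each label group
group_means = obs.groupby("Pathology Categories").mean(numeric_only=True)

# Terms ranked by how much higher they score in calcified slides
diff = group_means.loc["calcification"] - group_means.loc["clean_specimens"]
print(diff.sort_values(ascending=False))
```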