%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

import warnings

warnings.filterwarnings("ignore")

Preprocessing a WSI: From raw image to analysis-ready data#

Welcome to the second tutorial in our LazySlide series! In this lesson, we’ll walk you through a complete workflow for preprocessing whole slide images (WSIs). Preprocessing is a crucial step that transforms raw WSI data into a format that’s ready for advanced analysis.

By the end of this tutorial, you’ll understand how to:

  • Load and examine WSI data

  • Segment tissue regions from the background

  • Evaluate tissue quality

  • Create tiles (smaller patches) from large WSIs

  • Extract meaningful features from these tiles

Let’s get started with this essential foundation for computational pathology!

Setting up environment#

Let’s start by importing the LazySlide library. Throughout all our tutorials, we’ll use the convention of importing LazySlide as zs - this makes our code consistent and easy to read. This naming convention is similar to how other popular libraries like pandas (pd) and numpy (np) are typically imported.

import lazyslide as zs

Step 1: Opening a whole slide image#

Our first task is to load a whole slide image for processing. For this tutorial, we’ll download a sample artery tissue slide from the GTEx project. This is the same slide we explored in the previous tutorial, so you should already be familiar with its basic structure.

import pooch

slide = pooch.retrieve(
    "https://brd.nci.nih.gov/brd/imagedownload/GTEX-1117F-0526",
    fname="GTEX-1117F-0526.svs",
    path=".",
    known_hash="sha256:222ab7f2bb42dcd0bcfaccd910cb13be452b453499e6117ab553aa6cd60a135e",
)

Step 2: Understanding the WSIData object#

A key concept in LazySlide is the WSIData object, which is the foundation of our analysis workflow. Let’s take a moment to understand what makes this object so powerful:

  • WSIData extends the SpatialData framework, adding specialized capabilities for working with whole slide images

  • It provides a unified interface for accessing both the image data and associated metadata regardless of your WSI format

  • It’s compatible with other packages in the scverse ecosystem, allowing for seamless integration with single-cell analysis tools

  • It maintains a connection to your original WSI file while storing analysis results separately

When you open a WSI, the system automatically creates a WSIData object. By default, this object is stored in a directory alongside your original WSI file, making it easy to find and reuse your analysis results later.

If you’re interested in learning more about the technical details, you can explore the WSIData documentation.

Now, let’s create our WSIData object:

from wsidata import open_wsi

wsi = open_wsi("GTEX-1117F-0526.svs")

Step 3: Exploring WSIData object#

Let’s examine what’s inside our newly created WSIData object. This will help us understand the structure and components that we’ll be working with throughout our analysis.

wsi
WSI: GTEX-1117F-0526.svs
Reader: openslide
Dimensions: 19958×19919 (h×w), 3 Pyramids
Pixel physical size: 0.49 MPP (20X)
SpatialData object, with associated Zarr store: /Users/yzheng/Projects/LazySlide/docs/source/tutorials/GTEX-1117F-0526.zarr
├── Images
│     └── 'wsi_thumbnail': DataArray[cyx] (3, 1996, 1992)
├── Shapes
│     ├── 'dl-tissues': GeoDataFrame shape: (3, 2) (2D shapes)
│     ├── 'tile_stride_200': GeoDataFrame shape: (414, 3) (2D shapes)
│     ├── 'tiles': GeoDataFrame shape: (253, 5) (2D shapes)
│     └── 'tissues': GeoDataFrame shape: (2, 53) (2D shapes)
└── Tables
      ├── 'h0-mini_tiles': AnnData (253, 1536)
      └── 'resnet50_tiles': AnnData (253, 2048)
with coordinate systems:
    ▸ 'global', with elements:
        wsi_thumbnail (Images), dl-tissues (Shapes), tile_stride_200 (Shapes), tiles (Shapes), tissues (Shapes)

Now, let’s examine the important metadata of our slide. This information tells us crucial details about how the slide was scanned and its dimensions. For example, we can see that this slide was scanned at 20X magnification with a resolution of 0.4942 microns per pixel (mpp). We can also see that the full-resolution image has dimensions of 19958×19919 pixels - that’s nearly 400 million pixels in total!

wsi.properties

Slide Properties

FieldValue
shape[19958, 19919]
n_level3
level_shape[[19958, 19919], [4989, 4979], [2494, 2489]]
level_downsample[1.0, 4.000501706284455, 8.002609074152414]
mpp0.4942
magnification20.0
bounds[0, 0, 19919, 19958]

Let’s visualize our slide to get a better understanding of what we’re working with. LazySlide makes it easy to render a whole slide image with just a single line of code using the tissue function from the plotting module.

zs.pl.tissue(wsi)

Saving Our Work#

An important aspect of working with WSIData is that all changes are initially stored in memory. To preserve our work between sessions, we need to save it to disk. This can be done easily with the .write() method. By default, this saves our WSIData in a directory right next to our original slide file, making it easy to find later.

wsi.write()

Step 4: Tissue Segmentation#

One of the most fundamental steps in WSI analysis is tissue segmentation - the process of identifying and isolating the actual tissue regions from the background and artifacts. This step is crucial because:

  1. It reduces the computational burden by focusing only on relevant areas

  2. It eliminates background areas that could introduce noise into our analysis

  3. It allows us to work with distinct tissue pieces separately

LazySlide provides powerful tools to automatically detect tissue regions. Let’s see how this works:

zs.pp.find_tissues(wsi)

Let’s visualize the results of our tissue segmentation to make sure it worked correctly. In the visualization:

  • The green lines show the borders of detected tissue regions

  • The blue areas represent holes or empty spaces within the tissue that will be excluded from analysis

The default parameters of the tissue segmentation algorithm usually work quite well for most slides, but LazySlide offers many options to fine-tune the segmentation if needed.

zs.pl.tissue(wsi)

Examining the segmentation results#

If we examine our WSIData object again, we’ll notice something new: a GeoDataFrame named tissues has been added to the Shapes slot in our SpatialData object. This is where LazySlide stores the geometric information about each tissue region we’ve identified.

wsi
WSI: GTEX-1117F-0526.svs
Reader: openslide
Dimensions: 19958×19919 (h×w), 3 Pyramids
Pixel physical size: 0.49 MPP (20X)
SpatialData object, with associated Zarr store: /Users/yzheng/Projects/LazySlide/docs/source/tutorials/GTEX-1117F-0526.zarr
├── Images
│     └── 'wsi_thumbnail': DataArray[cyx] (3, 1996, 1992)
├── Shapes
│     ├── 'dl-tissues': GeoDataFrame shape: (3, 2) (2D shapes)
│     ├── 'tile_stride_200': GeoDataFrame shape: (414, 3) (2D shapes)
│     ├── 'tiles': GeoDataFrame shape: (253, 5) (2D shapes)
│     └── 'tissues': GeoDataFrame shape: (2, 2) (2D shapes)
└── Tables
      ├── 'h0-mini_tiles': AnnData (253, 1536)
      └── 'resnet50_tiles': AnnData (253, 2048)
with coordinate systems:
    ▸ 'global', with elements:
        wsi_thumbnail (Images), dl-tissues (Shapes), tile_stride_200 (Shapes), tiles (Shapes), tissues (Shapes)

We can access this table using dictionary-style notation with the key “tissues”. Let’s take a look at what information is stored for each tissue region:

Each row in this table represents a distinct tissue piece, and each tissue is assigned a unique tissue_id. This identifier is extremely useful as it allows us to:

  1. Reference specific tissue pieces in our analysis

  2. Track tissue pieces across different processing steps

  3. Apply operations to individual tissues rather than the entire slide

You might have noticed that these tissue_id values were also displayed in our earlier visualization, making it easy to connect what we see visually with the data in our tables.

wsi["tissues"]
tissue_id geometry
0 0 POLYGON ((5345.743 13804.501, 5337.74 13812.50...
1 1 POLYGON ((16029.226 2520.822, 16021.223 2528.8...

Focusing on individual tissue pieces#

One of the advantages of tissue segmentation is the ability to focus on specific regions of interest. LazySlide makes it easy to zoom in on a particular tissue piece by specifying its tissue_id in the visualization function. This is particularly useful when working with slides that contain multiple distinct tissue sections.

zs.pl.tissue(wsi, tissue_id=0)

Don’t forget to save your work!#

Remember that all the processing we’ve done so far (tissue segmentation, etc.) is stored in memory until we explicitly save it. It’s a good practice to save your work periodically, especially after completing important processing steps. This ensures you won’t lose your progress if your session ends unexpectedly.

wsi.write()

Loading previously saved work#

One of the great benefits of saving your work is that you can easily pick up where you left off in a future session. LazySlide automatically looks for existing WSIData when you open a slide, making it seamless to continue your analysis:

wsi = open_wsi("GTEX-1117F-0526.svs")

Advanced topic: Deep learning-based tissue segmentation#

In addition to the traditional image processing techniques we’ve used so far, LazySlide also offers more advanced deep learning-based approaches to tissue segmentation.

Let’s try the deep learning-based segmentation approach:

zs.seg.tissue(wsi, key_added="dl-tissues")
wsi
WSI: GTEX-1117F-0526.svs
Reader: openslide
Dimensions: 19958×19919 (h×w), 3 Pyramids
Pixel physical size: 0.49 MPP (20X)
SpatialData object, with associated Zarr store: /Users/yzheng/Projects/LazySlide/docs/source/tutorials/GTEX-1117F-0526.zarr
├── Images
│     └── 'wsi_thumbnail': DataArray[cyx] (3, 1996, 1992)
├── Shapes
│     ├── 'dl-tissues': GeoDataFrame shape: (3, 2) (2D shapes)
│     ├── 'tile_stride_200': GeoDataFrame shape: (414, 3) (2D shapes)
│     ├── 'tiles': GeoDataFrame shape: (253, 5) (2D shapes)
│     └── 'tissues': GeoDataFrame shape: (2, 2) (2D shapes)
└── Tables
      ├── 'h0-mini_tiles': AnnData (253, 1536)
      └── 'resnet50_tiles': AnnData (253, 2048)
with coordinate systems:
    ▸ 'global', with elements:
        wsi_thumbnail (Images), dl-tissues (Shapes), tile_stride_200 (Shapes), tiles (Shapes), tissues (Shapes)
zs.pl.tissue(wsi, tissue_key="dl-tissues")

Step 5: Tissue quality control and scoring#

Quality control is a critical step in any WSI analysis pipeline. Even with the best scanning equipment, issues can arise that affect image quality, such as:

  • Uneven staining

  • Out-of-focus regions

  • Scanning artifacts

  • Tissue folding or damage

LazySlide provides built-in scoring functions that can automatically evaluate various aspects of tissue quality. These scores can help you identify problematic regions or even exclude low-quality tissue pieces from your analysis. Let’s apply some common quality scorers to our tissue:

zs.pp.score_tissues(wsi, scorers=["brightness", "redness"])

Step 6: Calculating tissue properties#

Beyond just identifying tissue regions, it’s often valuable to calculate various geometric properties of each tissue piece. These properties can provide insights into tissue morphology and can be used as features in downstream analysis.

LazySlide makes it easy to calculate a comprehensive set of geometric properties for each tissue instance, including:

  • Area and perimeter

  • Compactness and roundness

  • Major and minor axis lengths

  • Orientation and eccentricity

Let’s calculate these properties for our tissue pieces:

zs.tl.tissue_props(wsi)
wsi["tissues"]
tissue_id geometry brightness redness area area_filled convex_area solidity convexity axis_major_length ... moment-mu21 moment-mu12 moment-mu03 moment-nu20 moment-nu11 moment-nu02 moment-nu30 moment-nu21 moment-nu12 moment-nu03
0 0 POLYGON ((5345.743 13804.501, 5337.74 13812.50... 188.943809 218.897005 9295453.0 9962045.0 10318845.0 0.900823 1.110096 4348.672363 ... 1.881723e+14 -5.012970e+14 6.182167e+13 0.068301 -0.026006 0.104072 0.000592 0.000601 -0.001600 0.000197
1 1 POLYGON ((16029.226 2520.822, 16021.223 2528.8... 194.979504 223.053437 8684173.0 8684173.0 9086193.5 0.955755 1.046293 4684.261719 ... 6.244954e+14 1.792367e+15 1.164329e+15 0.074724 0.048856 0.123623 -0.002956 0.002810 0.008065 0.005239

2 rows × 53 columns

Step 7: Tiling - Breaking down the WSI into manageable pieces#

We’ve now reached one of the most important preprocessing steps: tiling (also called patching). This process involves dividing the large whole slide image into smaller, manageable pieces that can be processed individually.

Why tiling is essential#

Tiling solves several critical challenges in WSI analysis:

  1. Memory constraints: As we’ve discussed, whole slide images are enormous (often several GB) and cannot fit entirely into memory

  2. Computational efficiency: Working with smaller tiles allows for parallel processing and faster analysis

  3. Feature extraction: Most deep learning models expect inputs of a fixed size, making tiles ideal for feature extraction

  4. Localized analysis: Tiling preserves spatial information while allowing for detailed analysis of specific regions

Harmonization Across Slides#

::{note} When working with multiple slides from different sources, harmonization becomes crucial to account for batch effects. If your slides were scanned at different magnifications, you should specify a consistent microns-per-pixel (mpp) value during tiling to ensure all tiles represent the same physical area, regardless of the original scanning resolution.

Here is a list of mpp value map with magnification:

Magnification MPP (Microns per Pixel)
40× 0.25
20× 0.5
10× 1

LazySlide’s flexible tiling capabilities#

LazySlide offers exceptional flexibility when it comes to tiling, you can:

  • Request tiles at any magnification level, not just the native scanning magnification (Upsampling is not allowed though)

  • Specify any tile size that suits your analysis needs

  • Control the amount of overlap between adjacent tiles

  • Focus tiling on specific tissue regions, avoiding empty background areas

By default, tiles are created without overlapping to avoid redundancy. All the parameters used for tiling are stored in a tile_spec object, which helps maintain consistency and reproducibility in your analysis. Let’s examine what this specification looks like:

zs.pp.tile_tissues(wsi, 256, mpp=0.5)
wsi.tile_spec("tiles")

Tile 1

Tile 2

Tile 3

Tile at: 0.5 mpp
Tile size: 256×256 (h×w)
Stride: 256×256 (0×0 overlap)
Operation size: 259×259, level=0
Base size: 259×259, level=0
Target tissue: 'tissues'

Creating overlapping tiles#

In some analysis scenarios, it’s beneficial to have overlapping tiles. For example:

  • When analyzing tissue structures that might cross tile boundaries

  • When applying algorithms that perform better with context from neighboring regions

  • Segmentation tasks

LazySlide allows you to control the amount of overlap by specifying the overlap (from 0 to 1) or stride size (the distance between the starting points of adjacent tiles). A stride smaller than the tile size creates overlapping tiles:

# If use overlap
zs.pp.tile_tissues(wsi, 256, overlap=0.1, mpp=0.5, key_added="tile_overlap_0.1")
# If use stride_px
zs.pp.tile_tissues(wsi, 256, stride_px=200, mpp=0.5, key_added="tile_stride_200")
wsi.tile_spec("tile_stride_200")

Tile 1

Tile 2

Tile 3

Tile at: 0.5 mpp
Tile size: 256×256 (h×w)
Stride: 200×200 (56×56 overlap)
Operation size: 259×259, level=0
Base size: 259×259, level=0
Target tissue: 'tissues'

The tiling result#

After tiling, LazySlide stores all tile information in a GeoDataFrame (by default named tiles).

Each tile is linked to its parent tissue through the tissue_id column, maintaining the hierarchical relationship between tissues and tiles. Let’s examine the tiles data:

wsi["tiles"]
tile_id tissue_id geometry
0 0 0 POLYGON ((4052 16394, 4052 16653, 3793 16653, ...
1 1 0 POLYGON ((4052 16653, 4052 16912, 3793 16912, ...
2 2 0 POLYGON ((4311 15617, 4311 15876, 4052 15876, ...
3 3 0 POLYGON ((4311 15876, 4311 16135, 4052 16135, ...
4 4 0 POLYGON ((4311 17171, 4311 17430, 4052 17430, ...
... ... ... ...
248 248 1 POLYGON ((17672 5369, 17672 5628, 17413 5628, ...
249 249 1 POLYGON ((17672 5628, 17672 5887, 17413 5887, ...
250 250 1 POLYGON ((17672 5887, 17672 6146, 17413 6146, ...
251 251 1 POLYGON ((17672 6146, 17672 6405, 17413 6405, ...
252 252 1 POLYGON ((17672 6405, 17672 6664, 17413 6664, ...

253 rows × 3 columns

tiles can be visualized using the tiles function in plotting module.

zs.pl.tiles(wsi, tissue_id=0, linewidth=0.5)

Evaluating tile quality#

Just as we evaluated the quality of entire tissue regions earlier, we can also assess the quality of individual tiles. This is particularly important because not all tiles are equally informative or suitable for analysis. Some tiles might be:

  • Out of focus

  • Contain scanning artifacts

  • Have poor contrast

  • Contain mostly background or whitespace

LazySlide provides scoring functions specifically designed for tiles. Let’s apply some common quality metrics:

zs.pp.score_tiles(wsi, scorers=["contrast", "focus"])
tile_id tissue_id geometry contrast focus
0 0 0 POLYGON ((4052 16394, 4052 16653, 3793 16653, ... 0.623529 0.0
1 1 0 POLYGON ((4052 16653, 4052 16912, 3793 16912, ... 0.623529 0.0
2 2 0 POLYGON ((4311 15617, 4311 15876, 4052 15876, ... 0.792157 0.0
3 3 0 POLYGON ((4311 15876, 4311 16135, 4052 16135, ... 0.768627 0.0
4 4 0 POLYGON ((4311 17171, 4311 17430, 4052 17430, ... 0.635294 0.0
... ... ... ... ... ...
248 248 1 POLYGON ((17672 5369, 17672 5628, 17413 5628, ... 0.690196 0.0
249 249 1 POLYGON ((17672 5628, 17672 5887, 17413 5887, ... 0.827451 0.0
250 250 1 POLYGON ((17672 5887, 17672 6146, 17413 6146, ... 0.666667 0.0
251 251 1 POLYGON ((17672 6146, 17672 6405, 17413 6405, ... 0.443137 0.0
252 252 1 POLYGON ((17672 6405, 17672 6664, 17413 6664, ... 0.545098 0.0

253 rows × 5 columns

wsi["tiles"]

Visualizing tile quality scores#

After calculating quality scores, it’s helpful to visualize them to identify patterns or problematic regions. LazySlide makes this easy by allowing you to color tiles based on any numerical property in the tiles dataframe.

Let’s visualize the contrast scores we just calculated:

zs.pl.tiles(wsi, tissue_id=0, color="contrast", cmap="rainbow")

Step 8: Feature extraction - transforming images into numerical representations#

We’ve now reached a crucial step in our preprocessing pipeline: feature extraction. This process transforms our image tiles into numerical representations (feature vectors) that capture the morphological characteristics of the tissue.

Why feature extraction matters#

Feature extraction is essential because:

  1. It converts complex visual information into a format that machine learning algorithms can process

  2. It enables quantitative comparison between different tissue regions

  3. It forms the foundation for tasks like classification, clustering, and anomaly detection

Using vision models for feature extraction#

The most effective way to extract meaningful features from histology images is to use pre-trained vision models. These models, often trained on millions of images, have learned to recognize patterns and structures that are relevant to many visual tasks.

LazySlide supports a wide range of vision models:

  • Standard architectures from the timm library (ResNet, DenseNet, EfficientNet, etc.)

  • Specialized pathology foundation models

  • Custom models that you can provide

You can explore the available timm models with:

import timm
timm.list_models()
zs.tl.feature_extraction(wsi, "resnet50", device="mps")

Using foundation models for pathology#

While general-purpose vision models like ResNet50 work well for many tasks, LazySlide also provides access to specialized foundation models that have been specifically trained on histology images. These models often capture pathology-specific features more effectively.

Let’s see what foundation models are available:

zs.models.list_models()

Note

Some foundation models require access permissions from their creators. If you encounter an access error, you’ll need to:

  1. Request access at the corresponding Hugging Face repository

  2. Generate a token from your Hugging Face account

  3. Login using the token

For this tutorial, we’ll use h0-mini, which is a smaller model with open access that demonstrates the same workflow as larger foundation models.

zs.tl.feature_extraction(wsi, "h0-mini", device="mps")
wsi

Features are saved as AnnData store with a convention of “{model name}_{tiles key}”.

Feature aggregation basics#

To perform analysis across dataset, a usual way is to pool features into a 1D vector that can represent the entire slide.

Features can be aggregated at different levels, such as slide-level and tissue-level.

zs.tl.feature_aggregation(wsi, "resnet50")
zs.tl.feature_aggregation(wsi, "resnet50", by="tissue_id")
AnnData object with n_obs × n_vars = 253 × 2048
    obs: 'tile_id', 'tissue_id', 'contrast', 'focus'
    uns: 'tile_spec', 'slide_properties'
    obsm: 'spatial'
    varm: 'agg_slide', 'agg_tissue_id'
wsi["resnet50_tiles"]
wsi.fetch.features_anndata("resnet50")

Save on the disk#

Now that we finished with our preprocessing. Remember to save the wsidata. By default it is saved side by side with the WSI. When you open the WSI next time, it will automatically pick up the wsidata.

wsi.write()

Notice that after saving, your SpatialData is now associated with a disk storage.

wsi

You can always change where it should be saved

import tempfile

with tempfile.TemporaryDirectory() as tmp:
    store = f"{tmp}/temp.zarr"
    wsi.write(store)

    wsi = open_wsi("GTEX-1117F-0526.svs", store=store)