Analysis

MicheleBortol edited this page Nov 15, 2021 · 30 revisions

SIMPLI Analysis Steps

A) Raw image processing

The first step in the SIMPLI analysis workflow is the preprocessing of raw images. It consists of 3 processes:

A.1) Image Extraction

In this process tiff files are extracted from the raw acquisition data from imaging mass cytometry (IMC) experiments. This process should be skipped if the input data does not consist of raw IMC data. See the input page for more details.

Inputs and parameters:

Outputs:

  • Images: Images (uncompressed 16 bit tiff) can be output in two different formats:
    • single channel tiff files (one for each of the selected channels) ($output_folder/Images/Raw/sample_name/sample_name-label-raw.tiff )
    • .ome.tiff files (one per sample, the order of channels is the same as in the channel_metadata file). ($output_folder/Images/Raw/sample_name/sample_name-all_raw.ome.tiff)
  • Metadata:
    • Metadata for all images from all samples: $output_folder/Images/Raw/raw_tiff_metadata.csv
    • By sample metadata for the raw images is also output at:
      $output_folder/Images/Raw/sample_name/sample_name-raw_tiff_metadata.csv

The output of this process is located at: $output_folder/Images/Raw/

This process can be skipped by setting the skip_conversion parameter to true.

A.2) Image normalisation

This process performs 99th percentile normalisation of the raw tiff images generated in the image extraction process, or of the images specified by the user if the image extraction process is skipped. Images are normalised individually by marker and by sample, which enables the use of a single threshold for the same marker across multiple samples. This might not always be desirable if the staining is not uniform within samples (images). Additionally, if the images for some markers have particularly low signal-to-noise ratios, the 99th percentile cutoff for normalisation could be too stringent. In these cases the normalisation can be skipped and sample-specific thresholds can be used in the image thresholding and masking step.
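The idea of per-image percentile normalisation can be illustrated with a minimal sketch. It assumes that values above the 99th percentile are clipped and that the result is rescaled to the full uint16 range; the function name and these details are illustrative assumptions and may differ from what SIMPLI does internally.

```python
import numpy as np


def normalise_99th_percentile(image, percentile=99.0):
    """Rescale one single-channel image by its own percentile cutoff.

    Assumption (not taken from SIMPLI): values above the cutoff are clipped
    and the result is stretched to the full uint16 range.
    """
    img = np.asarray(image, dtype=np.float64)
    cutoff = np.percentile(img, percentile)
    if cutoff <= 0:
        # Empty or all-zero channel: nothing to rescale.
        return np.zeros(img.shape, dtype=np.uint16)
    scaled = np.clip(img / cutoff, 0.0, 1.0)
    return np.round(scaled * 65535).astype(np.uint16)


# Toy image with values 0..100; its 99th percentile is 99.
toy = np.arange(101, dtype=np.uint16).reshape(1, 101)
normalised = normalise_99th_percentile(toy)
```

Because each marker image of each sample is rescaled by its own cutoff, the same threshold value refers to the same relative intensity in every sample.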

Inputs and parameters:

Outputs:

  • Normalised Images: Images (uncompressed 16 bit tiff) can be output in two different formats:
    • single channel tiff files (one for each of the selected channels) ($output_folder/Images/Normalized/sample_name/sample_name-label-normalized.tiff )
    • .ome.tiff files (one per sample, the order of channels is the same as in the channel_metadata file). ($output_folder/Images/Normalized/sample_name/sample_name-ALL-normalized.ome.tiff)
  • Metadata:
    • Metadata for all images from all samples: $output_folder/Images/Normalized/normalized_tiff_metadata.csv
    • By sample metadata for the normalized images is also output at:
      • $output_folder/Images/Normalized/sample_name/sample_name-normalized_tiff_metadata.csv in long format.
      • $output_folder/Images/Normalized/sample_name/sample_name-normalized_tiff_metadata.csv in CellProfiler4 compatible wide format.

The output of this process is located at: $output_folder/Images/Normalized/

This process can be skipped by setting the skip_normalization parameter to true.

A.3) Image thresholding and masking

This process is used to perform the image preprocessing that will generate the final images, which can then be used as input for the pixel-based or the cell-based analysis. The input images for this process can be derived from:

Inputs and parameters:

Outputs:

  • Preprocessed Images: (uncompressed 16 bit single-channel tiff)
    $output_folder/Images/Preprocessed/sample_name/sample_name-label-Preprocessed.tiff
  • Metadata:
    • Metadata for all images from all samples $output_folder/Images/Preprocessed/preprocessed_tiff_metadata.csv
    • By sample metadata for the preprocessed images is also output at:
      • $output_folder/Images/Preprocessed/sample_name/sample_name-preprocessed_metadata.csv in long format.
      • $output_folder/Images/Preprocessed/sample_name-cp4-preprocessed_metadata.csv in CellProfiler4 compatible wide format.

The output of this process is located at: $output_folder/Images/Preprocessed/

This process can be skipped by setting the skip_preprocessing parameter to true.

B) Pixel-based analysis

The pixel-based approach implemented in SIMPLI enables the quantification of pixels which are positive for a specific marker or combination of markers. These marker-positive areas can be normalised over the area of the whole image, or over the area of an image mask defined by combining any of the input images with logical operators.

B.1) Measurement of marker-positive areas

This process measures the areas of interest and normalises them on the selected image masks according to the input metadata. The input images for this process can be derived from:

Inputs and parameters:

  • preprocessed_metadata_file with the tiff image metadata.
  • area_measurements_metadata Path to the area_measurements_metadata file, it has two columns:
    • marker = Marker or combination of markers whose area should be measured.
    • main_marker = Marker or combination of markers whose area should be used to normalise the area of marker. If main_marker is the same as marker then the whole area of the image is used for normalisation.

The marker and main_marker values should each be either a value from the label column of the preprocessed_metadata_file or a combination of those values with logical operators (AND = &, OR = |, NOT = !, () = round brackets).
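How such a combination of markers can be turned into a single mask is sketched below. This is only an illustration under stated assumptions: the function combine_masks, the marker names CD3 and CD8 and the dictionary of boolean masks are hypothetical, and marker labels are assumed to consist of letters, digits and underscores.

```python
import re

import numpy as np

# One token is either a marker label or one of the operators & | ! ( ).
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|!()])")
OPERATORS = {"&", "|", "(", ")"}


def combine_masks(expression, masks):
    """Evaluate a marker expression such as 'CD3 & !CD8' on boolean masks.

    masks maps each marker label to a boolean array of positive pixels
    (hypothetical input structure used only for this sketch).
    """
    expression = expression.strip()
    position = 0
    parts = []
    while position < len(expression):
        match = TOKEN.match(expression, position)
        if match is None:
            raise ValueError(f"Unexpected character at position {position}")
        token = match.group(1)
        position = match.end()
        if token == "!":
            parts.append("~")  # logical NOT on boolean arrays
        elif token in OPERATORS:
            parts.append(token)
        else:
            if token not in masks:
                raise KeyError(f"Unknown marker label: {token}")
            parts.append(f"masks[{token!r}]")
    # Only validated tokens reach eval, and no builtins are exposed.
    return eval(" ".join(parts), {"__builtins__": {}}, {"masks": masks})


masks = {
    "CD3": np.array([[True, True], [False, False]]),
    "CD8": np.array([[True, False], [True, False]]),
}
cd3_not_cd8 = combine_masks("CD3 & !CD8", masks)
```

In Python, ~ binds more tightly than &, which binds more tightly than |, so the operator precedence matches the usual NOT, AND, OR order.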

Outputs: The area measurements are saved in $output_folder/area_measurements.csv. The file has the following columns:

  • sample_name = Sample name.
  • main_marker = Combination of markers used to normalize the marker area.
  • marker = Main combination of markers measured.
  • area = Area positive for the marker combination of markers.
  • main_marker_area = Area positive for the main_marker combination of markers.
  • total_ROI_area = Total image area for this sample.
  • percentage = Area of the marker (area) / area of the main marker (main_marker_area) * 100.

All areas are in pixel^2.
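The columns above can be reproduced for one sample with a short sketch. It follows the percentage formula literally and assumes that a positive pixel is any nonzero pixel; the function name measure_area and the dictionary of masks are hypothetical.

```python
import numpy as np


def measure_area(masks, marker, main_marker):
    """Compute the area columns for one marker / main_marker pair.

    masks maps marker labels to arrays in which nonzero pixels are positive
    (an assumption made for this sketch).
    """
    marker_mask = np.asarray(masks[marker]) != 0
    total_roi_area = int(marker_mask.size)
    area = int(np.count_nonzero(marker_mask))
    if main_marker == marker:
        # Same marker on both sides: normalise on the whole image.
        main_marker_area = total_roi_area
    else:
        main_marker_area = int(np.count_nonzero(masks[main_marker]))
    if main_marker_area:
        percentage = area / main_marker_area * 100
    else:
        percentage = float("nan")
    return {
        "marker": marker,
        "main_marker": main_marker,
        "area": area,
        "main_marker_area": main_marker_area,
        "total_ROI_area": total_roi_area,
        "percentage": percentage,
    }


toy_masks = {
    "CD3": np.array([[1, 1, 0, 0, 0], [1, 1, 0, 0, 0]]),
    "DNA": np.array([[1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]),
}
row = measure_area(toy_masks, "CD3", "DNA")
```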

This process can be skipped by setting the skip_area parameter to true.

B.2) Pixel-based analysis visualisation

Generate boxplots showing the comparisons of the distributions of normalised marker-positive areas between 2 categories of samples. The input data for this process can be derived from:

Inputs and parameters:

  • sample_metadata_file with the metadata of all samples used in the analysis.
  • area_measurements_file Path to the area_measurements_file it should have the following columns:
    • sample_name = Sample name, should match a value in the sample_metadata_file.
    • main_marker = Marker or combination of markers used for normalisation.
    • marker = Marker or combination of markers used to calculate the area.
    • percentage = Area of the marker / area of the main marker * 100.

FDR is calculated using the number of different marker values for each value of main_marker.

Outputs: The boxplots are saved in $output_folder/Plots/Area_Plots/Boxplots/, where a separate folder is created for each main_marker. For each main_marker a pdf file ($output_folder/Plots/Area_Plots/Boxplots/main_marker/main_marker_area_boxplots.pdf) is produced, containing a boxplot for each value of marker associated with that main_marker.

The output of this process is located at: $output_folder/Plots/Area_Plots/Boxplots/

This process can be skipped by setting the skip_area_visualization parameter to true.

C) Cell-based analysis

The cell-based analysis aims to investigate the qualitative and quantitative cell representation within the imaged tissue through (1) cell segmentation, (2) cell phenotyping by unsupervised clustering or expression thresholding and (3) spatial analysis of cell densities (homotypic spatial analysis) and distances (heterotypic spatial analysis). The steps of the cell-based analysis are:

C.1A) Cell segmentation with CellProfiler

Generate single-cell data in .csv format and the cell masks in tiff format. The input data for this process can be derived from:

Inputs and parameters:

Outputs:

The output of this process is located at: $output_folder/CellProfiler4_Segmentation/

  • Single cell data:

    • Single cell data for all samples: $output_folder/CellProfiler4_Segmentation/CellProfiler4-unannotated_cells.csv
    • Single cell data for each sample separately: $output_folder/CellProfiler4_Segmentation/sample_name/sample_name-CellProfiler4-Cells.csv

    The single-cell data is a .csv table with a row for each cell and the following annotations:

    • ImageNumber: CellProfiler4 specific image identifier.
    • ObjectNumber: Unique identity number from 1 to 2^16-1, matches the corresponding pixels in the cell masks.
    • Metadata_sample_name: Matching the sample_name values in the preprocessed_metadata_file.
    • Location_Center_X and Location_Center_Y: Location of the cell centroid in the image in pixels, used for both the homotypic and heterotypic spatial analyses.
    • CellProfiler4 marker intensity measurements: Used for cell phenotyping by Unsupervised clustering or by Expression thresholding

    The exact set of fields and their order depends on the CellProfiler4 pipeline used in the analysis.

  • Cell masks:
    Cell masks in uint16 tiff format: $output_folder/CellProfiler4_Segmentation/sample_name/sample_name-CellProfiler4-Cell_Mask.tiff. Each cell is associated with a unique identity number from 1 to 2^16-1. All the pixels belonging to a given cell have their value set to its identity number. Pixels not belonging to any cell are set to 0.
    These images are compatible with several other tools for downstream analysis including:

To use the cells identified with this process in the downstream steps:

  • cell_source = "CellProfiler" (not required if only one of the two segmentation tools is used).

This process can be skipped by setting the skip_cp_segmentation parameter to true.
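The cell mask convention described above (pixel value = ObjectNumber, 0 = background) makes it straightforward to relate mask pixels to rows of the single-cell table. The sketch below derives the area and centroid of each cell from a labelled mask with NumPy. The function name is hypothetical and the centroid is the plain mean of zero-based pixel coordinates, which may differ from the convention used by CellProfiler4.

```python
import numpy as np


def mask_to_cell_table(cell_mask):
    """Return one record per cell found in a labelled uint16 mask."""
    mask = np.asarray(cell_mask)
    object_numbers = np.unique(mask)
    object_numbers = object_numbers[object_numbers != 0]  # 0 is background
    table = []
    for object_number in object_numbers:
        ys, xs = np.nonzero(mask == object_number)
        table.append(
            {
                "ObjectNumber": int(object_number),
                "Area": int(ys.size),
                "Location_Center_X": float(xs.mean()),
                "Location_Center_Y": float(ys.mean()),
            }
        )
    return table


toy_mask = np.array(
    [
        [0, 1, 1],
        [0, 1, 1],
        [2, 0, 0],
    ],
    dtype=np.uint16,
)
cells = mask_to_cell_table(toy_mask)
```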

C.1B) Cell segmentation with StarDist

Generate single-cell data in .csv format and the cell masks in tiff format. The input data for this process can be derived from:

For more details on segmentation with StarDist please refer to the following pages:

For more details on StarDist Segmentation in SIMPLI please refer to this page.

Inputs and parameters:

  • preprocessed_metadata_file with the tiff image metadata.
  • sd_labels_to_segment = markers to include in the image on which the segmentation is performed, must match the number of dimensions in the model. (comma separated list)
  • sd_model_name = model to use for the segmentation (name of default model or a pretrained one)
  • sd_model_path = path to the model or "default" for default models
  • sd_prob_thresh = probability threshold used for calling cells: 0 < value < 1 or "default" to use the default value saved in the model.
  • sd_nms_thresh = overlap threshold above which Non-Maximum Suppression is performed: 0 < value < 1 or "default" to use the default value saved in the model.

Outputs: The output of this process is located at: $output_folder/StarDist_Segmentation/

  • Single cell data:

    • Single cell data for all samples: $output_folder/StarDist_Segmentation/StarDist-unannotated_cells.csv
    • Single cell data for each sample separately: $output_folder/StarDist_Segmentation/sample_name/sample_name-StarDist-Cells.csv

    The single-cell data is a .csv table with a row for each cell and the following annotations:

    • ObjectNumber: Unique identity number from 1 to 2^16-1, matches the corresponding pixels in the cell masks.
    • Metadata_sample_name: Matching the sample_name values in the preprocessed_metadata_file.
    • Location_Center_X and Location_Center_Y: Location of the cell centroid in the image in pixels, used for both the homotypic and heterotypic spatial analyses.
    • marker intensity measurements: minimum, maximum and mean
    • spatial features computed with skimage.measure.regionprops from the skimage library.
  • Cell masks:
    Cell masks in uint16 tiff format: $output_folder/StarDist_Segmentation/sample_name/sample_name-StarDist-Cell_Mask.tiff. Each cell is associated with a unique identity number from 1 to 2^16-1. All the pixels belonging to a given cell have their value set to its identity number. Pixels not belonging to any cell are set to 0.
    These images are compatible with several other tools for downstream analysis including:

To use the cells identified with this process in the downstream steps:

  • cell_source = "StarDist" (not required if only one of the two segmentation tools is used).

This process can be skipped by setting the skip_sd_segmentation parameter to true.

C.2) Cell masking

This process identifies cells belonging to different populations or tissue compartments according to the overlap of their areas with those of specific masks.

The input images for this process can be derived from:

The input cell masks for this process can be derived from:

  • cell masks generated in the cell segmentation process.
  • cell masks specified by the user with the single_cell_masks_metadata file if the cell segmentation process is skipped.

The input cell data for this process can be derived from:

  • cell data generated in the cell segmentation process.
  • cell data specified by the user with the preprocessed_metadata_file file if the cell segmentation process is skipped.

Inputs and parameters:

  • preprocessed_metadata_file with the tiff image metadata.
  • single_cell_masks_metadata with the following columns:
    • sample_name = Sample name matching a value in the preprocessed_metadata_file file
    • label = "Cell_Mask"
    • file_name = path to a cell mask in uint16 tiff format
  • single_cell_data_file = A .csv file with the following columns:
    • Metadata_sample_name = Sample name matching a value in the preprocessed_metadata_file file
    • ObjectNumber = Unique number identifying the pixels belonging to the cell in the cell mask.
  • cell_masking_metadata = A .csv file indicating which masks to use and which thresholds of overlap to apply, it should have the following columns:
    • cell_type = name of the cell type being identified.
    • threshold_marker = marker to use as mask. It should match a value in the label column of the preprocessed_metadata_file. It can be a combination of markers specified with logical operators (AND = &, OR = |, NOT = !, () = round brackets).
    • threshold_value = 1 - fraction of area overlap between the cell and the mask. Cells whose area overlaps the mask by a fraction higher than threshold_value are considered positive.

If a cell is positive for more than one cell type, then it is assigned to the cell type defined first (by row order) in the cell_masking_metadata file. Cells negative for all cell_types are marked as UNASSIGNED.
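The assignment rule can be sketched as follows. This is a simplified illustration, not the SIMPLI implementation: the function name and the toy masks are hypothetical, and a cell is assumed to be positive when the fraction of its pixels inside the mask is strictly greater than the threshold given for that cell type.

```python
import numpy as np


def assign_cell_types(cell_mask, rules):
    """Assign each cell to the first cell type whose mask it overlaps enough.

    rules is an ordered list of (cell_type, marker_mask, threshold) tuples,
    in the same order as the rows of the metadata file (hypothetical layout).
    """
    cell_mask = np.asarray(cell_mask)
    object_numbers = np.unique(cell_mask)
    object_numbers = object_numbers[object_numbers != 0]
    assignments = {}
    for object_number in object_numbers:
        cell_pixels = cell_mask == object_number
        cell_area = np.count_nonzero(cell_pixels)
        cell_type = "UNASSIGNED"
        for name, marker_mask, threshold in rules:
            overlap = np.count_nonzero(np.asarray(marker_mask)[cell_pixels]) / cell_area
            if overlap > threshold:
                cell_type = name
                break  # the first matching row wins
        assignments[int(object_number)] = cell_type
    return assignments


cell_mask = np.array([[1, 1, 2, 2], [1, 1, 3, 3]], dtype=np.uint16)
tumour = np.array([[1, 1, 1, 0], [1, 0, 0, 0]], dtype=bool)
stroma = np.array([[0, 0, 1, 1], [0, 0, 0, 0]], dtype=bool)
cell_types = assign_cell_types(
    cell_mask, [("Tumour", tumour, 0.5), ("Stroma", stroma, 0.5)]
)
```

In this toy example cell 1 overlaps the first mask by 0.75 and is assigned to it, cell 2 overlaps the first mask by exactly 0.5 (not enough) and the second by 1.0, and cell 3 overlaps neither.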

Outputs:
The annotated cell table is a .csv table with the same columns as the input table plus the following annotations:

  • cell_type: Name used to identify the cell type during the analysis.
  • CellName: Unique Cell identity string in the form: Metadata_sample_name_ObjectNumber
    The cell type level table is saved at: $output_folder/annotated_cells.csv

This process can be skipped by setting the skip_cell_type_identification parameter to true.

C.3) Cell masking visualisation

This process plots the results of the cell masking process. The input cell masks for this process can be derived from:

  • cell masks generated in the cell segmentation process.
  • cell masks specified by the user with the single_cell_masks_metadata file if the cell segmentation process is skipped.

The input cell data for this process can be derived from:

  • cell data generated in the cell segmentation process.
  • cell data specified by the user with the preprocessed_metadata_file file if the cell segmentation process is skipped.

Inputs and parameters:

  • sample_metadata_file with the metadata of all samples used in the analysis.
  • single_cell_masks_metadata with the following columns:
    • sample_name = Sample name matching a value in the sample_metadata_file file
    • label = "Cell_Mask"
    • file_name = path to a cell mask in uint16 tiff format
  • annotated_cell_data_file = A .csv file with the following columns:
    • Metadata_sample_name = Sample name matching a value in the sample_metadata_file file
    • cell_type = name of the cell type being identified.
  • cell_masking_metadata = A .csv file indicating which masks to use and which thresholds of overlap to apply, it should have the following columns:
    • cell_type = name of the cell type being identified.
    • threshold_marker = marker to use as mask. It should match a value in the label column of the preprocessed_metadata_file. It can be a combination of markers specified with logical operators (AND = &, OR = |, NOT = !, () = round brackets).
    • threshold_value = 1 - fraction of area overlap between the cell and the mask. Cells whose area overlaps the mask by a fraction higher than threshold_value are considered positive.
    • color = Color used to represent this cell type. Accepted values are color names or hexadecimal #RGB or #RGBA format ("#RRGGBB" or "#RRGGBBAA"). Cells of cell_type = "UNASSIGNED" are automatically assigned the color "#888888".

Outputs:
The cell type level plots are saved in $output_folder/Plots/Cell_Type_Plots/ and they are divided in:

  • Barplots: $output_folder/Plots/Cell_Type_Plots/Barplots .pdf files with barplots with the proportions of all cell types + unassigned cells in:

    • Each sample: one bar per sample.
    • Category (optional): one bar per category, if the comparison column in the sample_metadata_file file contains 2 categories. The barplots are divided into the following .pdf files:
      • dodged_barplots.pdf = dodged barplots including "UNASSIGNED" cells.
      • dodged_assigned_ony_barplots.pdf = dodged barplots excluding "UNASSIGNED" cells.
      • stacked_barplots.pdf = stacked barplots including "UNASSIGNED" cells.
      • stacked_assigned_only_barplots.pdf = stacked barplots excluding "UNASSIGNED" cells.
  • Overlays: $output_folder/Plots/Cell_Type_Plots/Overlays/

    • One overlay-sample_name.tiff image per sample. Each cell is coloured by cell type according to the color specified in the cell types metadata file
    • overlay_legend.pdf: legend mapping each cell type to its color.
  • Boxplots (Optional): $output_folder/Plots/Cell_Type_Plots/Boxplots/
    If the comparison column in the sample_metadata_file file contains 2 categories, two pdf files are produced, each with a boxplot for each cell type:

    • boxplots.pdf = boxplots including "UNASSIGNED" cells.
    • assigned_ony_boxplots.pdf = boxplots excluding "UNASSIGNED" cells.
      The FDR is calculated with the Benjamini-Hochberg procedure.
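For reference, the Benjamini-Hochberg adjustment can be written in a few lines of NumPy. This is a generic sketch of the procedure, not code taken from SIMPLI, and the p-values in the example are made up.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


# Made-up p-values, one per cell type.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The result should be equivalent to p.adjust(p, method = "BH") in R.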

This process can be skipped by setting the skip_type_visualization parameter to true.

C.4A.1) Unsupervised clustering

This process performs unsupervised clustering on cells from one or more sets of cells. The input cell data for this process can be derived from:

  • cell data annotated in the cell masking process.
  • cell data specified by the user with the annotated_cell_data_file file if the cell masking process is skipped.

Inputs and parameters:

  • sample_metadata_file with the metadata of all samples used in the analysis. If the value of the comparison column for the sample is "NA", all cells from the sample are excluded from the clustering.
  • annotated_cell_data_file = A .csv file with the following columns:
    • Metadata_sample_name = Sample name matching a value in the sample_metadata_file file.
    • cell_type = name of the cell type being identified.
    • ObjectNumber = number identifying a cell. Needs to be unique within each sample.
    • Columns with the expression values of the markers used for clustering, the names should match the values in the clustering_markers column in the cell_clustering_metadata file.
  • cell_clustering_metadata metadata file with the parameters for the cell phenotyping by unsupervised clustering. It contains the following columns:
    • cell_type = name of the cell type to use for phenotyping. Set to "NA" to use all cells in the sample.
    • clustering_markers = @ separated list of markers to use for clustering. The markers must match a column name from the annotated_cell_data_file.
    • clustering_resolutions = @ separated list of resolutions used to extract the clusters from the graph, use a value above (below) 1.0 if you want to obtain a larger (smaller) number of clusters.

See the original Seurat function for details.
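The "@" separated columns of the cell_clustering_metadata file can be split into lists as in the sketch below. The column names come from the description above; the function name, the marker names and the example content are hypothetical.

```python
import csv
import io


def read_clustering_metadata(handle):
    """Parse cell_clustering_metadata rows, splitting '@' separated fields."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append(
            {
                "cell_type": row["cell_type"],
                "clustering_markers": row["clustering_markers"].split("@"),
                "clustering_resolutions": [
                    float(value)
                    for value in row["clustering_resolutions"].split("@")
                ],
            }
        )
    return rows


# Made-up file content with one row per cell type to cluster.
example = io.StringIO(
    "cell_type,clustering_markers,clustering_resolutions\n"
    "T_cells,CD3@CD4@CD8,0.5@1.0\n"
    "NA,CD45@CD3,1.5\n"
)
clustering_metadata = read_clustering_metadata(example)
```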

Outputs:
The output files divided by cell type are saved in separate subfolders named after the cell type at: $output_folder/Cell_Clusters/CELLTYPE. For each clustered cell type this step outputs:

  • Cell cluster table: CELLTYPE-clusters.csv with the following columns:
    • CellName: Cell identity string in the form: Metadata_sample_name_ObjectNumber
    • Metadata_sample_name: sample name as in the sample_metadata_file file.
    • Clustering resolution columns: res-RESOLUTION-ids for each clustered cell type. Clusters are numbered from 0, the same numbering is used in the plots.
    • ObjectNumber: Unique identity number from 1 to 2^16-1, matches the corresponding pixels in the cell masks.
    • Marker intensity measurements.
    • cell_type: Name used to identify the clustered cell type during the analysis.
  • Cell cluster RData: CELLTYPE-clusters.RData, the Seurat 2.3.0 object. See the original Seurat page (https://github.com/satijalab/seurat/blob/v2.3.0/R/seurat.R) for details. This can be converted to a Seurat object compatible with the latest Seurat version with the UpdateAssay function.

A collected clustered cells table is saved at: $output_folder/clustered_cells.csv. This file is a .csv table with a row for each cell in the cell types that underwent clustering and the following annotations:

  • comparison: Name of the cell cluster table divided by cell type containing the cell.
  • CellName: Cell identity string in the form: Metadata_sample_name_ObjectNumber
  • Metadata_sample_name: sample name as in the sample_metadata_file file.
  • Clustering resolution columns: res-RESOLUTION-ids for each clustered cell type. Clusters are numbered from 0, the same numbering is used in the plots.
  • ObjectNumber: Unique identity number from 1 to 2^16-1, matches the corresponding pixels in the cell masks.
  • Marker intensity measurements.
  • cell_type: Name used to identify the cell type during the analysis.

This process can be skipped by setting the skip_cell_clustering parameter to true.

C.4A.2) Unsupervised clustering visualization

This process plots the results of the unsupervised clustering process. The input annotated cell data for this process can be derived from:

Inputs and parameters:

  • sample_metadata_file with the metadata of all samples used in the analysis. If the value of the comparison column for the sample is "NA", all cells from the sample are excluded from the clustering.
  • clustered_cell_data_file = A .csv file with the following columns:
    • Metadata_sample_name = Sample name matching a value in the sample_metadata_file file.
    • cell_type = name of the cell type being identified.
    • ObjectNumber = number identifying a cell. Needs to be unique within each sample.
    • Columns with the expression values of the markers used for clustering, the names should match the values in the clustering_markers column in the cell_clustering_metadata file.
    • Columns with the cluster annotation for each cell. The column names should match this format: res_RESOLUTION_ids where RESOLUTION matches one of the values of the resolution column in:
  • cell_clustering_metadata metadata file with the parameters for the cell phenotyping by unsupervised clustering. It contains the following columns:
    • cell_type = name of the cell type to use for phenotyping. Set to "NA" to use all cells in the sample.
    • clustering_markers = @ separated list of markers to use for clustering. The markers must match a column name from the annotated_cell_data_file.
    • clustering_resolutions = @ separated list of resolutions used to extract the clusters from the graph, use a value above (below) 1.0 if you want to obtain a larger (smaller) number of clusters.
  • high_color = Color for the max expression value in the heatmap or UMAP, defaults to "'#FF0000'"
  • mid_color = Color for the midpoint of the expression value in the heatmap or UMAP, defaults to "'#FFFFFF'"
  • low_color = Color for the minimum expression value in the heatmap or UMAP, defaults to "'#0000FF'"

Accepted values are color names or hexadecimal #RGB or #RGBA format ("#RRGGBB" or "#RRGGBBAA").

Outputs:
The plots illustrating the results of the unsupervised clustering are saved in $output_folder/Plots/Cell_Cluster_Plots/ and they are divided in:

  • UMAPs: $example_output/Plots/Cell_Cluster_Plots/CELL_TYPE/UMAPs/
    For each clustering resolution a .pdf file with UMAP plots colored by:

  • Boxplots (Optional): $output_folder/Plots/Cell_Cluster_Plots/Cluster_Comparisons/
    If the comparison metadata column of the sample_metadata_file has exactly 2 (non "NA") categories, a .pdf file is produced for each level of resolution. The file contains:
      • Heatmap: showing for each cluster the expression of the markers used for the clustering.
      • Boxplots: one for each cluster, with the percentage of cells belonging to that cluster out of the total cells in the clustered cell type. The FDR is calculated using the Benjamini-Hochberg procedure for all clusters.

  • Heatmaps (Optional): If the comparison metadata column does not have exactly 2 (non "NA") categories, a .pdf file is produced for each level of resolution, containing a heatmap showing for each cluster the expression of the markers used for the clustering.

This process can be skipped by setting the skip_cluster_visualization parameter to true.

C.4B.1) Expression thresholding

This process phenotypes cells from one or more sets of cells by expression thresholding. The input cell data for this process can be derived from:

  • cell data annotated in the cell masking process.
  • cell data specified by the user with the annotated_cell_data_file file if the cell masking process is skipped.

Inputs and parameters:

  • cell_thresholding_metadata Metadata file defining thresholds and phenotypes:
    • cell_type = name of the cell type to use for phenotyping. Set to "NA" to use all cells in the sample.
    • phenotype_name = name of the phenotype to use for cells passing the expression thresholding.
    • threshold_expression = Definition of the threshold or thresholds to apply. It should be written as marker_expression_column comparison operator (accepted operators are >, <, >=, <=, ==, !=) value. Marker intensities can be combined with arithmetic operators (+, -, *, /, ...). Expressions for different thresholds can be combined with logical operators (&, |, !). If a cell passes more than one threshold_expression for the same cell_type then it is annotated with the value of phenotype_name corresponding to the last passed threshold_expression in the cell_thresholding_metadata file.
  • annotated_cell_data_file = A .csv file with the following columns:
    • Metadata_sample_name = Sample name matching a value in the sample_metadata_file file.
    • cell_type = name of the cell type being identified.
    • ObjectNumber = number identifying a cell. Needs to be unique within each sample.
    • Columns with the expression values of the markers used for clustering, the names should match the values in the threshold_expression column in the cell_thresholding_metadata file.

Outputs:
The thresholded cell table is a .csv table with the same columns as the input table plus the following annotations:

  • CellType_Thresholded: columns, one for each value of the cell_type column in the cell_thresholding_metadata file. Its value can be one of:
    • NA = This cell was not annotated because it has a different cell_type from the one being phenotyped.
    • cell_phenotype = One of the values of the phenotype_name column. If a cell passes more than one threshold_expression for the same cell_type then it is annotated with the value of phenotype_name corresponding to the last passed threshold_expression in the cell_thresholding_metadata file.
    • UNASSIGNED = If the cell does not pass any threshold_expression for that cell_type.
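The "last passed expression wins" rule can be illustrated with a small sketch that evaluates the expressions one cell at a time. It is not the SIMPLI implementation: the function names, marker columns and thresholds are hypothetical, and marker column names are assumed to be valid Python identifiers. The logical operators are rewritten as and, or and not so that comparisons are evaluated before the logical operators.

```python
import re


def to_python(threshold_expression):
    """Rewrite the logical operators &, | and ! as Python and, or and not."""
    expression = re.sub(r"!(?!=)", " not ", threshold_expression)  # keep !=
    expression = expression.replace("&", " and ").replace("|", " or ")
    return expression.strip()


def phenotype_cell(cell, rules):
    """Return the phenotype_name of the last threshold_expression passed.

    cell maps marker column names to values; rules is an ordered list of
    (phenotype_name, threshold_expression) pairs for one cell_type.
    """
    phenotype = "UNASSIGNED"
    for phenotype_name, threshold_expression in rules:
        # No builtins are exposed; only the cell's own columns are visible.
        if eval(to_python(threshold_expression), {"__builtins__": {}}, dict(cell)):
            phenotype = phenotype_name  # later rows overwrite earlier ones
    return phenotype


# Made-up rules and marker columns for a single cell_type.
rules = [
    ("CD4_T", "CD3 > 0.1 & CD4 > 0.2"),
    ("Treg", "CD3 > 0.1 & CD4 > 0.2 & FOXP3 >= 0.5"),
    ("Not_T", "!(CD3 > 0.1)"),
]
treg = phenotype_cell({"CD3": 0.5, "CD4": 0.3, "FOXP3": 0.6}, rules)
```

The first example cell passes both the CD4_T and the Treg expressions, so it receives the phenotype defined last, Treg.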

The thresholded cell table is saved at: $output_folder/thresholded_cells.csv

This process can be skipped by setting the skip_cell_thresholding parameter to true.

C.4B.2) Expression thresholding visualisation

This process plots the results of the cell phenotyping by expression thresholding process. The input annotated cell data for this process can be derived from:

The input cell masks for this process can be derived from:

  • cell masks generated in the cell segmentation process.
  • cell masks specified by the user with the single_cell_masks_metadata file if the cell segmentation process is skipped.

Inputs and parameters:

  • sample_metadata_file with the metadata of all samples used in the analysis. If the value of the comparison column for the sample is "NA", then the sample is excluded from the plotting.
  • cell_thresholding_metadata Metadata file defining thresholds and phenotypes:
    • cell_type = name of the cell type to use for phenotyping. Set to "NA" to use all cells in the sample.
    • phenotype_name = name of the phenotype to use for cells passing the expression thresholding.
    • threshold_expression = Definition of the threshold or thresholds to apply. It should be written as marker_expression_column comparison operator (accepted operators are >, <, >=, <=, ==, !=) value. Marker intensities can be combined with arithmetic operators (+, -, *, /, ...). Expressions for different thresholds can be combined with logical operators (&, |, !). If a cell passes more than one threshold_expression for the same cell_type then it is annotated with the value of phenotype_name corresponding to the last passed threshold_expression in the cell_thresholding_metadata file.
    • color = Color used to represent the cell phenotype in barplots, and density plots.
    • plotting_markers = @ separated list of markers to include in the heatmap for this cell_type. They must match the names of the marker_expression_column in the thresholded_cell_data_file.
  • thresholded_cell_data_file = A .csv file with the following columns:
    • Metadata_sample_name = Sample name matching a value in the sample_metadata_file file.
    • cell_type = name of the cell type being identified.
    • ObjectNumber = number identifying a cell. Needs to be unique within each sample.
    • CellType_Thresholded columns: one for each value of the cell_type column in the cell_thresholding_metadata file.
    • marker_expression_column columns: columns with the expression values of the markers used for clustering, the names should match the values in the threshold_expression column in the cell_thresholding_metadata file.
  • single_cell_masks_metadata with the following columns:
    • sample_name = Sample name matching a value in the sample_metadata_file file
    • label = "Cell_Mask"
    • file_name = path to a cell mask in uint16 tiff format
  • high_color = Color for the max expression value in the heatmap or UMAP defaults to "'#FF0000'"
  • mid_color = Color for the midpoint of the expression value in the heatmap or UMAP defaults to "'#FFFFFF'"
  • low_color = Color for the minimum expression value in the heatmap or UMAP defaults to "'#0000FF'"

Accepted values are color names or hexadecimal #RGB or #RGBA format ("#RRGGBB" or "#RRGGBBAA").

Outputs:
The output plots are saved at: $output_folder/Plots/Cell_Threshold_Plots/

  • Barplots: $output_folder/Plots/Cell_Type_Plots/Barplots
    .pdf files (one for each cell_type) with barplots showing the proportions of all phenotype_name values plus unassigned cells in:
    • Each sample: one bar per sample.
    • Category (optional): one bar per category, if the comparison column in the sample_metadata_file contains 2 categories.
  • Overlays: $output_folder/Plots/Cell_Type_Plots/Overlays/
    • One cell_type-overlay-sample_name.tiff image for each cell_type for each sample. Each cell is coloured by phenotype_name according to the color specified in the color column of the cell_thresholding_metadata file.
    • overlay_legend.pdf: legend mapping each cell type to its color.
  • Boxplots: $output_folder/Plots/Cell_Threshold_Plots/Boxplots/
    If the comparison metadata column of the sample_metadata_file has exactly 2 (non "NA") categories, a .pdf file is produced for each cell_type in the cell_thresholding_metadata file. The file contains one boxplot for each phenotype_name, showing the percentage of cells belonging to that phenotype_name out of the total cells in the cell phenotype. The FDR is calculated using the Benjamini-Hochberg procedure for all cell phenotypes.
  • Density Plots: Plots/Cell_Threshold_Plots/Density_Plots/
    Density plots showing the distribution of cells of the cell_type according to the expression of all the markers in the threshold_expression of each phenotype_name. The number of cells and the expression values are represented on a log scale.
  • Heatmaps: $output_folder/Plots/Cell_Threshold_Plots/Heatmaps
    For each cell_type a .pdf file is produced containing a heatmap showing, for each phenotype_name, the expression of the markers specified in the plotting_markers column of the cell_thresholding_metadata file.
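
As a reminder of what the Benjamini-Hochberg procedure mentioned for the boxplots does, here is a minimal Python sketch of the adjustment. It is a generic illustration, not SIMPLI's code, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, one per phenotype comparison.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Each p-value is scaled by the number of tests divided by its rank, then capped by the adjusted value of the next larger p-value.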

This process can be skipped by setting the skip_thresholding_visualization parameter to true.

C.5A.1) Homotypic spatial analysis

This process identifies high-density aggregations of cells of a given cell type or phenotype using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm as implemented in the fpc R package. The input annotated cell data for this process can be derived from:

  • Annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
  • Cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
  • Cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.

The analyses can be performed on cell types and phenotypes from any combination of these three sources.

Inputs and parameters:

  • homotypic_interactions_metadata Metadata file with these columns:
    • cell_file: File to read the cell annotations from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column: Name of the column in the cell_file containing the annotation of the cell type or phenotype to cluster.
    • cell_type_to_cluster: Name of the cell type or phenotype to cluster, must match one of the values of the cell_type_column in the cell_file.
    • reachability_distance: eps argument of the dbscan function of the fpc R package. Reachability distance, see Ester et al. (1996).
    • min_cells: MinPts argument of the dbscan function of the fpc R package. Reachability minimum number of points, see Ester et al. (1996).
  • File/s with the cell data:
    One for each of the values of the cell_file column of the homotypic_interactions_metadata file. It must have these columns:
    • Metadata_sample_name: Sample name matching a value in the sample_metadata_file file.
    • CellName: Unique identifier for each cell.
    • Location_Center_X: X coordinate of the cell centroid in the image.
    • Location_Center_Y: Y coordinate of the cell centroid in the image.
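
The columns listed above for homotypic_interactions_metadata can be checked before launching a run. The Python sketch below is only an illustration: it assumes a comma-separated file, and the example row (cell_type, T_cells, 25, 5) is hypothetical.

```python
import csv
import io

ALLOWED_CELL_FILES = {"identification", "thresholding", "clustering"}
REQUIRED_COLUMNS = [
    "cell_file",
    "cell_type_column",
    "cell_type_to_cluster",
    "reachability_distance",
    "min_cells",
]


def read_homotypic_metadata(handle):
    """Read the metadata rows and check the columns described above."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        if row["cell_file"] not in ALLOWED_CELL_FILES:
            raise ValueError(f"unexpected cell_file value: {row['cell_file']}")
        # eps and MinPts are numeric arguments of the dbscan function.
        row["reachability_distance"] = float(row["reachability_distance"])
        row["min_cells"] = int(row["min_cells"])
        rows.append(row)
    return rows


# Hypothetical one-row metadata file, comma-separated by assumption.
example = io.StringIO(
    "cell_file,cell_type_column,cell_type_to_cluster,reachability_distance,min_cells\n"
    "thresholding,cell_type,T_cells,25,5\n"
)
metadata_rows = read_homotypic_metadata(example)
```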

Location_Center_X and Location_Center_Y can be obtained in several ways, for instance from the IdentifyPrimaryObjects module in CellProfiler4, as SIMPLI does, or with the computeFeatures function from the EBImage R package.
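
If neither tool is at hand, centroids can also be derived directly from a labelled cell mask. The Python sketch below assumes a mask given as a list of rows in which 0 is background and every other integer is a cell label; the 0-based pixel coordinate convention is also an assumption and may differ from the one used by CellProfiler4 or EBImage.

```python
def mask_centroids(mask):
    """Map each non-zero label of a 2D label mask to its (x, y) centroid."""
    sums = {}
    for y, row in enumerate(mask):
        for x, label in enumerate(row):
            if label == 0:
                continue  # 0 is assumed to be background
            sx, sy, n = sums.get(label, (0, 0, 0))
            sums[label] = (sx + x, sy + y, n + 1)
    # Centroid = mean pixel coordinate of each label.
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


# Tiny hypothetical mask with two cells (labels 1 and 2).
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
]
centroids = mask_centroids(mask)
```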

Outputs:
The output of this process is stored at: $output_folder/Homotypic_interactions

  • Files for individual cell_types are stored at: $output_folder/Homotypic_interactions/cell_type/cell_type-homotypic_clusters.csv
  • A total file collecting the annotations for all cell_types is stored at: $output_folder/Homotypic_interactions/homotypic_interactions.csv

These files contain the following columns:

  • CellName = Unique identifier for each cell.
  • Metadata_sample_name = Sample name matching a value in the sample_metadata_file file.
  • Location_Center_X = X coordinate used for the DBSCAN clustering.
  • Location_Center_Y = Y coordinate used for the DBSCAN clustering.
  • spatial_analysis_cell_type = Contains the cell types or phenotypes that were annotated in the cell_type_column in the cell_file.
  • cluster = Column indicating cluster membership, with noise observations (singletons) coded as 0.
  • isseed = Column indicating whether a point is a seed (not border, not noise).

See the fpc::dbscan documentation for details.
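
To make the roles of reachability_distance (eps), min_cells (MinPts), cluster and isseed concrete, here is a small pure-Python DBSCAN sketch. It is not the fpc implementation that SIMPLI calls: it uses a brute-force neighbour search, and it assumes that a cell counts towards its own neighbourhood when compared against min_cells.

```python
from math import hypot


def dbscan(points, eps, min_pts):
    """Return (cluster, isseed) lists; cluster 0 marks noise."""
    n = len(points)
    # Neighbourhood of each point within eps, including the point itself.
    neighbours = [
        [j for j in range(n)
         if hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]) <= eps]
        for i in range(n)
    ]
    isseed = [len(nb) >= min_pts for nb in neighbours]
    cluster = [0] * n
    current = 0
    for i in range(n):
        if not isseed[i] or cluster[i] != 0:
            continue
        # Start a new cluster from an unassigned seed and grow it.
        current += 1
        cluster[i] = current
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if cluster[q] == 0:
                    cluster[q] = current
                    if isseed[q]:
                        stack.append(q)  # only seeds expand the cluster
    return cluster, isseed


# Hypothetical cell centroids (in pixels): two aggregates and one isolated cell.
cells = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (50, 50), (51, 50), (50, 51)]
cluster, isseed = dbscan(cells, eps=1.5, min_pts=3)
```

With these toy coordinates the isolated cell at (10, 10) stays in cluster 0, which corresponds to the cells drawn in black in the position maps.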

This process can be skipped by setting the skip_homotypic_interactions parameter to true.

C.5A.2) Homotypic spatial analysis visualisation

This process plots the results of the homotypic spatial analysis process.

The input cell masks for this process can be derived from:

  • cell masks generated in the cell segmentation process.
  • cell masks specified by the user with the single_cell_masks_metadata file if the cell segmentation process is skipped.

Inputs and parameters:

  • homotypic_interactions_metadata Metadata file with these columns:
    • cell_file: File to read the cell annotations from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column: Name of the column in the cell_file containing the annotation of the cell type or phenotype to cluster.
    • cell_type_to_cluster: Name of the cell type or phenotype to cluster, must match one of the values of the cell_type_column in the cell_file.
    • reachability_distance: eps argument of the dbscan function of the fpc R package. Reachability distance, see Ester et al. (1996).
    • min_cells: MinPts argument of the dbscan function of the fpc R package. Reachability minimum number of points, see Ester et al. (1996).
    • color: color to use to represent the cell type / phenotype in the plots.
  • single_cell_masks_metadata with the following columns:
    • sample_name = Sample name matching a value in the sample_metadata_file file
    • label = "Cell_Mask"
    • file_name = path to a cell mask in uint16 tiff format

Outputs:
The output of this process is stored at: $output_folder/Plots/Homotypic_interactions_Plots

  • Position maps: map of the image with dots representing the position of the centroid of each cell. Cells are colored in:
    • black: cells not belonging to a DBSCAN cluster.
    • color from the homotypic_interactions_metadata: cells belonging to a DBSCAN cluster.
    One file is produced for each cell type / phenotype for each sample, named: $output_folder/Plots/Homotypic_Interaction_Plots/cell_type/cell_type-sample_name-homotypic.pdf.

This process can be skipped by setting the skip_homotypic_visualization parameter to true.

C.5B.1) Heterotypic spatial analysis

This process measures the distribution of the minimum distances between cells of two user-defined cell types or phenotypes. It measures the distances between all cells of the first cell type or phenotype and all cells of the second cell type or phenotype, and for each cell of the first cell type or phenotype it returns the minimum distance to a cell of the second cell type or phenotype. The input annotated cell data for this process can be derived from:

  • Annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
  • Cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
  • Cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.

The analyses can be performed on cell types and phenotypes from any combination of these three sources.

Inputs and parameters:

  • heterotypic_interactions_metadata metadata file with the following columns:
    • cell_file1: File to read the cell annotations for the first cell type or phenotype from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column1: Name of the column in the cell_file containing the annotation of the first cell type or phenotype.
    • cell_type1: Name of the cell type or phenotype to cluster, must match one of the values of the first cell_type_column in the cell_file.
    • cell_file2: File to read the cell annotations for the second cell type or phenotype from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column2: Name of the column in the cell_file containing the annotation of the second cell type or phenotype.
    • cell_type2: Name of the cell type or phenotype to cluster, must match one of the values of the second cell_type_column in the cell_file.
  • cell_files:
    One for each of the values of the cell_file1 and cell_file2 columns of the heterotypic_interactions_metadata file. It must have these columns:
    • cell_type_column: Name of the column in the cell_file containing the annotation of the cell type or phenotype to cluster.
    • Metadata_sample_name: Sample name matching a value in the sample_metadata_file file.
    • CellName: unique identifier for each cell.
    • Location_Center_X: X coordinate of the cell centroid in the image.
    • Location_Center_Y: Y coordinate of the cell centroid in the image.

Outputs:
The output of this process is saved at: $output_folder/Heterotypic_interactions/

  • Files for individual combinations of cell types or phenotypes are stored at: $output_folder/Heterotypic_interactions/cell_type1-cell_type2/cell_type1-cell_type2-distances.csv
  • A total file collecting the annotations for all cell_types is stored at: $output_folder/Heterotypic_interactions/heterotypic_interactions.csv

These files contain the following columns:

  • Metadata_sample_name: Sample name matching a value in the sample_metadata_file file.
  • CellName1: Unique identifier for each cell.
  • Location_Center_X1: X coordinate of the cell centroid in the image.
  • Location_Center_Y1: Y coordinate of the cell centroid in the image.
  • spatial_analysis_cell_type1: Cell type or phenotype of CellName1.
  • CellName2: Unique identifier for each cell.
  • Location_Center_X2: X coordinate of the cell centroid in the image.
  • Location_Center_Y2: Y coordinate of the cell centroid in the image.
  • spatial_analysis_cell_type2: Cell type or phenotype of CellName2.
  • distance = Euclidean distance between CellName1 (Location_Center_X1, Location_Center_Y1) and CellName2 (Location_Center_X2, Location_Center_Y2). The distance is measured in pixels.
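
The minimum-distance computation described for this process can be sketched in a few lines of Python. This is an illustration with hypothetical cell names and coordinates, not the SIMPLI code, and it assumes that both lists contain cells from the same sample.

```python
from math import hypot


def min_distances(cells1, cells2):
    """For each cell of the first type, find the closest cell of the second type.

    Each cell is a (CellName, Location_Center_X, Location_Center_Y) tuple.
    """
    rows = []
    for name1, x1, y1 in cells1:
        # Closest cell of the second type by Euclidean distance (in pixels).
        name2, x2, y2 = min(cells2, key=lambda c: hypot(c[1] - x1, c[2] - y1))
        rows.append({
            "CellName1": name1,
            "CellName2": name2,
            "distance": hypot(x2 - x1, y2 - y1),
        })
    return rows


# Hypothetical centroids for two cell types within one sample.
type1_cells = [("a", 0, 0), ("b", 10, 0)]
type2_cells = [("c", 3, 4), ("d", 10, 6)]
distances = min_distances(type1_cells, type2_cells)
```

The result has one row per cell of the first type, which matches the layout of the output columns listed above.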

This process can be skipped by setting the skip_heterotypic_interactions parameter to true.

C.5B.2) Heterotypic spatial analysis visualisation

This process plots the results of the heterotypic distance analysis performed by the heterotypic spatial analysis process.

Inputs and parameters:

  • sample_metadata_file with the metadata of all samples used in the analysis. If the value of the comparison column for the sample is "NA", then no plotting is performed for this sample.
  • heterotypic_interactions_metadata metadata file with the following columns:
    • cell_file1: File to read the cell annotations for the first cell type or phenotype from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column1: Name of the column in the cell_file containing the annotation of the first cell type or phenotype.
    • cell_type1: Name of the cell type or phenotype to cluster, must match one of the values of the first cell_type_column in the cell_file.
    • cell_file2: File to read the cell annotations for the second cell type or phenotype from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column2: Name of the column in the cell_file containing the annotation of the second cell type or phenotype.
    • cell_type2: Name of the cell type or phenotype to cluster, must match one of the values of the second cell_type_column in the cell_file.
  • heterotypic_interactions_file: file with the following columns:
    • Metadata_sample_name: Sample name matching a value in the sample_metadata_file file.
    • CellName1: Unique identifier for each cell.
    • Location_Center_X1: X coordinate of the cell centroid in the image.
    • Location_Center_Y1: Y coordinate of the cell centroid in the image.
    • spatial_analysis_cell_type1: Cell type or phenotype of CellName1.
    • CellName2: Unique identifier for each cell.
    • Location_Center_X2: X coordinate of the cell centroid in the image.
    • Location_Center_Y2: Y coordinate of the cell centroid in the image.
    • spatial_analysis_cell_type2: Cell type or phenotype of CellName2.
    • distance = Euclidean distance between CellName1 (Location_Center_X1, Location_Center_Y1) and CellName2 (Location_Center_X2, Location_Center_Y2). The distance is measured in pixels.

Outputs:
The outputs of this process are stored at: $output_folder/Plots/Heterotypic_Interaction_Plots/Distance.
For each spatial_analysis_cell_type1-spatial_analysis_cell_type2 pair there is a folder $output_folder/Plots/Heterotypic_Interaction_Plots/Distance/spatial_analysis_cell_type1-spatial_analysis_cell_type2/ with the following plots:

  • spatial_analysis_cell_type1-spatial_analysis_cell_type2-all-heterotypic.pdf: density plot with all the cells.
  • spatial_analysis_cell_type1-spatial_analysis_cell_type2-by_category-heterotypic.pdf (optional): density plot with the cells divided by sample category. Produced if the comparison metadata column of the sample_metadata_file has at least 2 (non "NA") categories.

This process can be skipped by setting the skip_heterotypic_visualization parameter to true.

C.5B.3) Heterotypic analysis permutation test

This process generates a random distribution of the minimum distances between cells of the populations or phenotypes selected by the user. The distribution is generated by randomly reshuffling the labels of each cell.

Inputs and parameters:

  • heterotypic_interactions_metadata metadata file with the following columns:
    • cell_file1: File to read the cell annotations for the first cell type or phenotype from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column1: Name of the column in the cell_file containing the annotation of the first cell type or phenotype.
    • cell_type1: Name of the cell type or phenotype to cluster, must match one of the values of the first cell_type_column in the cell_file.
    • cell_file2: File to read the cell annotations for the second cell type or phenotype from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column2: Name of the column in the cell_file containing the annotation of the second cell type or phenotype.
    • cell_type2: Name of the cell type or phenotype to cluster, must match one of the values of the second cell_type_column in the cell_file.
  • heterotypic_interactions_file: file with the following columns:
    • Metadata_sample_name: Sample name matching a value in the sample_metadata_file file.
    • CellName1: Unique identifier for each cell.
    • Location_Center_X1: X coordinate of the cell centroid in the image.
    • Location_Center_Y1: Y coordinate of the cell centroid in the image.
    • spatial_analysis_cell_type1: Cell type or phenotype of CellName1.
    • CellName2: Unique identifier for each cell.
    • Location_Center_X2: X coordinate of the cell centroid in the image.
    • Location_Center_Y2: Y coordinate of the cell centroid in the image.
    • spatial_analysis_cell_type2: Cell type or phenotype of CellName2.
    • distance = Euclidean distance between CellName1 (Location_Center_X1, Location_Center_Y1) and CellName2 (Location_Center_X2, Location_Center_Y2). The distance is measured in pixels.
  • permutations = Number of permutations to perform (values > 10000 are recommended).

Outputs:
The output of this process is saved at: $output_folder/Heterotypic_interactions/
A total file collecting the annotations for all cell_types is stored at: $output_folder/Heterotypic_interactions/permuted_interactions.csv
This file contains the following columns:

  • permutation = Current round of permutation
  • Metadata_sample_name: Sample name matching a value in the sample_metadata_file file.
  • CellName1: Unique identifier for each cell.
  • Location_Center_X1: X coordinate of the cell centroid in the image.
  • Location_Center_Y1: Y coordinate of the cell centroid in the image.
  • spatial_analysis_cell_type1: Cell type or phenotype of CellName1.
  • CellName2: Unique identifier for each cell.
  • Location_Center_X2: X coordinate of the cell centroid in the image.
  • Location_Center_Y2: Y coordinate of the cell centroid in the image.
  • spatial_analysis_cell_type2: Cell type or phenotype of CellName2.
  • distance = Euclidean distance between CellName1 (Location_Center_X1, Location_Center_Y1) and CellName2 (Location_Center_X2, Location_Center_Y2). The distance is measured in pixels.
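
The label reshuffling behind this process can be sketched as follows. The Python example is a simplified illustration under several assumptions: labels are shuffled among all cells of a single sample, the cell names, coordinates and labels are hypothetical, the random seed is fixed only for reproducibility, and only 3 permutations are run to keep the example small.

```python
import random
from math import hypot


def permuted_min_distances(cells, type1, type2, permutations, seed=0):
    """Shuffle the cell labels and recompute the minimum distances each round."""
    rng = random.Random(seed)
    labels = [cell["cell_type"] for cell in cells]
    rows = []
    for permutation in range(1, permutations + 1):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # positions stay fixed, labels are reassigned
        group1 = [c for c, lab in zip(cells, shuffled) if lab == type1]
        group2 = [c for c, lab in zip(cells, shuffled) if lab == type2]
        for c1 in group1:
            distance = min(
                hypot(c1["x"] - c2["x"], c1["y"] - c2["y"]) for c2 in group2
            )
            rows.append({
                "permutation": permutation,
                "CellName1": c1["name"],
                "distance": distance,
            })
    return rows


# Hypothetical cells of one sample with their original labels.
sample_cells = [
    {"name": "c1", "x": 0, "y": 0, "cell_type": "A"},
    {"name": "c2", "x": 5, "y": 0, "cell_type": "A"},
    {"name": "c3", "x": 0, "y": 5, "cell_type": "B"},
    {"name": "c4", "x": 9, "y": 9, "cell_type": "B"},
    {"name": "c5", "x": 3, "y": 7, "cell_type": "other"},
    {"name": "c6", "x": 8, "y": 2, "cell_type": "other"},
]
permuted = permuted_min_distances(sample_cells, "A", "B", permutations=3)
```

Shuffling keeps the number of cells per label constant, so every permutation yields the same number of rows as the observed analysis.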

This process can be skipped by setting the skip_permuted_interactions parameter to true.

C.5B.4) Heterotypic analysis permutation test visualisation

This process plots the results of the permutation test performed by the heterotypic analysis permutation test process.

Inputs and parameters:

  • sample_metadata_file with the metadata of all samples used in the analysis. If the value of the comparison column for the sample is "NA", then no plotting is performed for this sample.
  • heterotypic_interactions_metadata metadata file with the following columns:
    • cell_file1: File to read the cell annotations for the first cell type or phenotype from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column1: Name of the column in the cell_file containing the annotation of the first cell type or phenotype.
    • cell_type1: Name of the cell type or phenotype to cluster, must match one of the values of the first cell_type_column in the cell_file.
    • cell_file2: File to read the cell annotations for the second cell type or phenotype from; it must be one of:
      • identification: annotated cells from the cell masking process or supplied through the annotated_cell_data_file file parameter.
      • thresholding: cells phenotyped from the expression thresholding process or supplied through the thresholded_cell_data_file file parameter.
      • clustering: cells phenotyped from the unsupervised clustering process or supplied through the clustered_cell_data_file file parameter.
    • cell_type_column2: Name of the column in the cell_file containing the annotation of the second cell type or phenotype.
    • cell_type2: Name of the cell type or phenotype to cluster, must match one of the values of the second cell_type_column in the cell_file.
  • heterotypic_interactions_file: file with the following columns:
    • Metadata_sample_name: Sample name matching a value in the sample_metadata_file file.
    • CellName1: Unique identifier for each cell.
    • Location_Center_X1: X coordinate of the cell centroid in the image.
    • Location_Center_Y1: Y coordinate of the cell centroid in the image.
    • spatial_analysis_cell_type1: Cell type or phenotype of CellName1.
    • CellName2: Unique identifier for each cell.
    • Location_Center_X2: X coordinate of the cell centroid in the image.
    • Location_Center_Y2: Y coordinate of the cell centroid in the image.
    • spatial_analysis_cell_type2: Cell type or phenotype of CellName2.
    • distance = Euclidean distance between CellName1 (Location_Center_X1, Location_Center_Y1) and CellName2 (Location_Center_X2, Location_Center_Y2). The distance is measured in pixels.
  • shuffled_interactions_file: file with the following columns:
    • permutation = Current round of permutation
    • Metadata_sample_name: Sample name matching a value in the sample_metadata_file file.
    • CellName1: Unique identifier for each cell.
    • Location_Center_X1: X coordinate of the cell centroid in the image.
    • Location_Center_Y1: Y coordinate of the cell centroid in the image.
    • spatial_analysis_cell_type1: Cell type or phenotype of CellName1.
    • CellName2: Unique identifier for each cell.
    • Location_Center_X2: X coordinate of the cell centroid in the image.
    • Location_Center_Y2: Y coordinate of the cell centroid in the image.
    • spatial_analysis_cell_type2: Cell type or phenotype of CellName2.
    • distance = Euclidean distance between CellName1 (Location_Center_X1, Location_Center_Y1) and CellName2 (Location_Center_X2, Location_Center_Y2). The distance is measured in pixels.

Outputs:
The outputs of this process are stored at: $output_folder/Plots/Heterotypic_Interaction_Plots/Permutations.
For each spatial_analysis_cell_type1-spatial_analysis_cell_type2 pair there is a folder $output_folder/Plots/Heterotypic_Interaction_Plots/Permutations/spatial_analysis_cell_type1-spatial_analysis_cell_type2/ with the following plots:

  • spatial_analysis_cell_type1-spatial_analysis_cell_type2-all-heterotypic_permutations.pdf: density plot with the expected distribution of the minimum distances between the two cell types in all samples (non NA in the sample_metadata_file).
  • spatial_analysis_cell_type1-spatial_analysis_cell_type2-category-heterotypic_permutations.pdf (optional): density plot with the expected distribution of the minimum distances between the two cell types in all samples (non NA in the sample_metadata_file) divided by sample category. Produced if the comparison metadata column of the sample_metadata_file has at least 2 (non "NA") categories.
  • spatial_analysis_cell_type1-spatial_analysis_cell_type2-category-heterotypic_permutations.pdf (optional): density plot with the expected distribution of the minimum distances between the two cell types in all samples (non NA in the sample_metadata_file) in the first category minus the second category. Produced if the comparison metadata column of the sample_metadata_file has exactly 2 (non "NA") categories.

The FDR is calculated with the correction applied across all spatial_analysis_cell_type1-spatial_analysis_cell_type2 combinations for each set of plots.

This process can be skipped by setting the skip_permuted_visualization parameter to true.