Spatial transcriptomics (ST) is poised to transform our understanding of biology and disease by providing insights into the spatial organization of gene expression at the tissue, cellular, and subcellular levels. At the tissue level, ST can help study tissue architecture, cell-cell interactions, spatially restricted gene expression patterns, and disease pathology. At the cellular level, ST can be used to study cellular diversity, developmental trajectories, cell-to-cell interactions, and regulatory networks within complex tissues. It can also help identify rare cell populations and subtypes.
Finally, at the subcellular level, ST can be used to study RNA localization, RNA-protein interactions, and organelle-specific gene expression patterns within cells.
Spatial transcriptomics can also be used to understand cellular processes such as RNA transport, localization, and translation. Innovations like spatially barcoded RNA sequencing enable unprecedented spatial resolution and sensitivity in gene expression analysis. By integrating spatially resolved gene expression data with other omics data and imaging modalities, researchers can gain a comprehensive understanding of complex biological phenomena such as cell-cell interactions, cellular dynamics, and disease mechanisms within heterogeneous tissue microenvironments.
However, the analysis and visualization of spatial transcriptomics data come with certain challenges. This blog examines those challenges and suggests ways to mitigate them by surveying the categories of tools available for analyzing and visualizing spatial transcriptomics data.
Spatial transcriptomic data, which capture gene expression patterns, pose several challenges, some of which are listed below:
Cell segmentation in ST data is difficult because of variability in cell shape, size, and morphology, and it becomes more challenging in regions of the tissue with dense cell clusters or overlapping cells. Advances in ST technology now provide denser spot placement at subcellular resolution, but these newer methods still struggle with cell segmentation and the assignment of spots to cells. Taking full advantage of the information profiled by ST requires more than image-based methods alone.
Achieving high spatial resolution can be challenging for several reasons: the density and distribution of the spatial barcodes used to label different regions of the sample; tissue sectioning techniques; the distance between spatially resolved points or regions in the sample; mixing of gene expression signals due to cell-to-cell interactions, tissue architecture, and molecular diffusion; and image analysis algorithms that struggle to accurately delineate individual cells or cellular boundaries in spatially resolved images. In addition, integrating ST data with imaging data, such as histological images or spatially resolved protein expression data, presents challenges in data fusion, registration, and visualization because the modalities differ in type, scale, and resolution.
Identifying spatially variable genes can be difficult for many reasons: biological variation in cell types, tissue architecture, and microenvironmental cues; technical noise and artifacts; spatial autocorrelation; limited sample size and spatial resolution; false discoveries; inappropriate statistical methods; or a failure to integrate ST data with spatial metadata such as histological annotations, cell segmentation masks, or spatial coordinates. Such datasets need to be preprocessed and curated before they can be used for downstream analysis.
ST datasets can be large and complex, containing gene expression profiles for thousands to millions of spatially defined spots or regions within tissue sections. Besides being voluminous, these datasets are high-dimensional with spatial coordinates for each spot or region. Managing and integrating spatial coordinates with gene expression data adds complexity to data storage and analysis workflows.
Further, spatial transcriptomics data often needs to be integrated with other omics data for which compatibility and consistency must be ensured. Metadata such as sample information, experimental conditions, imaging parameters, and quality control metrics must also be standardised and stored in a retrievable and analysable format. ST data may contain sensitive information such as genomic data, patient samples, or experimental details; it is important that platforms ensure data security, privacy and compliance with regulatory norms.
Finally, ST data from large-scale projects or longitudinal studies may need to be stored and archived for long periods, which requires practices that ensure data integrity, accessibility, backup, versioning, and preservation.
Analysis of spatial transcriptomics data should account for both its high dimensionality and its spatial structure. Several tools are available for the analysis and visualization of spatial transcriptomics data at each step in the pipeline; they can be grouped into the following categories:
Raw count pipelines convert raw sequencing data obtained from spatially resolved gene expression experiments into count matrices. A count matrix records the number of times each gene is observed in each spatially defined spot or feature within a tissue section. The pipeline involves quality control, barcode demultiplexing to assign reads to their respective samples or spots, alignment to a reference genome, and gene quantification, where aligned reads are counted to determine the number of reads that map to each gene. The filtered gene expression counts are then organised into count matrices in which rows represent genes and columns represent spots or regions within the tissue section. Tools such as STtools, Space Ranger, and STARsolo help generate raw count matrices. Between them, these tools handle data from platforms such as Seq-Scope, Slide-seq, and Visium.
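The final quantification step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how STtools, Space Ranger, or STARsolo work internally: it assumes reads have already been demultiplexed and aligned, so each read is reduced to a (spot barcode, gene) pair. The function name, barcodes, and gene names are made up for the example.

```python
from collections import Counter


def build_count_matrix(assigned_reads):
    """Build a genes x spots count matrix from (spot_barcode, gene) pairs.

    Assumes demultiplexing and alignment have already happened upstream.
    """
    assigned_reads = list(assigned_reads)
    # Count how many reads support each (spot, gene) combination.
    counts = Counter(assigned_reads)
    genes = sorted({gene for _, gene in assigned_reads})
    spots = sorted({spot for spot, _ in assigned_reads})
    # Rows are genes, columns are spots; a missing combination counts as 0.
    matrix = [[counts[(spot, gene)] for spot in spots] for gene in genes]
    return genes, spots, matrix


# Toy input: two spots (hypothetical barcodes) and three genes.
reads = [
    ("AAAC", "Actb"), ("AAAC", "Actb"), ("AAAC", "Gapdh"),
    ("TTTG", "Actb"), ("TTTG", "Sox2"), ("TTTG", "Sox2"),
]
genes, spots, matrix = build_count_matrix(reads)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Real pipelines also handle UMI deduplication and quality filtering, which this sketch leaves out.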
Raw gene expression counts need to be normalized to account for differences in sequencing depth between samples or spots within a tissue section. Common normalization methods adjust for:
- differences in sequencing depth or library size between samples or spots,
- gene length biases,
- amplification biases,
- other sources of technical variability.
These adjustments aim to minimize the impact of highly expressed genes and batch effects, stabilize variance, and reduce skewness, which improves the distributional properties of the data. Normalization is typically performed using specialized libraries or tools. For example, in R, packages such as SpatialExperiment, STUtility, or Seurat may be used to normalize spatial transcriptomics data. In Python, libraries such as Scanpy (which operates on AnnData objects) or custom normalization functions can be used to normalize gene expression counts.
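As a rough illustration of the simplest of these adjustments, the sketch below applies library-size normalization followed by a log transform to a genes x spots matrix. It is a hand-written stand-in for what libraries such as Scanpy do with a total-count normalization and a log1p step, not their implementation; the function name and the `target_sum` value are assumptions chosen for the example.

```python
import math


def normalize_log1p(matrix, target_sum=10_000):
    """Library-size normalize a genes x spots matrix, then apply log1p.

    Each spot (column) is rescaled so its counts sum to target_sum, which
    removes differences in sequencing depth between spots. log1p then
    reduces skewness and helps stabilize variance.
    """
    n_spots = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_spots)]
    normalized = []
    for row in matrix:
        normalized.append([
            math.log1p(value * target_sum / totals[j]) if totals[j] > 0 else 0.0
            for j, value in enumerate(row)
        ])
    return normalized


# Two spots with the same composition, but the second was sequenced 10x deeper.
counts = [[2, 20], [1, 10], [1, 10]]
norm = normalize_log1p(counts, target_sum=100)
print(norm[0])  # both spots now carry the same value for the first gene
```

After normalization the two spots are indistinguishable, which is the intended effect: the tenfold difference reflected depth, not biology.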
Downstream analysis helps extract meaningful biological insights, identify spatially regulated genes or cell types, unravel spatial patterns, and understand the spatial organization of gene expression within tissues. Tools such as Squidpy help with spatial autocorrelation analysis, spatial enrichment analysis, spatial neighbourhood analysis, and spatial interaction analysis in order to identify spatial patterns, clusters, cell types, or spatially regulated genes within tissues. Squidpy is typically used together with Scanpy, which supplies dimensionality reduction methods such as principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (t-SNE) for visualizing spatial transcriptomics data. Likewise, RNA velocity methods can help model the dynamics of gene expression from the ratio of unspliced to spliced RNA counts, while accounting for noise, technical variability, and spatial dependencies.
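Spatial autocorrelation is commonly summarized with Moran's I, which is positive when neighbouring spots have similar expression and negative when they tend to differ. The sketch below computes it from scratch with binary neighbour weights; it is meant to show what the statistic measures, not to reproduce Squidpy's implementation. The function name and the `radius` neighbourhood rule are assumptions made for this example.

```python
import math


def morans_i(values, coords, radius=1.0):
    """Moran's I for one gene, using binary weights.

    Two distinct spots are neighbours (weight 1) if they lie within
    `radius` of each other; all other weights are 0.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    num = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                num += dev[i] * dev[j]
                weight_sum += 1.0
    if denom == 0 or weight_sum == 0:
        return float("nan")  # undefined for constant values or no neighbours
    return (n / weight_sum) * (num / denom)


# Four spots on a line, one unit apart.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
clustered = morans_i([1, 1, 0, 0], coords)    # similar values sit together
alternating = morans_i([1, 0, 1, 0], coords)  # neighbours always differ
print(clustered, alternating)
```

The double loop is quadratic in the number of spots, so real datasets need a sparse neighbour graph instead; the formula stays the same.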
One component of downstream analysis is the visualization of spatially resolved gene expression patterns using techniques such as spatial maps, heatmaps, spatial embeddings, spatial networks, and spatial overlays on tissue images in order to understand cell-cell interactions, signaling pathways, spatial coordination, and regulatory networks. Tools like CellxGene provide an interactive environment to explore gene expression patterns, spatial relationships, and cell types within tissues through interactive plots, heatmaps, and scatterplots. Beyond spatial maps, web-based tools such as STViewer help visualize gene expression overlays on tissue images and spot-level gene expression profiles, identify spatial clusters, perform spatial correlation analysis, and explore spatially regulated genes.
If you want a more granular comparison of these tools, you can read our blog on Visualization Tools for Spatial Transcriptomics Data.
Elucidata’s data harmonization platform Polly harmonizes high-throughput spatial transcriptomics data and single-cell data sourced from in-house assays and diverse public repositories, and stores these datasets in two formats: unfiltered raw counts and custom processed counts. Each dataset is uniquely identified and annotated based on the in-house or public database from which it was ingested. A JSON file containing essential input parameters is then generated, which guides subsequent pipeline steps.
Once the pipeline is determined, an h5ad file is generated containing raw counts, spatial coordinates, and the required image files. Checks are performed to validate data integrity at each step. Polly can then perform differential gene expression analysis by recommending suitable statistical methods; it can also help interpret the results, including fold changes, p-values, and adjusted statistics, making it easier to identify and map significant differentially expressed genes. Subsequently, the counts are aggregated at the gene level by summing the transcript-level counts associated with each gene. Finally, an X_spatial embedding is added to the h5ad file’s obsm slot to enable spatial visualization within CellxGene. Polly supports sources such as Gene Expression Omnibus, Single Cell Portal, Zenodo, CZI-CellxGene, and other publications.
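The gene-level aggregation step can be pictured as follows. This is a minimal sketch of the general technique of summing transcript-level counts per gene, not Polly's actual code, which is not shown here. The function name, transcript IDs, and gene names are all hypothetical.

```python
def aggregate_to_gene_level(transcript_counts, transcript_to_gene):
    """Sum transcript-level counts into gene-level counts, spot by spot.

    transcript_counts: dict of transcript ID -> list of counts (one per spot)
    transcript_to_gene: dict of transcript ID -> gene name
    """
    gene_counts = {}
    for transcript_id, counts in transcript_counts.items():
        gene = transcript_to_gene[transcript_id]
        if gene not in gene_counts:
            gene_counts[gene] = list(counts)
        else:
            # Add this transcript's counts to the running per-spot totals.
            gene_counts[gene] = [a + b for a, b in zip(gene_counts[gene], counts)]
    return gene_counts


# Hypothetical example: two isoforms of GeneA and one of GeneB, across 3 spots.
transcript_counts = {
    "GeneA-201": [3, 0, 1],
    "GeneA-202": [1, 2, 0],
    "GeneB-201": [0, 5, 5],
}
transcript_to_gene = {
    "GeneA-201": "GeneA",
    "GeneA-202": "GeneA",
    "GeneB-201": "GeneB",
}
gene_counts = aggregate_to_gene_level(transcript_counts, transcript_to_gene)
print(gene_counts)
```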
To conclude, Polly provides an innovative solution for the analysis and visualization of spatial transcriptomics data, helping researchers gather insights into spatial gene expression and molecular biology. Polly’s user-friendly data harmonization platform integrates spatial datasets from diverse in-house and public sources and simplifies data retrieval and processing, thereby helping researchers explore spatial gene expression patterns and their implications for disease mechanisms and therapeutic interventions.
You can read more about our spatial efforts here or reach out to us at info@elucidata.io for more info.