Example analysis

This is an example workflow demonstrating all the different functionalities the Sainsc.jl package offers.

General workflow

We start by loading the package and defining the path of some example data.

using Sainsc

using ImageTransformations
using ImageShow
using Printf

data_path = "./data";

stereoseq_file = joinpath(data_path, "Mouse_brain_Adult_GEM_bin1.tsv.gz");
signature_file = joinpath(data_path, "signatures.csv");
color_file = joinpath(data_path, "celltype_colors.json");

We read the data from file using readstereoseq. The input will usually be a GEM file of a StereoSeq experiment. The output is a GridCounts of gene counts stored as SparseArrays.SparseMatrixCSC. The input data stems from the original Stereo-seq publication and can be downloaded here.

counts = readstereoseq(stereoseq_file)

GridCounts{String,Int16} (10500, 13950) with 26177 genes

We also load pre-defined cell-type signatures

using CSV
using DataFrames

# read pre-determined cell-type signatures
signatures = CSV.read(signature_file, DataFrame);
celltypes = signatures.Celltype;
select!(signatures, Not(:Celltype));

# remove genes that are not detected in Stereo-seq experiment
signatures = signatures[:, names(signatures) .∈ [keys(counts)]];

print(signatures[1:5, 1:8])

5×8 DataFrame
 Row │ 2010300C02Rik  Acsbg1   Acta2    Acvrl1    Adamts2   Adamtsl1  Adgrl4    Aldh1a2
     │ Float64        Float64  Float64  Float64   Float64   Float64   Float64   Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │       0.00168  0.00719  0.0161   0.0168    0.0022    0.000558  0.0167    0.0121
   2 │       0.00559  0.0261   0.00108  0.000594  0.00132   0.000643  0.00124   0.000383
   3 │       0.00562  0.00356  0.00148  0.00021   0.000424  0.000808  0.000479  0.000257
   4 │       0.0191   0.00297  0.00177  0.000292  0.00188   0.000323  0.00043   0.000275
   5 │       0.0155   0.00467  0.00224  0.000508  0.000446  0.00248   0.000732  0.000257

We can have a look at the overall structure by calculating the totalrna which is basically the sum across all genes per pixel.

total_counts = totalrna(counts);
simshow(imresize(total_counts; ratio=1 / 45))

Kernel

Next, we define a gaussiankernel to use for smoothing and thus integrating the gene expression locally across pixels. The relevant parameters are the abndwidth of the kernel and the radius i.e. how large the kernel will be measured in bandwidths. For example, setting the bandwidth $bw=8$ and the radius $r=2$ will result in a kernel of size $2r*bw+1=33$.

We can also convert the kernel to Float32 to reduce memory usage later on.

kernel = gaussiankernel(8, 2);
kernel = convert.(Float32, kernel);

We can evaluate the effects by e.g. applying the kernel to the totalRNA.

total_rna = kde(Matrix(totalrna(counts)), kernel);
simshow(imresize(total_rna; ratio=1 / 45))

Local Maxima

The smoothed totalRNA is used to find local maxima of that can be used as cell proxies to identify gene signatures or for further analysis.

findlocalmaxima allows you to specify a minimum distance between to neighboring local maxima.

lm = findlocalmaxima(total_rna, 6);
println("$(length(lm)) local maxima detected")

39238 local maxima detected

Once we identified the local maxima we can load them for a defined set of genes. getlocalmaxima allows you to load the maxima in a variety of output types. Have a look at the existing Sainsc.jl extensions. Here we are going to load them as Muon.AnnData object.

using Muon

adata = getlocalmaxima(AnnData, counts, lm, kernel; genes=names(signatures))

AnnData object 39238 ✕ 247

Cell-type map

Once we have identified gene signatures from literature, previous experiments, or by analysing local maxima we can assign a celltype to each pixel. This is usually the most time-intensive processing step. assigncelltype returns 3 matrices; a map of assigned celltypes of each pixel as CategoricalArrays.CategoricalMatrix, the cosine similarity of the assigned cell type for each pixel, and the assignment score as a confidence of the cell type assignment.

celltype_map, cosine_sim, assignment_conf = assigncelltype(
    counts, signatures, kernel; celltypes=celltypes
);

In the final step we can visualize the cell-type map.

using Colors
using JSON

# generate the lookup-table for cell-type colors
lut = Dict(Iterators.map(((k, v),) -> k => RGB(v...), pairs(JSON.parsefile(color_file))));

background_color = RGB(0, 0, 0);

celltype_img = map(x -> get(lut, x, background_color), celltype_map);
simshow(imresize(celltype_img; ratio=1 / 45))

When visualizing the cell-type map, it might be beneficial to remove background noise. A histogram of the totalRNA can be helpful to define a threshold.

using SparseArrays
using Plots

histogram(nonzeros(sparse(total_rna)); title="KDE of totalRNA", bins=200)

celltype_img[total_rna .< 1.5] .= background_color;

simshow(imresize(celltype_img; ratio=1 / 45))

The assignment score can be useful to evaluate the confidence of assignment.

simshow(imresize(Gray.(assignment_conf); ratio=1 / 45))

Utils

Some additional useful functions are included in Sainsc.jl.

Cropping & Masking

crop! allows you to slice the counts across all genes in x-y dimension.

crop!(counts, (4_001:7_000, 6_001:9_000));

simshow(imresize(kde(totalrna(counts), kernel); ratio=1 / 10))

mask! allows you to filter the field of view by an arbitrary binary mask.

# create a mask
roi_mask = zeros(Bool, 3_000, 3_000)
roi_mask[1:1_500, 1_501:3_000] .= true
roi_mask[1_501:3_000, 1:1_500] .= true

mask!(counts, roi_mask);

simshow(imresize(kde(totalrna(counts), kernel); ratio=1 / 10))

Binning

The same extensions that are available to load local maxima are also available for reading the StereoSeq data into bins using readstereoseqbinned. This can be useful for e.g. identifying gene signatures from binned data, or detecting highly variable genes that can be used for analysis. We again here demonstrate it using Muon.AnnData.

adata = readstereoseqbinned(AnnData, stereoseq_file, 50)

AnnData object 38741 ✕ 26177