Example analysis
This is an example workflow demonstrating all the different functionalities the Sainsc.jl
package offers.
General workflow
We start by loading the package and defining the path of some example data.
using Sainsc
using ImageTransformations
using ImageShow
using Printf
data_path = "./data";
stereoseq_file = joinpath(data_path, "Mouse_brain_Adult_GEM_bin1.tsv.gz");
signature_file = joinpath(data_path, "signatures.csv");
color_file = joinpath(data_path, "celltype_colors.json");
Loading data
We read the data from file using readstereoseq
. The input will usually be a GEM file of a StereoSeq experiment. The output is a GridCounts
of gene counts stored as SparseArrays.SparseMatrixCSC
. The input data stems from the original Stereo-seq publication and can be downloaded here.
counts = readstereoseq(stereoseq_file)
GridCounts{String,Int16} (10500, 13950) with 26177 genes
We also load pre-defined cell-type signatures
using CSV
using DataFrames
# read pre-determined cell-type signatures
signatures = CSV.read(signature_file, DataFrame);
celltypes = signatures.Celltype;
select!(signatures, Not(:Celltype));
# remove genes that are not detected in Stereo-seq experiment
signatures = signatures[:, names(signatures) .∈ [keys(counts)]];
print(signatures[1:5, 1:8])
5×8 DataFrame
Row │ 2010300C02Rik Acsbg1 Acta2 Acvrl1 Adamts2 Adamtsl1 Adgrl4 Aldh1a2
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────
1 │ 0.00168 0.00719 0.0161 0.0168 0.0022 0.000558 0.0167 0.0121
2 │ 0.00559 0.0261 0.00108 0.000594 0.00132 0.000643 0.00124 0.000383
3 │ 0.00562 0.00356 0.00148 0.00021 0.000424 0.000808 0.000479 0.000257
4 │ 0.0191 0.00297 0.00177 0.000292 0.00188 0.000323 0.00043 0.000275
5 │ 0.0155 0.00467 0.00224 0.000508 0.000446 0.00248 0.000732 0.000257
We can have a look at the overall structure by calculating the totalrna
which is basically the sum across all genes per pixel.
total_counts = totalrna(counts);
simshow(imresize(total_counts; ratio=1 / 45))
Kernel
Next, we define a gaussiankernel
to use for smoothing and thus integrating the gene expression locally across pixels. The relevant parameters are the abndwidth of the kernel and the radius i.e. how large the kernel will be measured in bandwidths. For example, setting the bandwidth $bw=8$ and the radius $r=2$ will result in a kernel of size $2r*bw+1=33$.
We can also convert the kernel to Float32
to reduce memory usage later on.
kernel = gaussiankernel(8, 2);
kernel = convert.(Float32, kernel);
We can evaluate the effects by e.g. applying the kernel to the totalRNA.
total_rna = kde(Matrix(totalrna(counts)), kernel);
simshow(imresize(total_rna; ratio=1 / 45))
Local Maxima
The smoothed totalRNA is used to find local maxima of that can be used as cell proxies to identify gene signatures or for further analysis.
findlocalmaxima
allows you to specify a minimum distance between to neighboring local maxima.
lm = findlocalmaxima(total_rna, 6);
println("$(length(lm)) local maxima detected")
39238 local maxima detected
Once we identified the local maxima we can load them for a defined set of genes. getlocalmaxima
allows you to load the maxima in a variety of output types. Have a look at the existing Sainsc.jl
extensions. Here we are going to load them as Muon.AnnData
object.
using Muon
adata = getlocalmaxima(AnnData, counts, lm, kernel; genes=names(signatures))
AnnData object 39238 ✕ 247
Cell-type map
Once we have identified gene signatures from literature, previous experiments, or by analysing local maxima we can assign a celltype to each pixel. This is usually the most time-intensive processing step. assigncelltype
returns 3 matrices; a map of assigned celltypes of each pixel as CategoricalArrays.CategoricalMatrix
, the cosine similarity of the assigned cell type for each pixel, and the assignment score as a confidence of the cell type assignment.
celltype_map, cosine_sim, assignment_conf = assigncelltype(
counts, signatures, kernel; celltypes=celltypes
);
In the final step we can visualize the cell-type map.
using Colors
using JSON
# generate the lookup-table for cell-type colors
lut = Dict(Iterators.map(((k, v),) -> k => RGB(v...), pairs(JSON.parsefile(color_file))));
background_color = RGB(0, 0, 0);
celltype_img = map(x -> get(lut, x, background_color), celltype_map);
simshow(imresize(celltype_img; ratio=1 / 45))
When visualizing the cell-type map, it might be beneficial to remove background noise. A histogram of the totalRNA can be helpful to define a threshold.
using SparseArrays
using Plots
histogram(nonzeros(sparse(total_rna)); title="KDE of totalRNA", bins=200)
celltype_img[total_rna .< 1.5] .= background_color;
simshow(imresize(celltype_img; ratio=1 / 45))
The assignment score can be useful to evaluate the confidence of assignment.
simshow(imresize(Gray.(assignment_conf); ratio=1 / 45))
Utils
Some additional useful functions are included in Sainsc.jl
.
Cropping & Masking
crop!
allows you to slice the counts across all genes in x-y dimension.
crop!(counts, (4_001:7_000, 6_001:9_000));
simshow(imresize(kde(totalrna(counts), kernel); ratio=1 / 10))
mask!
allows you to filter the field of view by an arbitrary binary mask.
# create a mask
roi_mask = zeros(Bool, 3_000, 3_000)
roi_mask[1:1_500, 1_501:3_000] .= true
roi_mask[1_501:3_000, 1:1_500] .= true
mask!(counts, roi_mask);
simshow(imresize(kde(totalrna(counts), kernel); ratio=1 / 10))
Binning
The same extensions that are available to load local maxima are also available for reading the StereoSeq data into bins using readstereoseqbinned
. This can be useful for e.g. identifying gene signatures from binned data, or detecting highly variable genes that can be used for analysis. We again here demonstrate it using Muon.AnnData
.
adata = readstereoseqbinned(AnnData, stereoseq_file, 50)
AnnData object 38741 ✕ 26177