# Introduction

Comparing samples defined over a single dimension is a straightforward task that relies on standard, well established methods. Meanwhile distance between samples in high dimensional space remains a largely unexplored field. Available solutions rely on multivariate normal distributions, a condition that is both difficult to check and overlooking key behaviors in biological samples, where populations of interest often correspond to a small proportion (<1%) of all the points that have been measured.

We have developed hilbertSimilarity to address the problem of sample similarity in mass cytometry where samples are measured over up to 100 dimensions, and where each sample contains a slightly different number of points (or cells). Our method first transforms each sample into a probability vector, by dividing each dimension into a fixed number of bins and by associating each cell to a specific multidimensional cube. The proportion of cells in each hypercube is characteristic of a given sample. To characterize an compare samples we use measures from Information Theory, since their interpretation in terms of similarity and complexity is straightforward.

After determining sample similarity, our method can also be used to determine which cells are different between samples, by comparing the number of cells in each hypercube to a reference treatment. Since every sample contains a different number of cells, we use a bootstrap approach where the same number of cells is sampled from each treatment to allow for a direct comparison.

To demonstrate the power of hilbertSimilarity we applied the method to a subset of the bodenmiller et al. dataset, comparing the effect of different stimulations and identifying groups of cells that are significantly affected by different treatments.

Compared to other methods, hilbertSimilarity does not rely on expert-driven gating, or require any hypothesis about the number of populations that are present in the data. This makes it a powerful tool to quickly assess a dataset before using traditional methods, or when populations a not known a priori.

# Installation

hilbertSimilarity can be installed using the following command

devtools::install_github(yannabraham/hilbertSimilarity)

Once installed the package can be loaded using the standard library command.

library(hilbertSimilarity)

We first need data from the bodenmiller package:

library(bodenmiller)
data(refPhenoMat)
data(untreatedPhenoMat)
data(refFuncMat)
data(untreatedFuncMat)
data(refAnnots)
refAnnots$Treatment <- 'reference' data(untreatedAnnots) fullAnnots <- rbind(refAnnots[,names(untreatedAnnots)], untreatedAnnots) fullAnnots$Treatment <- factor(fullAnnots$Treatment) fullAnnots$Treatment <- relevel(fullAnnots$Treatment,'reference') refMat <- cbind(refPhenoMat,refFuncMat) untreatedMat <- cbind(untreatedPhenoMat, untreatedFuncMat) fullMat <- rbind(refMat,untreatedMat) fullMat <- apply(fullMat,2,function(x) { x[x<0] <- 0 return(x) } ) In this dataset 51936 cells corresponding to 4 have been measured over 23 channels. Cells have been gated into 14 populations. The following treatments are compared to one another: reference BCR/FcR-XL PMA/Ionomycin vanadate cd14-hladr- 22 18 25 27 cd14-hladrhigh 60 19 4 1 cd14-hladrmid 244 149 54 18 cd14-surf- 1807 1845 1520 1824 cd14+hladr- 13 17 15 91 cd14+hladrhigh 21 3 5 1 cd14+hladrmid 1373 839 298 155 cd14+surf- 81 77 214 166 cd4+ 3989 4076 4546 503 cd8+ 5068 5098 5122 1200 dendritic 58 30 28 23 igm- 232 330 181 90 igm+ 1069 863 992 395 nk 1755 1888 1572 1822 The smallest cell population, cd14+hladrhigh, corresponds to 0.058% of the total cells. # Computing the similarity between treatments We will first compute the similarity between treatments, by generating a 3rd-order Hilbert curve using combined cuts: • computing the cuts # compute the nbins <- 3 cuts <- make.cut(fullMat, n=nbins+1, count.lim=40) ## CD20 ## IgM ## CD4 ## CD33 ## HLA.DR ## CD14 ## CD7 ## CD3 ## CD123 ## pStat1 ## pSlp76 ## pBtk ## pPlcg2 ## pErk ## pLat ## pS6 ## pNFkB ## pp38 ## pStat5 ## pAkt ## pSHP2 ## pZap70 ## pStat3 • cutting the matrix cutFullMat <- do.cut(fullMat,cuts,type='combined') • ordering the dimensions library(entropy) miFullMat <- matrix(NA,nrow=ncol(fullMat),ncol = ncol(fullMat) ) for (i in seq(ncol(fullMat)-1)) { for (j in seq(i+1,ncol(fullMat))) { cur.tbl <- table(cutFullMat[,i],cutFullMat[,j]) nent <- 1-mi.empirical(cur.tbl)/entropy.empirical(cur.tbl) miFullMat[i,j] <- miFullMat[j,i] <- nent } } dimnames(miFullMat) <- list(colnames(fullMat),colnames(fullMat)) hcFullMat <- hclust(as.dist(miFullMat)) • generating the Hilbert curve hc <- do.hilbert(cutFullMat[,rev(hcFullMat$order)],2)
• Visualizing the different samples
library(reshape2)
library(dplyr)
library(ggplot2)
treatment.table <- with(fullAnnots,
table(hc,Treatment))
treatment.pc <- apply(treatment.table,2,function(x) x/sum(x))
av <- andrewsProjection(t(treatment.pc),breaks=40)
treatment.proj <- t(treatment.pc) %*% t(av$freq) melt(treatment.proj) %>% rename(AndrewsIndex=Var2) %>% mutate(AndrewsIndex=av$i[AndrewsIndex]) %>%
ggplot(aes(x=AndrewsIndex,y=value))+
geom_line(aes(group=Treatment,color=Treatment))+
theme_light(base_size=16)+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())

# Identifying treatment effects

For the comparison to be meaningful, we need to limit the analysis to bins that contain a minimum number of cells. This number is arbitrary, and represents the smallest group of cells that should be called a population. We also need to define the number of bootstrap, the significant fold change, and the limit of significance:

# total cells per Hilbert index per treatment
treatment.table <- with(fullAnnots,
table(hc,Treatment))
# minimum number of cells
lim <- 40
# number of bootstraps
N <- 400
# significant fold change
flim <- 2
# limit of significance
p <- 0.95
# bins with enough cells
boot.ind <- rownames(treatment.table)[which(rowSums(treatment.table)>=lim)]
# number of cells in each bootstrap
ncells <- floor(min(colSums(treatment.table[boot.ind,]))/1000)*1000

For this exercise we use 40 as the minimum population size. 218 indices out of 7114 contain enough cells.

Because each treatment corresponds to a different number of cells, we will use a bootstrap approach to identify indices that are significantly different between conditions. we will bootstrap 2000 cells from the selected bins, then check which bin contains at least 2 times more cells in the reference compared to other treatments. This operation will be repeated 400 times, and any bin that is consistently different in 380 bins will be considered significant.

library(abind)
boot.mat <- lapply(seq(1,N),function(x) {
if (x%/%100==x/100) cat(x,'\n')
# bootstrap hilbert indices
cur.boot <- sample(which(hc %in% boot.ind),
size=ncells,
replace=TRUE)
return(table(factor(hc[cur.boot],levels=boot.ind),
droplevels(fullAnnots[cur.boot,'Treatment'])))
}
)
## 100
## 200
## 300
## 400
boot.mat <- abind(boot.mat,along=3)

boot.log <- log2(boot.mat)
boot.fold <- sweep(boot.log[,-1,],c(1,3),boot.log[,1,])
boot.sign <- sign(boot.fold)
boot.sign[abs(boot.fold)<log2(flim)] <- 0
boot.res <- apply(boot.sign,c(1,2),function(x) table(sign(x))[c('-1','0','1')])
boot.res <- aperm(boot.res,c(2,1,3))
boot.res[is.na(boot.res)] <- 0
dimnames(boot.res)[[2]] <- c('-1','0','1')

After the analysis, the following number of bins are found significant:

sig.res <- apply(boot.res[,c('-1','1'),]>(N*p),c(1,3),any)
sum.sig.res <- cbind(colSums(sig.res),
colSums(treatment.table[boot.ind[rowSums(sig.res)>0],colnames(sig.res)]))

p.cells.sig.res <- lapply(colnames(sig.res),
function(cur.cl) {
x <- treatment.table[boot.ind,cur.cl]
return(sum(x[sig.res[,cur.cl]])/sum(x))
}
)
names(p.cells.sig.res) <- colnames(sig.res)
sum.sig.res <- cbind(sum.sig.res,
round(100*unlist(p.cells.sig.res),0))
colnames(sum.sig.res) <- c('Number of indices',
'Number of Significant cells',
'Percent Significant Cells')
kable(sum.sig.res)
Number of indices Number of Significant cells Percent Significant Cells
BCR/FcR-XL 3 5788 1
PMA/Ionomycin 47 4129 46

Focusing on BCR/FcR-XL, which are the 3 indices that were found significant?

kable(treatment.table[boot.ind[sig.res[,'BCR/FcR-XL']],])
2098176 1 97 32 0
4194304 69 3 7 13
6291455 198 14 23 48

Compared to the unstimulated sample, cells are decreasing in 2 bins and increasing in a 3rd one. Based on manual gating, what do these bins correspond to?

cur.treatment <- 'BCR/FcR-XL'
sig.cells <- with(fullAnnots,table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]]])))
total.sig.cells <- with(fullAnnots,table(droplevels(Cells[Cells %in% names(sig.cells) &
hc %in% boot.ind])))
treatment.sig.cells <- with(fullAnnots,
table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]] &
Treatment==cur.treatment])))
treatment.total.cells <- with(fullAnnots,
table(droplevels(Cells[Cells %in% names(sig.cells) &
hc %in% boot.ind &
Treatment==cur.treatment])))
sum.sig.cells <- cbind(sig.cells,
total.sig.cells,
round(100*sig.cells/total.sig.cells),
treatment.sig.cells,
treatment.total.cells,
round(100*treatment.sig.cells/treatment.total.cells))

colnames(sum.sig.cells) <- c('Number of Significant Cells',
'Total number of Cells',
'Percent of Total',
paste('Number of Signigicant Cells in',cur.treatment),
paste('Number of Cells in',cur.treatment),
paste('Percent of',cur.treatment))
kable(sum.sig.cells)
Number of Significant Cells Total number of Cells Percent of Total Number of Signigicant Cells in BCR/FcR-XL Number of Cells in BCR/FcR-XL Percent of BCR/FcR-XL
igm- 30 288 10 23 119 19
igm+ 475 1373 35 91 320 28

All the cells that are found significant are B cells, and they represent a large fraction of all B cells in the BCR/FcR-XL sample, even when only taking the bootstrapped indices into account.