# Application of Gaussian process to bioinformatics

## About this article

This is the 25th day article of Bioinformatics Advent Calendar 2019.

I read "Gaussian Process and Machine Learning" (Mochihashi Ohba, 2019) and became curious about how Gaussian processes are applied to bioinformatics, so I summarized what I studied.

A Gaussian process is a type of probabilistic generative model. For definitions, derivations, equation transformations, estimation algorithms, etc. of Gaussian processes, please read "Gaussian Process and Machine Learning" (Mochihashi Ohba 2019). Since there are good quality materials on the Internet about the Gaussian process in general (e.g. lecture materials by Professor Mochihashi), I would like 🙇 ♀️ to refrain from explaining the Gaussian process in this article.

## Application in transcriptional control analysis

Neil D. Lawrence and Antti Honkela have actively adopted the Gaussian process in response to the problem of estimating the "time series of activity" of known regulators (such as transcription factors) when given time series data on gene expression.

Lawrence+, Adv. Neural Inf. Process. Syst., 2007 modeled a time series of TF activity (concentration) as a Gaussian process, and the observational model formulated an observation model of the expression level of the target gene during linear and nonlinear differential equations. As a result, it is claimed that when time series data on gene expression levels are given, it was possible to efficiently calculate the susceptibility of known target genes. In Gao+, Bioinformatics, 2008, we proposed formulations in the case of Michaelismenten and transcriptional suppression.

Honkela+, PNAS, 2010 proposes to estimate the target gene of transcription factors from gene expression data alone (Prioritization is the goal). In addition, Titsias+, BMC Syst. Biol., 2012 proposes a method for extending the problem setting when there are multiple transcription factors and estimating the target gene of the transcription factor.

### Example of [Gao+, Bioinformatics, 2008]

For example, Gao+, Bioinformatics, 2008 applies the Gaussian process to the problem of estimating the time series of upstream regulator activity given the transcriptional control relationship with the time series data of mRNA quantities.

Suppose that the mRNA amount of gene $j$ at time $t$ $x_j(t)$ is activated by an upstream transcription factor. This time

- Baseline expression level of gene $j$ $B_j$
- The magnitude of the effect of upstream transcription factor on gene $j$ $S_j$
- mRNA degradation rate of gene $j$ $D_j$
- Upstream transcription factor time $t$ activity at $f(t)$

If we express transcriptional activation as a linear differential equation, the observation model of $x_j(t)$ is as follows.

$$x_j(t) = \\frac{B_j}{D_j} + S_j \\int_{0}^{t} e^{-D_j(t-u)} f(u) du$$

Now model $f(t)$ in the Gaussian process. that is

$$f(t) \\sim N(0, K)$$

Estimate $B_j$, $S_j$, $D_j$, and After estimating the parameters, the predicted distribution of $f(t)$

In addition, we can obtain the mRNA amount of gene $j$ at all times, including the time when not sampling, $x_j(t)$.

The authors argue that the merits of modeling by Gaussian processes are that the number of parameters can be reduced (by not discretizing the function of time) and that it is also open to nonlinearization (Michaerysmenten equation, suppression model).

## ChIP-seq Application to Data Analysis

Neil D. Lawrence and Antti Honkela have also introduced the Gaussian process to the analysis of Pol II ChIP-seq.

In wa Maina+, PLoS Comput. Biol., 2014, we applied a convolved Gaussian process that simultaneously considered covariance in time direction and spatial (bin on the genome) report to signals to gene bodies in Pol II ChIP-seq time series data. This allows us to estimate transcriptional velocity and activity from the dynamics of the Pol II signal, which reflect transcription initiation and elongation activity, and classify genes according to transcriptional dynamics. It is interesting to consider not only the time direction, but also the spatial direction at the same time.

Honkela+, PNAS, 2015 proposes an approach to model the time change in RNA transcriptional activity and mRNA production delay in the Gaussian process. By quantifying the time change in transcriptional activity and mRNA expression level from the time series data of Pol II ChIP-seq and RNA-seq, respectively, as input data, and estimating RNA production delay,

## Application to single-cell RNA-seq analysis

### Application to pseudo-time estimation

PLOS Comput Biol. 2016, Reid+, Bioinformatics, 2016, Ahmed+, Bioinformatics, 2018 propose a method of simulating the time of cells using GPLVM.

Boukouvalas+, Genome Biology, 2018 proposes a method called BGP (branching Gaussian process). It assumes a time series mixed model and estimates the branching pattern for each gene when the pseudotime of the cell and a rough branching pattern are given in advance.

### Application to analysis of intercellular dispersion of gene expression

Buettner+, Nature Biotechnology, 2015 proposes a method called scLVM, which applies GPLVM to scRNA-seq data analysis. In scLVM, assuming that the cell-to-cell variation of each gene can be explained by differences in the values of latent factors that vary from cell to cell, scLVM performs Gaussian process regression on all genes to examine the contribution of each latent factor or to remove the influence of specific latent factors (such as cell cycles).

First, GPLVM is used to estimate the values of the latent variables for each cell (and their intercellular covariance matrices). Fits a low-rank covariance matrix for a preset group of marker genes. Specifically, the number of cells is $N$, a latent factor (such as the cell cycle) is $h$, a group of marker genes related to $h$ is $g_h$, $\\mathbf{Y_h}$ is a matrix of $N\times |g_h|$ representing the expression level in each cell, $\\mathbf{C}$ is a matrix of $N \\times Q$ representing $Q$ known covariates per cell, $\\mathbf{X}$ is a matrix of $N \\times K$ representing $K$ unknown covariates per cell, and $\\mathbf{Y_h}$ is represented by the following linearly generation model:

$$\\mathbf{Y_h} = \\mu + \\mathbf{C}\\mathbf{U} + \\mathbf{X}\\mathbf{W} + \\mathbf{\\psi}$$

Now suppose that the prior distributions of known and unknown covariates weights $\\mathbf{U}$ and $\\mathbf{W}$ are independent normal distributions.

$$p(\\mathbf{C}) = \\prod_{q=1}^{Q} N(\\mathbf{u_q}|0, \\sigma_u^2 \\mathbf{I})$$ $$p(\\mathbf{W}) = \\prod_{k=1}^{Q} N(\\mathbf{w_k}|0, \\mathbf{I})$$

Now, by marginalizing the likelihood of $\\mathbf{Y_h}$ for $\\mathbf{C}$ and $\\mathbf{W}$, we get a representation of the following Gaussian process:

$$p(\\mathbf{Y_h}|\\sigma^2_u, \ u^2, \\mathbf{C},\\mathbf{X}) = \\prod_{g=1}^{|g_h|} N(\\mathbf{y_g}|\\mu_g\\mathbf{1}, \\sigma^2_u\\mathbf{C}\\mathbf{C}^{\\mathrm{T}} + \\mathbf{X}\\mathbf{X}^{\\mathrm{T}} + \ u \\mathbf{I} )$$

Data $\\mathbf{Y_h}$ and $\\mathbf{C}$ are given, so $\\sigma^2_u$, $\ Estimate u^2$, $\\mathbf{X}$.

Using $\\Sigma_h = \\mathbf{X}\\mathbf{X}^{\\mathrm{T}}$, which is estimated here, scLVM further examines the contribution of each potential factor to the intercellular variability of each gene and removes the effects of certain latent factors (such as the cell cycle) from expression levels.

To summarize, GPLVM obtains a per-cell latent variable ($\\mathbf{X}$) for a specific group of genes associated with a latent factor (such as a cell cycle-related gene), and performs Gaussian process regression on all genes based on the covariance matrix of the latent variable ($\\mathbf{\\Sigma}_h$) to examine the contribution for each latent factor. Removes the effects of certain latent factors (such as the cell cycle).

The f-scLVM proposed by the same authors in Buettner+, Genome Biology, 2017 uses a non-GPLVM-free approach to estimating potential variables in cells (albeit with similar formulations). Incidentally, Oliver Stegle , the final author of the scLVM and f-scLVM papers, is a co-author with Neil D. Lawrence in Fusi+, PLoS Comput. Biol., 2012, which will be discussed later.

## Applications in spatial transcriptome data

Recently, a technique in the genre of spatial transcriptome has emerged. This is a case where two-dimensional and three-dimensional spatial coordinates are linked to gene expression level data. Experimental technically, there are imaging-based ones (MERFISH, SeqFISH+, etc.) and sequencing bases (e.g. Visium).

In its application to spatial transcriptome data, a method of interpreting the contributions of each kernel has been proposed in a framework similar to multi-kernel learning.

For example, Svensson+, Nature Methods, 2018 proposed a method called SpatialDE. In SpatialDE, the expression level at each coordinate in space is modeled by the Gaussian process for each gene.

In this case, the covariance matrix is the sum of $\\sigma_s^2 \\cdot\\mathbf{\\Sigma}$, which consists of a kernel that grows closer to each other in coordinates, and $\\sigma_s^2 \\cdot\delta \\cdot \\mathbf{I}$, which is a covariance matrix independent between coordinates.

$$P(y|\\mu,\\sigma_s^2,\\delta,\\mathbf{\\Sigma})=N(y|\\mu \\cdot \\mathbf{1}, \\sigma_s^2 \\cdot (\\mathbf{\\Sigma} + \\delta \\cdot \\mathbf{I}))$$

As a result, SpatialDE can search for genes that fluctuate while having a pattern of "close distance, the expression level becomes closer".

Arnol+, Cell Reports, 2018, which is from the same group as SpatialDE, proposes a method called SVCA. In SVCA, the covariance matrix of the Gaussian process is the weighted sum of the kernels of intrinsic factors (such as cell type), cell-cell interactions, and environmental factors (the closer the distance, the closer the expression level becomes). By estimating these weights, we estimate what factors cause each gene to fluctuate.

$$P(Y) = N(0, K_{intrinsic} + K_{cell-cell interation} + K_{environment} + \\delta^2_{\\epsilon} I_n)$$ $$ K_intrinsic = \\sigma_I^2 XX^{T} $$ $$ K_{cell-cell interation} = \\sigma_{C-C}^2 ZXX^{T}Z^{T} \\mathrm{where} Z_{i,j} = exp(\\frac{-d_{i,j}^2}{2l^2}) $$ $$ K_{environment} = \\sigma_{E} exp{\\frac{-d_{i,j}^2}{2l^2}} $$

## Application to genetic statistics

Fusi+, PLoS Comput. Biol., 2012 proposes a PANAMA model that applies Gaussian process regression to eQTL (expression QTL) analysis. In this model, the gene expression level in each sample is determined by $K$ of SNPs (represented by the matrix $S$ of $N \times K$) and Q latent factors (represented by the matrix $X$ of $N \times Q$ (expressed in $ below).

$$Y=\\mathbf{\\mu} + \\mathbf{S}\\mathbf{V} + \\mathbf{X}\\mathbf{W} + \\mathbf{\\epsilon}$$

The authors argue that while existing methods for removing the effects of confounding factors pose a risk of confusing trans eQTL with confounding factors, the proposed method increases reproducibility between cohorts by simultaneously estimating confounding factors and trans eQTL.

Although the paper does not explicitly describe it as a Gaussian process, if you follow Matemeso, (1) set a linear regression model as a Gaussian distribution, (2) assume that the prior distribution of the weight parameters is a Gaussian distribution, and (3) marginalize the weight parameters to obtain a multivariate Gaussian distribution, which is a typical Gaussian process regression.

## Application to DNase-seq analysis

DNase I-seq is an experimental technology that measures chromatin accessibility in each region of the genome by performing DNA sequencing after DNase I is subjected to genomic DNA.

It is known that when the distribution of leads (or their endpoints) is accumulated with respect to the position of the motif appearance of a transcription factor on the genome, the lead distribution according to the position within the motif is shown. Such a phenomenon is called DNase I footprint, and it is used to estimate transcription factor binding sites from DNase-seq to see if transcription factor binding may occur at places that exhibit a specific lead distribution at such motif appearance locations.

Sherwood+, Nat. Biotechnol., 2014 proposes a method called PIQ, which uses the Gaussian process to estimate transcription factor binding using DNase-seq data. PIQ Deja, the model for generating leads at a position of $i$ relative to the motif appearance position on the genome, is formulated as follows:

First, as an observational model, assume that the number of leads at position $i$ $x_i$ follows the Poisson distribution:

$$x_i \\sim Poisson(\\exp(\\mu_i))$$

At this time, assume that ${\\mu_i}$ follows the Gaussian process:

$$\\mu_i \\sim N(\\mu_0, \\Sigma)$$ $$\\Sigma_{(i,j)} = \\sigma_0 \\cdot k_{|i-j|} $$

Furthermore, at the motif appearance position $m$, we introduced a latent variable $I_m$ (1 if combined, 0 if not) that represents the two states of the transcription factor being bound and unattached, and the value of $I_m$ makes $\\mu_i$ different depending on the value of _m$ It is a mixed model. Then, by estimating the posterior probabilities of $I_m$, we determine whether the transcription factor is bound to $m$.

PIQ is used not only for transcriptional factor binding prediction, but also for the application of the results from time series DNase-seq data to reveal so-called pioneer factors.

## Application to array design

Jokinen+, Bioinformatics, 2018 proposes a pipeline mGPfusion that predicts changes in protein stability due to mutations supplemented by a small number of experimental data and the results of numerous molecular simulations. In this context, we are constructing a prediction model using the Gaussian process.

## Summary

The application of Gaussian processes to bioinformatics was introduced only in an overview.

At the moment, I feel that there are many approaches such as performing variation analysis by Gaussian process regression and estimating unknown (confounder) factors by GPLVM by using kernels that express the similarity of time, genome coordinates, space, pseudo-time, and transcriptome.

In the future, I thought that it might be used in other fields such as single-cell multiomics data and analysis of environmental DNA data sampled at high density in time and space. Also, when the effect of the genotype is not well visible due to the epigenetic effect, I thought that it would be possible to separate the genotype effect by expressing the similarity between samples of DNA methylation in the kernel.

However, there are some papers that do not necessarily answer the question of "Does it have to be a Gaussian process?", so I thought that I should think more deeply about what kind of problem setting is suitable for the Gaussian process (I myself have not yet grasped this point, so I would like to study more).