Data description

We included ten public datasets (i.e., R-bench-1-3, C-bench-1-3, and A-bench-1-4) for grid-search benchmarking of DeepMAPS against existing tools, and an additional five datasets (i.e., R-test-1, C-test-1, and A-test-1-3) for independent tests with optimized parameters. The human PBMC and lung tumor leukocyte CITE-seq data and the 10x lymph node scRNA-seq & scATAC-seq data were used for the two case studies, respectively. All data are publicly available (Supplementary Data 1 and Data Availability).

Data preprocessing and integration

DeepMAPS takes as input the raw count matrices of multiple scRNA-seq (multiple gene expression matrices), CITE-seq (gene and surface protein expression matrices), and scRNA-ATAC-seq (gene expression and chromatin accessibility matrices) data. For each data matrix, rows represent modality features (genes, proteins, or peak regions) and columns represent cells throughout the paper unless otherwise noted. In each data matrix, a row or a column is removed if it contains fewer than 0.1% non-zero values. Data quality control is carried out with Seurat v3, including but not limited to filtering on total read counts, mitochondrial gene ratio, and blacklist ratio. Additional data preprocessing and integration methods are described below.

Multiple scRNA-seq data

Gene expression matrices are log-normalized, and the top 2000 highly variable genes are selected from each matrix using Seurat v3 (ref. 18). If a matrix contains fewer than 2000 genes, all of them are selected for integration. We then apply the widely used canonical correlation analysis (CCA) in Seurat to align these matrices and harmonize the scRNA-seq data, leading to a matrix \(X=\{x_{ij}\,|\,i=1,2,\ldots,I;\ j=1,2,\ldots,J\}\) for \(I\) genes in \(J\) cells.

CITE-seq data

Gene and surface protein expression matrices are log-normalized. The top \(I_1\) highly variable genes (\(I_1\) = 2000) and \(I_2\) proteins (\(I_2\) is the total number of proteins in the matrix) are selected. The two matrices are then concatenated vertically, leading to a matrix \(X=\{x_{ij}\,|\,i=1,2,\ldots,I_1+I_2;\ j=1,2,\ldots,J\}\) for \(I_1\) genes and \(I_2\) proteins in \(J\) cells (which can be treated as \(I_1+I_2=I\) genes in \(J\) cells). A centered log-ratio (CLR) transformation is performed on \(X\) as follows:

$$CLR(x_{ij})=\log\left(1+\frac{x_{ij}}{\exp\left(\frac{\sum_{i\in \mathcal{Z}_{j}}\log\left(1+x_{ij}\right)}{\left|\mathcal{Z}_{j}\right|}\right)}\right)$$

(1)

where \(\mathcal{Z}_{j}\) represents the set of indices of non-zero genes in cell \(j\), and \(|\cdot|\) denotes the number of elements in the set.
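A minimal NumPy sketch of the per-cell CLR transform of Eq. (1), assuming rows are genes/proteins and columns are cells, as in the text:

```python
import numpy as np

def clr_transform(X):
    """Centered log-ratio transform per cell (column), following Eq. (1).

    For each cell j, the geometric mean in the denominator is taken over
    the set Z_j of non-zero features only. A sketch of the paper's formula,
    not the Seurat implementation.
    """
    X = np.asarray(X, dtype=float)
    out = np.zeros_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        nz = col > 0                       # index set Z_j of non-zero features
        if nz.sum() == 0:
            continue                       # all-zero cell: leave as zeros
        log_gmean = np.log1p(col[nz]).sum() / nz.sum()
        out[:, j] = np.log1p(col / np.exp(log_gmean))
    return out
```

All-zero columns are left unchanged here; the paper's filtering step should remove them beforehand.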

scRNA-ATAC-seq data

The gene expression matrix \(X^{R}=\{x_{ij}^{R}\,|\,i=1,2,\ldots,I;\ j=1,2,\ldots,J\}\) with \(I\) genes and \(J\) cells is log-normalized. A left-truncated mixture Gaussian (LTMG) model is then used to provide a qualitative representation of each gene over all cells by modeling how underlying regulatory signals control gene expression in a cell population (ref. 74). Specifically, if gene \(i\) can be represented by \(G_i\) Gaussian distributions over all \(J\) cells, there are potentially \(G_i\) regulatory signals regulating this gene. A matrix \(X^{R'}=\{x_{ij}^{R'}\}\) with the same dimensions as \(X^R\) can then be generated, where gene expressions are labeled by discrete values \(x_{ij}^{R'}\in\{1,2,\ldots,G_i\}\).

The chromatin accessibility matrix is represented as \(X^{A}=\{x_{kj}^{A}\,|\,k=1,2,\ldots,K;\ j=1,2,\ldots,J\}\) for \(K\) peak regions in \(J\) cells. We annotate the peak regions in \(X^A\) to corresponding genes based on the method described in MAESTRO (ref. 24). Specifically, a regulatory potential weight \(w_{ik}\) of peak \(k\) for gene \(i\) is calculated based on the distance of peak \(k\) to gene \(i\) in the genome:

$$w_{ik}=\begin{cases}0, & d_{ik}>150\,\text{kb}\ \text{or peak}\ k\ \text{located in any nearby gene}\\ \frac{1}{Length(exon)}, & \text{peak}\ k\ \text{located in the exon regions of gene}\ i\\ 2^{-\frac{d_{ik}}{d_0}}, & \text{otherwise}\end{cases}$$

(2)

where \(d_{ik}\) is the distance between the center of peak \(k\) and the transcription start site of gene \(i\), and \(d_0\) is the half-decay distance (set to 10 kb). The regulatory potential weight \(w_{ik}\) of peak \(k\) for gene \(i\) is normally calculated as \(2^{-\frac{d_{ik}}{d_0}}\). For peaks with \(d_{ik}>150\,\text{kb}\), \(w_{ik}\) would be less than 0.0005, so we set it to 0 for convenience. In MAESTRO, for peaks located in exon regions, \(d_{ik}\) is treated as 0, so \(w_{ik}\) would be 1 according to the formula; \(w_{ik}\) is instead normalized by the total exon length of gene \(i\). The reason is that, in bulk ATAC-seq data, many highly expressed genes also have ATAC-seq peaks in exon regions, mainly due to transient PolII and other transcriptional machinery bindings. Based on that observation, MAESTRO adds the signals from exon regions to better fit the model to gene expression. However, as reads are more likely to fall in longer exons than in shorter ones, MAESTRO normalizes the total reads on exons by the total exon length of each gene to correct for background reads. Eventually, a regulatory potential score of peak \(k\) for gene \(i\) in cell \(j\) is calculated as \(r_{ik|j}=w_{ik}\times x_{kj}^{A}\). The scATAC-seq matrix \(X^A\) can then be transformed into a gene regulatory potential matrix by summing the regulatory potential scores of all peaks that regulate the same gene:

$$x_{ij}^{A'}=\mathop{\sum}\limits_{k}r_{ik|j},$$

(3)

giving rise to the regulatory potential matrix \(X^{A'}=\{x_{ij}^{A'}\,|\,i=1,2,\ldots,I;\ j=1,2,\ldots,J\}\) for the same \(I\) genes in \(J\) cells as \(X^R\).
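Eqs. (2)-(3) can be sketched as follows; `in_exon` and `exon_length` are hypothetical stand-ins for the genome annotation lookup, which is not reproduced here:

```python
import numpy as np

def peak_gene_weight(d_ik, in_exon=False, exon_length=None, d0=10_000):
    """Regulatory potential weight of a peak for a gene (Eq. 2, sketch).

    d_ik: distance (bp) between the peak center and the gene's TSS.
    in_exon / exon_length: hypothetical flags standing in for the
    annotation step; for exon peaks, the weight is 1 / total exon length.
    """
    if in_exon:
        return 1.0 / exon_length
    if d_ik > 150_000:
        return 0.0                    # weight would be < 0.0005, truncated to 0
    return 2.0 ** (-d_ik / d0)        # exponential decay with half-decay d0

def regulatory_potential(W, X_A):
    """Gene x cell regulatory potential matrix X^{A'} (Eq. 3):
    x'_{ij} = sum_k w_{ik} * x^A_{kj}, i.e. the genes x peaks weight
    matrix times the peaks x cells accessibility matrix."""
    return W @ X_A
```

With `d0 = 10 kb`, a peak exactly one half-decay away contributes half the weight of a peak at the TSS.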

We assume that the activity of a gene in a cell is determined by gene expression and gene regulatory activity with different contributions. Unlike Seurat v4 (weighted nearest neighbor; ref. 5), where the contribution weights are determined directly from the expression and chromatin accessibility values, we hypothesize that the relative contribution of the expression and chromatin accessibility of a gene to a cell is dynamic rather than static and cannot be accurately determined from a single snapshot of the cell. RNA velocity is determined by the abundance of unspliced and spliced mRNA in a cell. The amount of unspliced mRNA is determined by gene regulation and the gene transcription rate, and the amount of spliced mRNA is determined by the difference between unspliced and degraded mRNA. We reasoned that genes with positive RNA velocities in a cell have a higher potential to drive transcription. Thus, their regulatory activity related to chromatin accessibility has a greater influence than gene expression in defining the overall transcriptional activity of the cell at the current snapshot. For genes with negative velocities, the transcription rate tends to decelerate; hence chromatin accessibility has less influence on transcriptional activity than gene expression.

A velocity matrix \(X^{V}=\{x_{ij}^{V}\,|\,i=1,2,\ldots,I;\ j=1,2,\ldots,J\}\) is generated using scVelo with default parameters (ref. 75). Considering that some genes may fail to obtain valid velocity or regulatory potential values, we simultaneously remove the genes with all-zero rows in \(X^{A'}\) or \(X^{V}\) from the four matrices \(X^{R}, X^{R'}, X^{A'}, X^{V}\). Without loss of generality, we still use \(I\) and \(J\) to represent the sizes of these new matrices. Furthermore, considering the potential bias when interpreting the velocity of a gene in a cell, we use the LTMG representations \(x_{ij}^{R'}\in\{1,2,\ldots,G_i\}\) to discretize \(x_{ij}^{V}\). For gene \(i\), let \(\mathcal{J}_g\) be the set of cells in which gene \(i\) has the same LTMG signal \(g\in\{1,2,\ldots,G_i\}\). For the cells in \(\mathcal{J}_g\), we replace the original velocities of gene \(i\) with their mean. To calculate a velocity weight \(\beta\) of gene \(i\) in cell \(j\), we first extract \(X_i^{V}=(x_{i1}^{V}, x_{i2}^{V},\ldots,x_{iJ}^{V})\) for the velocity of gene \(i\) in all cells and \(X_j^{V}=(x_{1j}^{V}, x_{2j}^{V},\ldots,x_{Ij}^{V})\) for the velocity of all genes in cell \(j\). Then, for \(X_i^{V}\), let \(\mathcal{X}_i^{V+}=\{x_{ij}^{V}\,|\,x_{ij}^{V}>0,\ j=1,2,\ldots,J\}\) be the set of all positive velocities of gene \(i\) and \(\mathcal{X}_i^{V-}=\{x_{ij}^{V}\,|\,x_{ij}^{V}<0,\ j=1,2,\ldots,J\}\) the set of all negative velocities of gene \(i\). Similarly, for \(X_j^{V}\), let \(\mathcal{X}_j^{V+}=\{x_{ij}^{V}\,|\,x_{ij}^{V}>0,\ i=1,2,\ldots,I\}\) be the set of all positive velocities in cell \(j\) and \(\mathcal{X}_j^{V-}=\{x_{ij}^{V}\,|\,x_{ij}^{V}<0,\ i=1,2,\ldots,I\}\) the set of all negative velocities in cell \(j\). For \(x_{ij}^{V}>0\), rank \(\mathcal{X}_i^{V+}\) and \(\mathcal{X}_j^{V+}\) from high to low based on velocity values, with ranks starting from 1, and calculate the velocity weight as:

$${\beta}^{+}=\sqrt{\left(|\mathcal{X}_i^{V+}|-a-1\right)^{2}+\left(|\mathcal{X}_j^{V+}|-b-1\right)^{2}}$$

(4)

where \(a\) is the rank of \(x_{ij}^{V}\) in \(\mathcal{X}_i^{V+}\), and \(b\) is the rank of \(x_{ij}^{V}\) in \(\mathcal{X}_j^{V+}\).

Similarly, for \(x_{ij}^{V}<0\), rank \(\mathcal{X}_i^{V-}\) and \(\mathcal{X}_j^{V-}\) from high to low based on the absolute values of the velocities, with ranks starting from 0, and calculate the velocity weight as:

$${\beta}^{-}=\sqrt{\left(|\mathcal{X}_i^{V-}|-a-1\right)^{2}+\left(|\mathcal{X}_j^{V-}|-b-1\right)^{2}}$$

(5)

where \(a\) is the rank of \({x}_{{ij}}^{V}\) in \({{{{{{\mathcal{X}}}}}}}_{i}^{V-}\) and \(b\) is the rank of \({x}_{{ij}}^{V}\) in \({{{{{{\mathcal{X}}}}}}}_{j}^{V-}\).

We now generate a gene activity matrix \({X}^{G}=\{{x}_{{ij}}^{G}\}\), integrating gene expression and chromatin accessibility based on the velocity weight. \({x}_{{ij}}^{G}\) is the gene activity score (GAS) of gene \(i\) in cell \(j\):

$$x_{ij}^{G}=\begin{cases}x_{ij}^{R}+\left(1+\beta^{+}\right)x_{ij}^{A'}, & \text{for}\ x_{ij}^{V}>0\\ x_{ij}^{R}+\left(1-\beta^{-}\right)x_{ij}^{A'}, & \text{for}\ x_{ij}^{V}<0\\ x_{ij}^{R}+x_{ij}^{A'}, & \text{for}\ x_{ij}^{V}=0\end{cases}$$

(6)
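The rank-based velocity weight and the GAS combination can be sketched as below. This is a simplified version: the LTMG-based velocity smoothing is omitted, and the positive-velocity case (Eq. 4, ranks starting from 1) is shown:

```python
import numpy as np

def velocity_weight_pos(v, i, j):
    """beta+ of gene i in cell j (Eq. 4, sketch). Assumes v[i, j] > 0
    and that LTMG smoothing has already been applied to v."""
    row_pos = v[i, v[i, :] > 0]              # X_i^{V+}: positive velocities of gene i
    col_pos = v[v[:, j] > 0, j]              # X_j^{V+}: positive velocities in cell j
    a = int(np.sum(row_pos > v[i, j])) + 1   # rank high-to-low, starting from 1
    b = int(np.sum(col_pos > v[i, j])) + 1
    return np.sqrt((row_pos.size - a - 1) ** 2 + (col_pos.size - b - 1) ** 2)

def gene_activity_score(x_r, x_a, v_ij, beta_pos, beta_neg):
    """GAS of Eq. (6): chromatin accessibility (x_a = x^{A'}) is up- or
    down-weighted by the velocity weight, depending on the velocity sign."""
    if v_ij > 0:
        return x_r + (1 + beta_pos) * x_a
    if v_ij < 0:
        return x_r + (1 - beta_neg) * x_a
    return x_r + x_a
```

Ties in the ranking are resolved arbitrarily here; the paper does not specify a tie-breaking rule.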

Construction of gene-cell heterogeneous graph

To simplify notation, we now redefine any integrative matrix generated in the previous section as \(X=\{x_{ij}\,|\,i=1,2,\ldots,I;\ j=1,2,\ldots,J\}\) with \(I\) genes and \(J\) cells. \(x_{ij}\) represents either the normalized expression (for multiple scRNA-seq and CITE-seq) or the GAS (for scRNA-ATAC-seq) of gene \(i\) in cell \(j\). We use two autoencoders to generate the initial embeddings for cells and genes, respectively. The cell autoencoder reduces the dimension of each cell from \(I\) (genes) to 512 and eventually to 256; the gene autoencoder reduces the dimension of each gene from \(J\) (cells) to 512 and then 256. As a result, each cell and each gene have an initial embedding of the same 256 dimensions. The number of lower dimensions is optimized as a hyperparameter that differs for each dataset. The output layer is a reconstructed matrix \(\hat{X}\) with the same dimensions as \(X\). The loss function of the cell autoencoder is the mean squared error (MSE) of \(X\) and \(\hat{X}\):

$${loss}={MSE}(X,\,\hat{X})=\mathop{\sum}\limits_{i,j}\left(x_{ij}-\hat{x}_{ij}\right)^{2}$$

(7)

The gene autoencoder learns low-dimensional features of genes from all cells; it has an encoder, latent space, and decoder similar to the cell autoencoder, while its input \(X^{T}\) is the transpose of \(X\). The loss function of the gene autoencoder is

$${loss}={MSE}\left(X^{T},\,\hat{X^{T}}\right)=\mathop{\sum}\limits_{i,j}\left(x_{ij}^{T}-\hat{x}_{ij}^{T}\right)^{2},$$

(8)

where \(\hat{{X}^{T}}\) is the reconstructed matrix in the output layer with the same dimensions as \({X}^{T}\).
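The two autoencoders share the same shape, differing only in which axis they compress. A minimal NumPy sketch of the encoder half (I → 512 → 256) and the summed-squared-error loss of Eqs. (7)-(8); the weights here are random stand-ins, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_encoder(dims):
    """Random linear layers for an encoder, e.g. dims = [I, 512, 256].
    A sketch: the paper's autoencoders are trained, these weights are not."""
    return [rng.standard_normal((d_in, d_out)) * 0.01
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def encode(x, layers):
    """Forward pass: linear layer followed by ReLU at each step."""
    for W in layers:
        x = np.maximum(x @ W, 0.0)
    return x

def mse_loss(X, X_hat):
    """Reconstruction loss of Eqs. (7)-(8): summed squared error."""
    return float(np.sum((X - X_hat) ** 2))

# Cell autoencoder: each of J cells goes from I dimensions to 256.
# (Cells are rows here for convenience; the text stores them as columns.)
I, J = 1000, 40
X = np.abs(rng.standard_normal((J, I)))
cell_embed = encode(X, init_encoder([I, 512, 256]))
```

The gene autoencoder is identical with `X.T` as input and `[J, 512, 256]` as the dimension list.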

Definition 1 (Heterogeneous graph): A heterogeneous graph is a graph with multiple types of nodes and/or multiple types of edges. We denote a heterogeneous graph as \(G=(V,\, E,\, A,\, R)\), where \(V\) represents nodes, \(E\) represents edges, \(A\) represents the node type union, and \(R\) represents the edge type union.

Definition 2 (Node type and edge type mapping function): We define \(\tau \left(v\right):V\to A\) and \(\phi \left(e\right):E\to R\) as the mapping function for node types and edge types, respectively.

Definition 3 (Node meta relation): For a node pair \(v_1\) and \(v_2\) linked by an edge \(e_{1,2}\), the meta relation between \(v_1\) and \(v_2\) is denoted as \(\langle \tau(v_1),\phi(e_{1,2}),\tau(v_2)\rangle\).

Given the integrated matrix \(X\), we construct a bipartite gene-cell heterogeneous graph \(G\) with two node types (cell and gene) and one edge type (gene-cell edge). \(V=V^{C}\cup V^{G}\), where \(V^{G}=\{v_i^{G}\,|\,i=1,2,\ldots,I\}\) denotes all genes and \(V^{C}=\{v_j^{C}\,|\,j=1,2,\ldots,J\}\) denotes all cells. \(E=\{e_{i,j}\}\) represents the edges between \(v_i^{G}\) and \(v_j^{C}\). For \(x_{ij}>0\), the weight of the corresponding edge is \(\omega(e_{i,j})=1\); otherwise, \(\omega(e_{i,j})=0\).

Joint embedding via a heterogeneous graph transformer

We propose an unsupervised HGT framework (refs. 12,13) to learn graph embeddings of all nodes and to mine relationships between genes and cells. The input of HGT is the integrated matrix \(X\), and the outputs are the embeddings of cells and genes, together with attention scores representing the importance of genes to cells.

Definition 4 (Target node and source node): A node in \({{{{{\bf{V}}}}}}\) is considered as a target node, represented as \({v}_{t}\), when performing HGT to aggregate information and update embeddings of this node. A node is considered as a source node, represented as \({v}_{s},{v}_{s}\,\ne\, {v}_{t}\), if there is an edge between \({v}_{s}\) and \({v}_{t}\) in \(E\), denoted as \({e}_{s,t}\) for convenience.

Definition 5 (Neighborhood graph of target node): A neighborhood graph of a target node \(v_t\) is induced from \(G\) and denoted as \(G'=(V', E', A', R')\), where \(V'=\{v_t\}\cup\mathcal{N}(v_t)\), \(\mathcal{N}(v_t)\) is the complete set of neighbors of \(v_t\), and \(E'=\{e_{i,j}\in E\,|\,v_i,v_j\in V'\}\); \(A'\) marks the target and source node types, and \(R'\) represents the target-source edge type. \(e_{s,t}\in E'\) represents the edge between \(v_s\) and \(v_t\). As only one edge type is included in \(G\), the node meta relation of \(v_s\) and \(v_t\) is denoted as \(\langle \tau(v_s),\phi(e_{s,t}),\tau(v_t)\rangle\).

  1. Multi-head attention mechanism and linear mapping of vectors.

    Let \(\mathcal{H}^{l}\) denote the embedding of the \(l^{th}\) HGT layer (\(l=1,2,\ldots,L\)). The embeddings of \(v_t\) and \(v_s\) on the \(l^{th}\) layer are denoted as \(\mathcal{H}^{l}[v_t]\) and \(\mathcal{H}^{l}[v_s]\). A multi-head mechanism is applied to equally divide both \(\mathcal{H}^{l}[v_t]\) and \(\mathcal{H}^{l}[v_s]\) into \(H\) heads. Multi-head attention allows the model to jointly attend to information from different embedding subspaces, and each head can run through an attention mechanism in parallel to reduce computational time.

    For the \(h^{th}\) head in the \(l^{th}\) HGT layer, \(\mathcal{H}^{l}[v_t]\) is updated from \(\mathcal{H}^{l-1}[v_t]\) and \(\mathcal{H}^{l-1}[v_s]\); \(\mathcal{H}^{0}[v_t]\) and \(\mathcal{H}^{0}[v_s]\) are the initial embeddings of \(v_t\) and \(v_s\), respectively. Three linear projection functions are applied to map node embeddings into the \(h^{th}\) vectors. Specifically, the \(Q\_{linear}_{\tau(v_t)}^{h}\) function maps \(v_t\) into the \(h^{th}\) query vector \(Q^{h}(v_t)\), with dimension \(\mathbb{R}^{d}\to\mathbb{R}^{\frac{d}{H}}\), where \(d\) is the dimension of \(\mathcal{H}^{l-1}[v_t]\) and \(\frac{d}{H}\) is the vector dimension per head. Similarly, the \(K\_{linear}_{\tau(v_s)}^{h}\) and \(V\_{linear}_{\tau(v_s)}^{h}\) functions map the source node \(v_s\) into the \(h^{th}\) key vector \(K^{h}(v_s)\) and the \(h^{th}\) value vector \(V^{h}(v_s)\):

    $${Q}^{h}\left({v}_{t}\right)={Q{{{{{\rm{\_}}}}}}{linear}}_{\tau \left({v}_{t}\right)}^{h}\left({{{{{{\mathcal{H}}}}}}}^{\left(l-1\right)}\left[{v}_{t}\right]\right)$$

    (9)

    $${K}^{h}\left({v}_{s}\right)={K{{{{{\rm{\_}}}}}}{linear}}_{\tau \left({v}_{s}\right)}^{h}\left({{{{{{\mathcal{H}}}}}}}^{\left(l-1\right)}\left[{v}_{s}\right]\right)$$

    (10)

    $${V}^{h}\left({v}_{s}\right)={V{{{{{\rm{\_}}}}}}{linear}}_{\tau \left({v}_{s}\right)}^{h}\left({{{{{{\mathcal{H}}}}}}}^{\left(l-1\right)}\left[{v}_{s}\right]\right)$$

    (11)

    Each type of node has a unique linear projection to maximally model the distribution differences.

  2. Heterogeneous mutual attention

    To calculate the mutual attention between \({v}_{t}\) and \({v}_{s}\), we introduce the Attention operator which estimates the importance of each \({v}_{s}\) to \({v}_{t}\):

    $${Attention}\left({v}_{s},\, {e}_{s,\, t},\,{v}_{t}\right)=\mathop{{{{{{\rm{Softmax}}}}}}}\limits_{\forall v{{\in }}{{{{{\mathscr{N}}}}}}\left({v}_{t}\right)}\left(\mathop{\parallel }\limits_{H}\left({{ATT}{{\_}}{head}}^{h}\left({v}_{s},\, {e}_{s,\,t},\,{v}_{t}\right)\right)\right)$$

    (12)

    The attention function can be described as mapping a query vector and a set of key-value pairs to an output for each node pair \(e=(v_s, v_t)\). The overall attention of \(v_t\) and \(v_s\) is the concatenation of the attention weights in all heads, followed by a softmax function; \(\mathop{\parallel}\limits_{H}(\cdot)\) is the concatenation function. The \({ATT\_head}^{h}(v_s, e_{s,t}, v_t)\) term is the \(h^{th}\) head attention weight between \(v_t\) and \(v_s\), calculated as:

    $${{ATT}{{{{{\rm{\_}}}}}}{head}}^{h}\left({v}_{s},\, {e}_{s,\, t},\,{v}_{t}\right)=\left({K}^{h}\left({v}_{s}\right){W}_{\phi \left({e}_{s,t}\right)}^{{ATT}}{Q}^{h}{\left({v}_{t}\right)}^{T}\right)\\ \cdot \frac{\mu \left\langle \tau \left({v}_{s}\right),\phi \left({e}_{s,\, t}\right),\,\tau \left({v}_{t}\right)\right\rangle }{\sqrt{d}},$$

    (13)

    where the similarity between the queries and keys is measured; \(W_{\phi(e_{s,t})}^{ATT}\in\mathbb{R}^{\frac{d}{H}\times\frac{d}{H}}\) is a transformation matrix to capture meta-relation features, \((\cdot)^{T}\) is the transpose function, and \(\mu\) is a prior tensor denoting the significance of each node meta relation \(\langle \tau(v_s),\phi(e_{s,t}),\tau(v_t)\rangle\), serving as an adaptive scaling of the attention. The concatenation of the attention heads yields the attention coefficients between \(v_s\) and \(v_t\), followed by the softmax function in Eq. 12.

  3. Heterogeneous message passing

    A Message operator is used to extract the message of \({v}_{s}\) that can be passed to \({v}_{t}\). The multi-head \({Message}\) is defined by:

    $${Message}\left({v}_{s},\,{e}_{s,\,t},\,{v}_{t}\right)=\mathop{\parallel }\limits_{H}\left({{MSG}{{{{{\rm{\_}}}}}}{head}}^{h}\left({v}_{s},\,{e}_{s,\,t},\,{v}_{t}\right)\right)$$

    (14)

    The \({h}^{{th}}\) head message MSG_headh\(\left({v}_{s},\,{e}_{s,\,t},\,{v}_{t}\right)\) for each edge \(\left({v}_{s},{v}_{t}\right)\) is defined as:

    $${{MSG}{{{{{\rm{\_}}}}}}{head}}^{h}\left({v}_{s},\,{e}_{s,\,t},\,{v}_{t}\right)={V}^{h}\left({v}_{s}\right){W}_{\phi \left({e}_{s,t}\right)}^{{MSG}}$$

    (15)

    where each source node \(v_s\) in head \(h\) is mapped into a message vector by a linear projection \(V^{h}(v_s):\mathbb{R}^{d}\to\mathbb{R}^{\frac{d}{H}}\). \(W_{\phi(e_{s,t})}^{MSG}\in\mathbb{R}^{\frac{d}{H}\times\frac{d}{H}}\) is a transformation matrix similar to \(W_{\phi(e_{s,t})}^{ATT}\).

  4. Target-specific aggregation

    To update the embedding of \(v_t\), the final step in the \(l^{th}\) HGT layer is to aggregate the neighbor information obtained in this layer, \(\widetilde{\mathcal{H}^{l}}[v_t]\), and combine it with the previous target node embedding \(\mathcal{H}^{l-1}[v_t]\):

    $$\widetilde{{{{{{{\mathcal{H}}}}}}}^{l}}\left[{v}_{t}\right]=\mathop{{Aggregate}}\limits_{\forall {v}_{s}{{\in }}{{{{{\mathscr{N}}}}}}\left({v}_{t}\right)}\left({Attention}\left({v}_{s},\,{e}_{s,\, t},\,{v}_{t}\right)\cdot {Message}\left({v}_{s},\,{e}_{s,\,t},\,{v}_{t}\right)\right)$$

    (16)

    $${{{{{{\mathcal{H}}}}}}}^{l}\left[{v}_{t}\right]=\theta \left({ReLU}(\widetilde{{{{{{{\mathcal{H}}}}}}}^{l}}\left[{v}_{t}\right])\right)+\left(\theta -1\right){{{{{{\mathcal{H}}}}}}}^{l-1}\left[{v}_{t}\right],$$

    (17)

    where \(\theta\) is a trainable parameter and \({ReLU}\) is the activation function. The final embedding of \({v}_{t}\) is obtained by stacking information via all \(L\) HGT layers, and \(L\) is set to be 2 in DeepMAPS.

  5. Determination of gene-to-cell attention

    We extract the final attention score \({{{{{{\mathcal{a}}}}}}}_{i,j}\) of gene \(i\) to cell \(j\) from the last HGT layer after the HGT process is complete:

    $${{{{{{\mathcal{a}}}}}}}_{i,j}=\sqrt{\mathop{\sum}\limits_{h}{{{ATT}\_{head}}^{h}\left(i,\, j\right)}^{2}}$$

    (18)
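The four steps above (projection, attention, message passing, aggregation) can be sketched for a single target node as follows. The projection matrices are random stand-ins for the trained per-type linear layers, the single meta-relation prior \(\mu\) is a scalar, and the gated combination keeps the \((\theta-1)\) coefficient exactly as written in Eq. (17):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hgt_update(h_t, H_s, Wq, Wk, Wv, W_att, W_msg, mu, theta, H=4):
    """One simplified HGT update of a target embedding (Eqs. 9-17, sketch).

    h_t: (d,) target embedding; H_s: (n, d) source (neighbor) embeddings.
    Wq/Wk/Wv: (d, d) per-node-type projections, split into H heads;
    W_att/W_msg: (d/H, d/H) meta-relation matrices; mu: scalar prior.
    """
    d = h_t.size
    dh = d // H
    q = (h_t @ Wq).reshape(H, dh)             # query heads of v_t  (Eq. 9)
    K = (H_s @ Wk).reshape(-1, H, dh)         # key heads of each v_s (Eq. 10)
    V = (H_s @ Wv).reshape(-1, H, dh)         # value heads of each v_s (Eq. 11)
    # Per-head attention (Eq. 13), softmax over the neighbors (Eq. 12):
    att = np.stack([softmax((K[:, h] @ W_att) @ q[h] * mu / np.sqrt(d))
                    for h in range(H)], axis=1)           # shape (n, H)
    msg = V @ W_msg                                       # messages (Eq. 15)
    # Attention-weighted aggregation over neighbors, concatenated over heads (Eq. 16):
    agg = np.concatenate([att[:, h] @ msg[:, h] for h in range(H)])
    # Gated residual update, (theta - 1) as written in Eq. (17):
    return theta * np.maximum(agg, 0.0) + (theta - 1) * h_t
```

Stacking \(L=2\) such layers, as in DeepMAPS, means calling `hgt_update` twice with layer-specific parameters.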

HGT training on subgraphs

To improve the efficiency and capability of the HGT model on a giant heterogeneous graph (tens of thousands of nodes and millions of edges), we deploy a modified HGSampling method for subgraph selection and HGT training on multiple mini-batches (ref. 12). For the graph \(G\) with \(I\) genes and \(J\) cells, the union of subgraphs should cover \(a\)% (set to 30%) of the gene and cell nodes to ensure training power. As such, the sampler constructs a number of small subgraphs (50 in DeepMAPS) from the given heterogeneous graph \(G\) and feeds the subgraphs into the HGT model in different batches using multiple GPUs. Each subgraph should include \(a\%\times I/50\) genes and \(a\%\times J/50\) cells. Taking a cell \(j\) as a target node \(v_t\) and its neighbors \(v_s\in\mathcal{N}(v_t)\), corresponding to genes \(i\), as source nodes, we calculate the sampling probability of the edge \(e_{s,t}\) as:

$${Prob}\left({e}_{s,t}\right)=\frac{x({v}_{s},\, {v}_{t})}{{\sum }_{v\in {{{{{\mathscr{N}}}}}}\left({v}_{t}\right)}x(v,\,{v}_{t})},$$

(19)

where \(x(v_s,v_t)=x_{ij}\) refers to the expression or GAS value of gene \(i\) in cell \(j\) in the integrated matrix \(X\). Thus, for each target node \(v_t\), we randomly select \(\frac{a\%\times I/50}{a\%\times J/50}\) neighbor genes for \(v_t\) based on the sampling probability \({Prob}(e_{s,t})\). HGT parameters, such as \(W_{\phi(e_{s,t})}^{ATT}\), \(W_{\phi(e_{s,t})}^{MSG}\), and \(\theta\), are trained and inherited sequentially from subgraphs 1 to 50 in one epoch. The subgraph training is performed in an unsupervised way with a graph autoencoder (GAE): the HGT is the encoder layer, and the inner product of the embeddings is the decoder layer. We calculate the loss function of the GAE as the Kullback-Leibler (KL) divergence between the reconstructed matrix \(\hat{X}\) and the integrated matrix \(X\):

$${loss}={KL}\left({sof\!tmax}(\hat{X}),\,{sof\!tmax}(X)\right)$$

(20)

The subgraph training is complete when the loss converges or reaches 100 epochs, whichever happens first.
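The value-proportional sampling of Eq. (19) and the KL loss of Eq. (20) can be sketched as below. The softmax axis is not specified in the text; a row-wise softmax is assumed here:

```python
import numpy as np

def edge_sampling_probs(x_col):
    """Sampling probabilities of the gene neighbors of one target cell,
    Eq. (19): proportional to each gene's expression/GAS value in that cell."""
    x_col = np.asarray(x_col, dtype=float)
    return x_col / x_col.sum()

def gae_kl_loss(X_hat, X):
    """GAE loss of Eq. (20): KL divergence between the softmaxes of the
    reconstructed and integrated matrices (row-wise softmax is an assumption)."""
    def softmax_rows(M):
        e = np.exp(M - M.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    P = softmax_rows(X_hat)               # reconstructed, as written in Eq. (20)
    Q = softmax_rows(X)                   # integrated matrix
    return float(np.sum(P * np.log(P / Q)))
```

The loss is zero when \(\hat{X}=X\) and strictly positive otherwise, so minimizing it drives the inner-product decoder toward the integrated matrix.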

Determination of active gene modules in cell clusters

Predict cell clusters

We deploy Louvain clustering (Seurat v3) to predict cell clusters based on the cell embeddings \(\mathcal{H}^{L}[v_c]\) generated from the final HGT layer. The resolution of Louvain clustering is determined by a grid-search test over multiple HGT hyperparameter combinations, and we set a clustering resolution of 0.4 as the default.

Identify cell cluster-active gene association network

We use an SFP model (ref. 17) to select genes that highly contribute to cell cluster characterization and to construct cell cluster-active gene association networks. Define a new heterogeneous graph \(\widetilde{G}=(V,\widetilde{E})\), \(V\in V^{G}\cup V^{C}\), \(\widetilde{E}\in\widetilde{E}^{1}\cup\widetilde{E}^{2}\), where \(\widetilde{E}^{1}\) represents gene-gene relations and \(\widetilde{E}^{2}\) represents gene-cell relations. The weight \(\omega(\widetilde{e}_{i_1,i_2}^{1})\) of the edge between \(v_{i_1}^{G}\in V^{G}\) and \(v_{i_2}^{G}\in V^{G}\) is the Pearson's correlation of the HGT embeddings of \(v_{i_1}^{G}\) and \(v_{i_2}^{G}\). The weight \(\omega(\widetilde{e}_{i,j}^{2})\) of the edge between \(v_i^{G}\in V^{G}\) and \(v_j^{C}\in V^{C}\) is the final attention score \({{{{{{\mathcal{a}}}}}}}_{i,j}\). Only edges with \(\omega(\widetilde{e}_{i_1,i_2}^{1})>0.5\) and \(\omega(\widetilde{e}_{i,j}^{2})>\mu({{{{{{\mathcal{a}}}}}}}_{i,j})+{sd}({{{{{{\mathcal{a}}}}}}}_{i,j})\), where \(\mu(\cdot)\) is the mean and \({sd}(\cdot)\) the standard deviation of \({{{{{{\mathcal{a}}}}}}}_{i,j}\), are kept within a cell cluster. The weights of the remaining edges are then max-min normalized such that the edge with the largest weight is rescaled to 0 and the edge with the smallest weight to 1 (so that the minimization below favors strongly weighted edges).

Let \(Z\) be the number of clusters predicted via Louvain clustering, and let \(V^{C[z]}=\{v^{C[z]}\}\) be the node set corresponding to the cells in cluster \(z=1,2,\ldots,Z\). We then formulate this problem using the combinatorial optimization model defined below:

$$\mathop{{{\min }}}\limits_{{\widetilde{E}}^{{{{{{\mathcal{L}}}}}}}\subseteq {\widetilde{E}}^{1}\cup {\widetilde{E}}^{2}}\mathop{\sum}\limits_{e\in {\widetilde{E}}^{{{{{{\mathcal{L}}}}}}}}\omega \left(e\right)$$

s.t.

$${{{{{\mathcal{L}}}}}}\left({v}_{{j}_{1}}^{C},\,{v}_{{j}_{2}}^{C}\right)=1,\,\forall {v}_{{j}_{1}}^{C},\,{v}_{{j}_{2}}^{C}\in {V}^{C\left[z\right]},\,z=1,2,\ldots,Z$$

(21)

where \(\mathcal{L}(v_{j_1}^{C},v_{j_2}^{C})\) is a binary indicator function representing whether the two cell nodes \(v_{j_1}^{C}\) and \(v_{j_2}^{C}\) can be connected (1) or not (0) in \(\widetilde{G}\) via a path \(\widetilde{E}_{j_1,j_2}^{\mathcal{L}}=\{\widetilde{e}_{i_1,j_1}^{2},\,\widetilde{e}_{i_1,i_2}^{1},\,\widetilde{e}_{i_2,i_3}^{1},\ldots,\widetilde{e}_{i_{t-1},i_t}^{1},\,\widetilde{e}_{i_t,j_2}^{2}\}\). Denote \(\widetilde{E}^{\mathcal{L}}=\{\widetilde{E}_{j_1,j_2}^{\mathcal{L}}\}\) as the complete collection of such paths connecting \(v_{j_1}^{C}\) and \(v_{j_2}^{C}\). The combinatorial optimization model aims to identify the paths connecting \(v_{j_1}^{C}\) and \(v_{j_2}^{C}\) with the minimum summed edge weight. We consider the gene networks remaining in the SFP result of cluster \(z\) as the cluster-active gene association networks.
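Each cell pair in Eq. (21) asks for a minimum-weight path through gene nodes on \(\widetilde{G}\). A single cell-pair subproblem reduces to Dijkstra's algorithm, sketched below in plain Python; the full SFP model solves all pairs of a cluster jointly, which this sketch does not:

```python
import heapq
from collections import defaultdict

def min_weight_path(edges, src, dst):
    """Dijkstra shortest path on the weighted graph G~, standing in for one
    cell-pair subproblem of Eq. (21). `edges` is a list of (u, v, w) tuples
    with the (inverted) normalized weights; returns (path, total weight)."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))                 # undirected graph
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                          # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]
```

Because the weights were inverted (largest weight rescaled to 0), the minimum-weight path preferentially travels through highly correlated gene-gene edges and high-attention gene-cell edges.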

Construct GRNs from scRNA-ATAC-seq data

For the genes in a cell cluster-active gene association network resulting from SFP, a set of TFs \(q=1,2,\ldots,Q\) can then be assigned to genes. TF-peak relations are retrieved by aligning TF binding sites with the peak regions in the scATAC-seq data, and peak-gene relations were established previously when calculating the regulatory potential scores \(r_{ik|j}\) (Eq. 3). We design a regulatory intensity (RI) score \({{{{{{\mathcal{s}}}}}}}_{ij|q}\) to quantify the intensity of TF \(q\) in regulating gene \(i\) in cell \(j\):

$${{{{{{\mathcal{s}}}}}}}_{{ij}{{{{{\rm{|}}}}}}q}=\mathop{\sum}\limits_{k}{b}_{{qk}}^{A}\cdot {r}_{{ik}{{{{{\rm{|}}}}}}j}$$

(22)

where \(b_{qk}^{A}\) is the binding affinity score of TF \(q\) to peak \(k\). The binding affinity score is calculated in three steps: (a) we retrieve the genome browser track file from JASPAR, which stores all known binding sites of each TF. In JASPAR, a p-value score is calculated as \(-\log_{10}(p)\times 100\), where 0 corresponds to a p-value of 1 and 1000 corresponds to a p-value \(<10^{-10}\); we remove TF binding sites with p-value scores smaller than 500. (b) A TF binding site is kept if it overlaps with any peak region in the scATAC-seq profile; otherwise, it is removed. (c) The corresponding p-value score is divided by 100. We define a gene set regulated by the same TF as a regulon.
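The RI score of Eq. (22) and the three-step binding affinity filter can be sketched as:

```python
import numpy as np

def binding_affinity(jaspar_score, overlaps_peak):
    """Binding affinity b_qk per the three steps: keep sites with a JASPAR
    p-value score >= 500 that overlap a peak, then divide the score by 100."""
    if jaspar_score < 500 or not overlaps_peak:
        return 0.0
    return jaspar_score / 100.0

def regulatory_intensity(b_q, R_j):
    """RI score of Eq. (22): s_{ij|q} = sum_k b_qk * r_{ik|j}.

    b_q: (K,) binding affinities of TF q to each peak; R_j: (I, K) regulatory
    potential scores of the K peaks for the I genes in one cell j.
    Returns one RI value per gene for this TF and cell."""
    return R_j @ b_q
```

A JASPAR score of 500 corresponds to a binding-site p-value of \(10^{-5}\), so the filter keeps only confidently called sites.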

We calculate a regulon activity score (RAS) \({{{{{\mathcal{r}}}}}}\left(q,z\right)\) of a regulon with genes regulated by TF \(q\) in cell cluster \(z\) as:

$${{{{{\mathcal{r}}}}}}\left(q,\,z\right)=\frac{\mathop{\sum}\limits_{i\in {I}_{q}}\mathop{\sum}\limits_{j\in C\left[z\right]}{x}_{{ij}}\cdot {{{{{{\mathcal{s}}}}}}}_{{ij}{{{{{\rm{|}}}}}}q}}{I\cdot J}$$

(23)

where \({I}_{q}\) denotes the set of genes regulated by TF \(q\) in cell cluster \(z\). We used the Wilcoxon rank-sum test to identify differentially active regulons based on RAS. If the BH-adjusted p-value of a regulon between cell clusters is less than 0.05 and its log fold change is larger than 0.10, we consider the regulon differentially active in the cluster and define it as a cell-type-specific regulon (CTSR).
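A hedged sketch of Eq. 23 and the downstream differential-activity test on synthetic inputs; the regulon membership `Iq` and the per-cell activity fed to the Wilcoxon rank-sum test are illustrative assumptions, and the BH adjustment across clusters is omitted for brevity.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
I, J = 6, 40                      # genes x cells (synthetic)
x = rng.random((I, J))            # gene expression matrix
s = rng.random((I, J))            # RI scores for one TF q (from Eq. 22)
clusters = rng.integers(0, 2, J)  # two toy cell clusters
Iq = [0, 2, 4]                    # genes regulated by TF q (hypothetical regulon)

def ras(z):
    """Eq. 23: summed weighted expression of the regulon in cluster z / (I * J)."""
    cells = np.flatnonzero(clusters == z)
    return (x[np.ix_(Iq, cells)] * s[np.ix_(Iq, cells)]).sum() / (I * J)

# Per-cell regulon activity, compared between clusters with the Wilcoxon test.
per_cell = (x[Iq, :] * s[Iq, :]).sum(axis=0)
stat, p = ranksums(per_cell[clusters == 0], per_cell[clusters == 1])
print(ras(0), ras(1), p)
```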

A GRN in a cell cluster is constructed by merging the regulons of that cluster. The eigenvector centrality (\({{{{{{\mathcal{c}}}}}}}_{v}\)) of a TF node \(v\) in the GRN is defined as:

$${{{{{{\mathcal{c}}}}}}}_{v}={\alpha }_{\max }\left(v\right)$$

(24)

where \({\alpha }_{\max }\) is the eigenvector corresponding to the largest eigenvalue of the weighted adjacency matrix of the GRN, and \({\alpha }_{\max }\left(v\right)\) is its entry for node \(v\). TFs with the highest \({{{{{{\mathcal{c}}}}}}}_{v}\) ranks (top 10 by default) are regarded as master TFs.
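Eq. 24 amounts to reading node scores off the leading eigenvector of the weighted adjacency matrix. A minimal numpy sketch on a toy symmetric GRN (weights are assumptions):

```python
import numpy as np

# Symmetric weighted adjacency of a toy GRN (nodes = TFs; values synthetic).
A = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])

# Eq. 24: c_v = entries of the eigenvector for the largest eigenvalue.
vals, vecs = np.linalg.eigh(A)
c = np.abs(vecs[:, np.argmax(vals)])  # Perron vector has uniform sign; abs normalizes it

# Rank nodes by centrality; the top-ranked TFs are candidate master TFs.
ranking = np.argsort(c)[::-1]
print(ranking)
```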

Benchmarking quantification and statistics

Grid-search parameter test for cell clustering on benchmark data

To determine the default HGT parameters for different data types, we performed a grid-search test on HGT parameters, including the paired numbers of embeddings and heads (91/13, 104/13, 112/16, and 128/16), the learning rate (0.0001, 0.001, and 0.01), and the number of training epochs (50, 75, and 100). Altogether, 36 parameter combinations were tested. For each of the three data types, HGT parameter training was performed on three benchmark datasets, and the default parameter combination was selected based on the highest median score across the three datasets (ARI for multiple scRNA-seq and CITE-seq data with benchmark labels; ASW for scRNA-ATAC-seq data without benchmark labels).
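The grid above can be enumerated directly; this sketch only builds the 36 combinations, with training and scoring left as a hypothetical step rather than the authors' actual pipeline.

```python
from itertools import product

# The grid from the text: (embedding dim, heads) pairs, learning rates, epochs.
emb_heads = [(91, 13), (104, 13), (112, 16), (128, 16)]
lrs = [0.0001, 0.001, 0.01]
epochs = [50, 75, 100]

grid = list(product(emb_heads, lrs, epochs))
print(len(grid))  # 4 * 3 * 3 = 36 combinations

# For each combination one would train HGT on the three benchmark datasets and
# keep the combination with the highest median ARI (labeled data) or ASW
# (unlabeled data); that training step is omitted here.
```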

To assess the performance of DeepMAPS against existing scMulti-omics tools, we compared DeepMAPS with Seurat (v3.2.3 and v4.0, https://github.com/satijalab/seurat), MOFA+ (v1.0.0, https://github.com/bioFAM/MOFA2), Harmony (v0.1, https://github.com/immunogenomics/harmony), TotalVI (v0.10.0, https://github.com/YosefLab/scvi-tools), and GLUE (v0.3.2, https://github.com/gao-lab/GLUE). Based on each tool's integration capability for different data types, DeepMAPS was compared with Seurat v3.2.3 and Harmony on multiple scRNA-seq data, with Seurat v4.0.0, MOFA+, and TotalVI on CITE-seq data, and with Seurat v4.0.0, MOFA+, and GLUE on scRNA-ATAC-seq data. For each benchmarking tool, grid-search tests were also applied to a combination of parameters, such as the number of dimensions used for cell clustering and the clustering resolution.

The default HGT parameter combination selected for each data type was then applied to additional datasets (one multiple scRNA-seq, one CITE-seq, and three scRNA-ATAC-seq datasets) for independent tests. All benchmarking tools used their default parameters.

To showcase the rationale for the integration and cell clustering methods selected in DeepMAPS, we evaluated cell clustering performance after replacing these methods with several alternatives. Specifically, for data integration, we replaced the CCA method with Harmony integration (multiple scRNA-seq), replaced the CLR method with the Seurat weighted nearest neighbor method (CITE-seq), and replaced the velocity-weighted method with the Seurat weighted nearest neighbor method without using velocity (scRNA-ATAC-seq). For cell clustering, we replaced Louvain clustering with the Leiden and smart local moving (SLM) methods. We also compared the influence of clustering resolution (0.4, 0.8, 1.2, and 1.6) on the cell clustering results in DeepMAPS. Each comparison was performed on all 36 parameter combinations used in the grid-search test. For DeepMAPS without velocity, we simply summed the gene expression matrix from scRNA-seq data and the gene potential activity matrix derived from scATAC-seq data, setting the velocity-derived balance weight for gene \(i\) in cell \(j\) to 1.

Adjusted rand index (ARI)

ARI measures the similarity between two clustering results by considering all pairs of samples and counting pairs assigned to the same or to different clusters, adjusted for chance by random permutation. To calculate the ARI, a contingency table is built to summarize the overlaps between the two cell label lists over \(n\) cells; each entry denotes the number of cells in common between a pair of labels from the two lists. The ARI is calculated as:

$${ARI}=\frac{\frac{{J}_{a}+{J}_{b}}{{C}_{n}^{2}}-E\left[\frac{{J}_{a}+{J}_{b}}{{C}_{n}^{2}}\right]}{{{\max }}\left(\frac{{J}_{a}+{J}_{b}}{{C}_{n}^{2}}\right)-E\left[\frac{{J}_{a}+{J}_{b}}{{C}_{n}^{2}}\right]}$$

(25)

where \(E\left[.\right]\) is the expectation, \({J}_{a}\) is the number of cell pairs assigned to the same cluster in both the predicted and benchmark labels, \({J}_{b}\) is the number of cell pairs assigned to different clusters in both labelings, and \({C}_{n}^{2}\) is the number of ways of selecting two cells from the total of \(n\) cells.
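In practice the ARI of Eq. 25 is available off the shelf; a short sketch with scikit-learn on toy labels (the label values themselves are arbitrary, since ARI is invariant to label permutation):

```python
from sklearn.metrics import adjusted_rand_score

benchmark = [0, 0, 0, 1, 1, 2]   # ground-truth cell types (toy)
predicted = [1, 1, 0, 2, 2, 0]   # cluster assignments from some tool (toy)

# ARI is 1.0 for a perfect match up to relabeling and ~0 for random assignment.
print(adjusted_rand_score(benchmark, predicted))
```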

Average Silhouette Width (ASW)

Unlike ARI, which requires known ground-truth labels, the silhouette score interprets and validates consistency within clusters of data. The silhouette width indicates how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from −1 to +1, where a high value indicates that the object is well matched to its cluster. The silhouette score \(sil(j)\) of cell \(j\) is calculated as:

$${sil}\left(j\right)=\frac{n\left(j\right)-m\left(j\right)}{{{\max }}\left\{m\left(j\right),\,n\left(j\right)\right\}}$$

(26)

where \(m(j)\) is the average distance between cell \(j\) and all other cells in the same cluster, and \(n\left(j\right)\) is the average distance between \(j\) and all cells in the nearest cluster to which \(j\) does not belong. We use the mean silhouette score over all cells as the ASW of the dataset.
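Eq. 26 and the ASW can be sketched with scikit-learn on a toy 2D embedding (coordinates are synthetic assumptions):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two well-separated toy clusters in a 2D embedding.
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
labels = np.array([0, 0, 1, 1])

sil = silhouette_samples(X, labels)   # sil(j) per cell, as in Eq. 26
asw = silhouette_score(X, labels)     # mean over all cells = ASW
print(asw)
```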

Calinski-Harabasz index

The CH index is the ratio of the between-cluster dispersion to the within-cluster dispersion over all clusters. A higher CH index indicates better performance. For a dataset E of size \({n}_{E}\) with \(k\) clusters, the CH index is defined as:

$${CH}=\frac{t\left({B}_{k}\right)}{t\left({W}_{k}\right)}\times \frac{{n}_{E}-k}{k-1}$$

(27)

$${W}_{k}=\mathop{\sum }\limits_{q=1}^{k}\mathop{\sum}\limits_{x\in {C}_{q}}\left(x-{c}_{q}\right){(x-{c}_{q})}^{T}$$

(28)

$${B}_{k}=\mathop{\sum }\limits_{q=1}^{k}{n}_{q}({c}_{q}-{c}_{E}){({c}_{q}-{c}_{E})}^{T}$$

(29)

where \(t\left({B}_{k}\right)\) is the trace of the between-cluster dispersion matrix, and \(t\left({W}_{k}\right)\) is the trace of the within-cluster dispersion matrix. \({C}_{q}\) is the set of points in cluster \(q\), \({c}_{q}\) is the center of cluster \(q\), \({c}_{E}\) is the center of E, and \({n}_{q}\) is the number of points in cluster \(q\). \(T\) denotes the matrix transpose.
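To check Eqs. 27-29 term by term, this sketch computes the traces \(t(W_k)\) and \(t(B_k)\) explicitly on synthetic clusters and compares the result with scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(2)
# Two synthetic Gaussian clusters in 3D.
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])
labels = np.repeat([0, 1], 20)

n_E, k = len(X), 2
c_E = X.mean(axis=0)
tW = tB = 0.0
for q in range(k):
    Xq = X[labels == q]
    c_q = Xq.mean(axis=0)
    tW += ((Xq - c_q) ** 2).sum()              # trace of W_k (Eq. 28)
    tB += len(Xq) * ((c_q - c_E) ** 2).sum()   # trace of B_k (Eq. 29)

ch = (tB / tW) * (n_E - k) / (k - 1)           # Eq. 27
print(ch, calinski_harabasz_score(X, labels))
```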

Davies-Bouldin index

The DB index signifies the average ‘similarity’ between clusters, where the similarity compares the distance between clusters with the sizes of the clusters themselves. A lower DB index corresponds to better separation between clusters. For data with \(k\) clusters, \(i\in k\) and \(j\in k\), the DB index is defined as:

$${DB}=\frac{1}{k}\mathop{\sum }\limits_{i=1}^{k}\mathop{{{\max }}}\limits_{i\ne j}{R}_{{ij}}$$

(30)

$${R}_{{ij}}=\frac{{s}_{i}+{s}_{j}}{{d}_{{ij}}}$$

(31)

where \({s}_{i}\) and \({s}_{j}\) are the average distances between each point in clusters \(i\) and \(j\), respectively, and the corresponding cluster centroid, and \({d}_{{ij}}\) is the distance between the centroids of clusters \(i\) and \(j\).
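A short scikit-learn sketch of the DB index on synthetic clusters, illustrating that tighter, better-separated clusters score lower (cluster positions and spreads are assumptions):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)
# Tight, well-separated clusters should give a low DB index...
X_good = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
# ...while overlapping clusters should give a higher one.
X_bad = np.vstack([rng.normal(0, 2.0, (20, 2)), rng.normal(1, 2.0, (20, 2))])
labels = np.repeat([0, 1], 20)

print(davies_bouldin_score(X_good, labels), davies_bouldin_score(X_bad, labels))
```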

Gene association network evaluations

We evaluated the performance of the gene association networks identified by DeepMAPS by comparing them with those from IRIS315 and a standard gene co-expression network inference using all genes. We calculated the closeness centrality and eigenvector centrality of the network generated by each tool; the formulations are given below.

Closeness centrality (CC)

The closeness centrality (CC)76 of a vertex \(u\) is defined as the inverse of the sum of the lengths of the shortest paths from \(u\) to all other vertices \(v\) in the undirected weighted graph:

$${CC}\left(u\right)=\frac{1}{\mathop{\sum}\limits_{v\in V}{d}_{w}\left(u,v\right)}$$

(32)

where \({d}_{w}\left(u,\,v\right)\) is the shortest weighted path length between \(u\) and \(v\). If there is no path between vertices \({{{{{\rm{u}}}}}}\) and \({{{{{\rm{v}}}}}}\), the total number of vertices is used in the formula instead of the path length. A higher CC indicates that a node is more central in the network, reflecting a more important role of the corresponding gene. The CC is calculated using the igraph R package with the function igraph::closeness. We take the average CC of all nodes in a network as the network CC.
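Assuming a connected graph (so the no-path fallback above is not needed), Eq. 32 can be sketched with an all-pairs shortest-path computation; the 3-node path graph is a toy example.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# Weighted adjacency of a 3-node path graph 0 - 1 - 2 (edge weights = 1).
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

D = shortest_path(W, directed=False)  # all-pairs shortest weighted path lengths
cc = 1.0 / D.sum(axis=1)              # Eq. 32: CC(u) = 1 / sum_v d_w(u, v)
print(cc)          # the middle node is most central
print(cc.mean())   # network-level CC, averaged over nodes as in the text
```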

Eigenvector centrality (EC)

Eigenvector centrality (EC)77 scores correspond to the entries of the leading eigenvector of the graph adjacency matrix. The EC score of \(u\) is defined as:

$${EC}\left(u\right)=\lambda \mathop{\sum}\limits_{v\in G}{a}_{{uv}}{x}_{v}$$

(33)

where λ is the inverse of the largest eigenvalue associated with the eigenvector \(x=({x}_{1},{x}_{2},\ldots,{x}_{n})\), and \({a}_{{uv}}\) is the entry of the weighted adjacency matrix of the undirected graph G. A node with a high eigenvector centrality score is connected to many nodes that themselves have high scores. The EC is calculated using the igraph R package with the function igraph::evcent. We take the average EC of all nodes in a network as the network EC.
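A power-iteration sketch of Eq. 33 on a toy undirected graph; averaging the node scores gives the network-level EC used above (the adjacency matrix is an assumption).

```python
import numpy as np

# Weighted adjacency of a small undirected graph (symmetric, synthetic):
# a triangle on nodes 0-1-2 with node 3 attached to node 2.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

# Power iteration converges to the leading eigenvector of A, i.e. the fixed
# point of x_u = (1/lambda) * sum_v a_uv x_v in Eq. 33.
x = np.ones(len(A))
for _ in range(200):
    x = A @ x
    x /= np.linalg.norm(x)

print(x)           # per-node EC scores
print(x.mean())    # network-level EC, averaged over nodes as in the text
```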

Evaluations on GRN

For scRNA-ATAC-seq data, we compared the cell-type-specific GRNs inferred by DeepMAPS with those from (i) IRIS3 and SCENIC on the scRNA-seq matrix, (ii) IRIS3 and SCENIC on the GAS matrix, (iii) MAESTRO on the scATAC-seq matrix, and (iv) MAESTRO on the original scRNA-seq and scATAC-seq matrices. For each comparison, we set the cell clusters used in the benchmarking tool to be the same as those generated by DeepMAPS to ensure fairness. The GRNs generated by each tool were compared against three public functional databases: Reactome21, DoRothEA22, and TRRUST v223. Only human datasets were used for comparison, as these databases are all human-related. We performed hypergeometric tests of the GRNs from each tool against each database and compared the precision, recall, and F1 score of the enriched GRNs and functional terminologies.

Cell cluster leave-out test

For a benchmark dataset with real cell type labels, we removed all cells of one cell type and reran DeepMAPS, traversing all cell types (one at a time) to evaluate the robustness of the ARI. For data without benchmark labels, we instead removed cells in the cell clusters predicted by DeepMAPS and the other benchmarking tools.
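The leave-out protocol can be sketched on synthetic data, using KMeans as a stand-in clusterer (DeepMAPS itself and HGT training are out of scope here; cluster positions and sizes are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
# Three well-separated synthetic "cell types" in a 2D embedding.
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 4, 8)])
labels = np.repeat([0, 1, 2], 30)

# Leave out one cell type at a time, re-cluster the rest, and record the ARI.
aris = []
for held_out in np.unique(labels):
    keep = labels != held_out
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[keep])
    aris.append(adjusted_rand_score(labels[keep], pred))
print(aris)  # a robust method keeps the ARI high for every held-out type
```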

DeepMAPS server construction

DeepMAPS is hosted on an HPE XL675d RHEL system with a 2 × 128-core AMD EPYC 7H12 CPU, 64 GB RAM, and 2 × NVIDIA A100 40 GB GPUs. The backend server is written in TypeScript using the NestJS framework. Auth0 is used as an independent module to provide user authentication and authorization services. Redis houses a queue of all pending analysis jobs. There are two types of jobs in DeepMAPS: stateful jobs, handled by the Plumber R package to provide real-time interactive analysis, and stateless jobs, such as CPU-bound bioinformatics pipelines and long-running GPU training tasks, which are constructed with Nextflow. All running jobs are orchestrated by Nomad, allowing each job to be assigned appropriate cores and storage and keeping jobs scalable with the server load. Job results are deposited into a MySQL database. The frontend is built with NUXT, using Vuetify as the UI library and Apache ECharts and Cytoscape.js for data visualization. The frontend and backend servers communicate via a REST API.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.