The scDEED algorithm
Given a gene-by-cell matrix (after appropriate normalization and logarithmic transformation, i.e., the input into PCA with users’ discretion) with n columns (corresponding to n cells) and a 2D-embedding method, e.g., t-SNE and UMAP, with a given hyperparameter setting, scDEED finds dubious cell embeddings in the following six steps.
Step 1. scDEED constructs a permuted data matrix by independently permuting the n cells for each gene (i.e., independently shuffling the n values in each row) in the original data matrix.
Step 2 (optional). In most single-cell data analysis pipelines, e.g., the Python package Scanpy46 and the R package Seurat47, users perform PCA with a chosen K (the number of PCs) on the original data matrix before applying the 2D-embedding method. Similarly, scDEED asks users to input K, and it performs PCA with K on both the original data matrix and the permuted data matrix. Users may choose K based on their preferred method. For the results of this study, we chose K following the original studies or based on the elbow plot if the original studies did not provide K.
Step 3. scDEED applies the 2D-embedding method (with a given hyperparameter setting and rand.seed = 100) to the original and permuted pre-embedding matrices. These matrices are either the original data (if Step 2 is not performed) or the K PCs (so each matrix has dimensions K × n). Hence, each cell receives two 2D embeddings, one original and one permuted.
Step 4. Based on the original data before and after the 2D embedding, scDEED defines a reliability score for each cell i = 1,…, n, based on cell i’s x% (default x = 50, the only hyperparameter of scDEED) closest neighbors in the 2D-embedding space and those in the pre-embedding space (the PC space if Step 2 is performed or the original space otherwise), with the neighbors in each space defined based on the Euclidean distance. Given the two sets of neighbors, scDEED calculates cell i’s Euclidean distances to the ordered neighbors (from the closest to the farthest) in each set in the 2D-embedding space, obtaining two distance vectors of length x% × n (rounded to the closest integer). Finally, scDEED defines the reliability score as the Pearson correlation of the two distance vectors. That is, each cell’s reliability score ranges from −1 to 1; a higher reliability score indicates a better agreement between the cell’s ordered neighbors before and after the 2D embedding. We use the Pearson correlation because the actual values of the Euclidean distances in the 2D embedding space matter (for our visualization and interpretation), not just the ranks of the Euclidean distances used in the Spearman correlation.
Step 5. Based on the permuted data before and after the 2D embedding, scDEED applies the same procedure in Step 4 to obtain the null reliability scores of the n cells. Because of the permutation, the similarities among cells are disrupted, and no biological neighboring relationships are preserved by the 2D embedding. Hence, each cell’s neighbors are purely determined by random chance, and its reliability score reflects the random agreement between its ordered neighbors before and after the 2D embedding. Leveraging the n null reliability scores, scDEED finds the thresholds for calling a cell’s reliability score low or high.
Step 6. scDEED defines dubious cell embeddings as the embeddings of the cells whose reliability scores are less than or equal to the 5-th percentile of the n null reliability scores. On the other end, scDEED defines trustworthy cell embeddings as the embeddings of the cells whose reliability scores are greater than or equal to the 95-th percentile of the n null reliability scores.
After the above steps, scDEED reports the number of dubious cell embeddings given a parameter setting. From a grid search of candidate hyperparameter settings, scDEED finds the setting that minimizes the number of dubious cell embeddings.
In the scDEED R package, the following candidate hyperparameter values are set by default, but users can specify their own candidate hyperparameter values. For t-SNE, the default candidate perplexity values are 20, 50, …, 380, 410, 450, 500, …, 750, and 800. For UMAP, the default n.neighbors values are 5, 6, …, 29, 30, 35, 40, 45, and 50; the default min.dist values are 0.0125, 0.05, 0.1, 0.2, …, 0.7, and 0.8.
Analysis of the effectiveness of the permutation strategy
We illustrate the effects of permutation in removing cell-cell relationships by permuting the inDrops dataset twice (with different random seeds). First, we examined the PCA plots of the original inDrops dataset and the two permuted datasets (Supplementary Fig. S16a). We observed that the annotated cell types were distinguishable in the original PCA plot, while in the two permuted PCA plots, all cell types were mixed. Next, we examined the t-SNE plots (at the perplexity of 40) of the three datasets and observed a similar loss of cell type patterns in the permuted datasets (Supplementary Fig. S16b). Lastly, we examined the gene expression levels in the three datasets (Supplementary Fig. S16c). We observed that the annotated cell types exhibited clustered patterns of gene expression profiles in the original dataset; however, these patterns disappeared in the permuted datasets. Hence, we conclude that permutation is effective for removing cell-cell relationships in real data.
Sensitivity analysis of scDEED’s only hyperparameter “similarity percent”
Steps 4 and 5 of the scDEED algorithm require the only hyperparameter of scDEED, x, the “similarity percent” (i.e., the percentage of closest neighbors, or the neighborhood size). The default value is x = 50, meaning that half of all cells are considered as neighbors. Intuitively, a smaller value of x defines a smaller neighborhood size and would thus place a greater emphasis on preserving local structures. To investigate the effect of x on the performance of scDEED, we examined x = 5, 20, 35, 50, 65, 80, and 95 on the Hydra dataset with t-SNE as the embedding method.
First, we examined the numbers of dubious and trustworthy cell embeddings found at each x when applying scDEED under a range of t-SNE perplexity values (Supplementary Fig. S17a–c). At a small similarity percent x = 5, the number of dubious cell embeddings was relatively stable across the perplexity values, an expected result as t-SNE was designed to preserve cells’ local neighborhoods. At a large similarity percent x = 95, the number of trustworthy embeddings was smaller than at the other x values under most perplexity values; this result was also expected because t-SNE was not designed to preserve cells’ global topology. Hence, using a too small or large x would not reflect t-SNE’s ability to preserve mid-range neighbors or adjacent cell clusters’ relative positions. This result justified the default value of x = 50 in scDEED.
Second, we observed that, at any x, as the perplexity value increased past a threshold (around 170 for the Hydra dataset), the number of dubious cell embeddings tended to stay stable and did not decrease further (Supplementary Fig. S17a). This result provided evidence that scDEED does not have a bias towards large perplexity values. Notably, at the original perplexity, too small or large x values found fewer dubious cell embeddings than the x values around the default x = 50 did, again implying that too small or large x values are unsuitable for detecting dubious cell embeddings (Supplementary Fig. S17d). Most importantly, the x values around 50 resulted in optimized perplexity values that were similar, confirming the stability of scDEED with the default x = 50 (Supplementary Fig. S17e).
To further explain the above observations, we examined the distributions of reliability scores in the permuted data and the original data at each x. We found the distributions of null reliability scores (i.e., reliability scores of permuted cells) to be stable across x values despite exhibiting a slight monotone shift to the right as x increased (Supplementary Fig. S17f). On the original data, the distribution of reliability scores of unpermuted cells was most concentrated on lower scores when x = 95 (Supplementary Fig. S17g). This explains why x = 95 found the fewest cell embeddings to be trustworthy.
Third, we examined the dubious or trustworthy cell embeddings detected by scDEED at x = 5, 50, or 95 from the original Hydra embeddings. Several clusters in the original embeddings were detected as dubious at both x = 50 and 95, but not at x = 5 (Supplementary Fig. S18a). Also, only at x = 5, almost all cell embeddings were found as trustworthy (Supplementary Fig. S18b). We believe that the different detection result at x = 5 was due to the fact that examining too small neighborhoods was ineffective in revealing the clusters that had dubious positions relative to other clusters.
Fourth, we further examined the similarities of dubious cell embeddings detected at different x values given a perplexity value. We focused on dubious cell embeddings because their number is the criterion scDEED uses for optimization. We considered three perplexity values covering a wide range: perplexity 40 (used for the original embeddings), perplexity 230 (optimized by scDEED at x = 50), and perplexity 410 (the maximum candidate perplexity value in the scDEED package) (Supplementary Fig. S19). Given each perplexity value, we used scDEED to detect a set of dubious cell embeddings at each x value; then we calculated the Jaccard index between the two sets for every pair of x values (Supplementary Fig. S19). We observed that the dubious embeddings detected at too small x values had little-to-no agreement with the dubious embeddings detected at other x values, an undesirable result as we would expect the dubious embeddings to be reasonably robust to the x value. In contrast, middle-to-high x values (x = 50, 65, and 80) tended to have high agreement with each other, particularly x = 50. We also examined three x values close to 50 (x = 40, 50, 60) and confirmed that their respectively optimized visualizations were highly similar (Supplementary Fig. S20).
In conclusion, the above sensitivity analysis results supported our default choice of x = 50. A similar rationale is described in ref. 27, which found that effective dimension reduction required emphasis on mid-range neighbors.
Alternative hyperparameter optimization via “kneedle”
Instead of looking for the hyperparameter value (e.g., the t-SNE perplexity) to minimize the number of dubious embeddings over a default grid of candidate hyperparameter values, we implemented the “kneedle” method that searches for the hyperparameter value as the elbow point in the plot of the number of dubious embeddings (i.e., the y-axis) versus the hyperparameter value (i.e., the x-axis)30. We investigated this alternative optimization approach on two datasets and found that the resulting t-SNE visualizations were highly similar to those resulted from the grid-search global min approach used in scDEED (Supplementary Figs. S3–4, 20).
We also compared the two hyperparameter optimization approaches (the global min approach used in scDEED and the “kneedle” method) for three x values (40, 50, and 60) on the Hydra dataset (Supplementary Fig. S20). We found that for each x value, the two approaches’ optimized perplexity values led to highly similar t-SNE visualizations. The results confirm that scDEED’s optimized visualization based on the number of dubious embeddings is not sensitive to the optimization approach.
Implementation of t-SNE and UMAP
We performed t-SNE and UMAP using the functions RunTSNE() and RunUMAP() respectively in the R package Seurat (version 3.2.3). The hyperparameters scDEED optimizes are perplexity in the RunTSNE() function and n.neighbors and min.dist in the RunUMAP() function. We used “seed.use = 100” when running RunTSNE() and RunUMAP() and kept the rest of the arguments as default.
Assessing the purity of cell clusters in the Hydra dataset
We used the function CalculateRogue() in the R package Rogue (version 2.0.0) to calculate the ROGUE statistic, which measures the purity of a cell cluster. The larger the ROGUE value, the purer (or more homogeneous) the cell cluster. In the Hydra dataset, the ROGUE values of the five clusters of neuron ectodermal cells are 0.710 (neuron ec1), 0.784 (neuron ec2), 0.714 (neuron ec3), 0.793 (neuron ec4), and 0.839 (neuron ec5).
Calculation of ARI
We used the function ARI()in the R package aricode (version 1.0.2) to calculate the adjusted Rand index (ARI)39, which represents the agreement between two sets of labels (e.g., a set of cells’ cluster labels and a set of cells’ annotated type labels) with adjustment for chance agreement of labels. ARI ranges from 0 to 1, with 1 indicating perfect agreement and 0 indicating no agreement beyond random chance.
Selection of genes in heatmaps
In every heatmap, unless otherwise specified, we plotted the top 300 genes that have the largest expression variances (based on the gene expression values before the PCA step in the Seurat package) across the cells shown in the heatmap.
DE gene identification
Differentially expressed (DE) genes were found using FindMarkers()function in the R package Seurat (version 4.3.0.1) with the Wilcoxon rank-sum test (the default setting).
Evaluation metrics for local and global preservation
We evaluated the preservation of information as in ref. 20, using the following two metrics.
KNN reflects the preservation of local information, i.e., the average proportion of the K = 10 nearest neighbors in the pre-embedding PC space that remain in the set of K = 10 nearest neighbors in the 2D-embedding space. KNN was also used in ref. 44.
K-nearest clusters (KNC) reflects the preservation of global information, i.e., the average proportion of the K = 4 nearest clusters in the pre-embedding PC space that remain in the set of K = 4 nearest clusters in the 2D-embedding space. Note that for KNC, we deviated from ref. 20 by defining each cluster center as the median rather than the mean because the median is more robust to outliers.
Datasets
Whenever preprocessed datasets were available, they were directly used in this study. Otherwise, datasets were preprocessed in the same way as in the original studies that generated the data. Below is the preprocessing detail for every dataset. The preprocessed data and code are available on Zenodo (https://zenodo.org/record/7216361#.ZDNgd-zMLJ8).
Hydra
The dataset Hydra/Hydra_Seurat_Whole_Transcriptome.rds (from the original study) contains the transcriptomes of n = 25,052 single Hydra polyp cells sequenced by Drop-seq, with the cells labeled with cluster labels, and 33,391 genes’ scaled expression levels processed by Seurat28. The data from the original study was archived at NCBI GEO with the accession code GSE121617. Following the original study, we used K = 31 PCs in Step 2 of scDEED, and we used the default RunTSNE(dims = 1:5) in the Seurat R package as the 2D-embedding method. The preprocessing code is in Hydra/data_processing_hydra.Rmd, and the preprocessed dataset is in Hydra/Hydra.rds.
We ran scDEED with the candidate perplexity values 10, 30, …, 390, and 410, as well as the value 40 used in the original study. The running time was 2.80 hours (see “Computing environment”).
CAR-T
In patients with B cell malignancies, lymphodepletion chemotherapy followed by infusion of CD19-specific chimeric antigen receptor modified-T (CAR-T) cells is known to generate anti-tumor responses. The dataset was produced to understand the clonal composition of CAR-T cells in the infusion products (IP) after the adoptive transfer. In particular, the dataset contains a sample of 10 patients who received CD19-specific CAR-T cells, and it is representative of the population in terms of age, sex, adverse events, clinical outcome, lymphodepletion therapy, and cell dose. Using the 10x Genomics platform, single-cell RNA-seq data were generated from n = 62,167 CD8 + CAR-T cells sorted based on truncated human epidermal growth factor receptor (EGFRt) expression from the IP and blood at the early (day 7–14), late (day 26–30), and very late (day 83–112) time points after infusion. This dataset is in the file CART/raw.expMatrix.csv, downloaded from the Gene Expression Omnibus (GEO) with the accession code GSE125881. Following the original study, we used K = 15 PCs in Step 2 of scDEED, and we used the default RunTSNE(dims = 1:5) in the Seurat R package as the 2D-embedding method. The preprocessing code is in CART/data_processing_CART.Rmd, and the preprocessed dataset is in CART/seuratObj_v3.RData.
We ran scDEED with the candidate perplexity values 20, 50, …, 380, 410, 450, 500, …, 750, and 800, as well as the value 30 used in the original study. The running time was 14.37 h (see “Computing environment”)
Alveolar
This dataset was constructed to learn the cell-cell communication during the regeneration process after bleomycin-induced lung injury. It contains whole-organ single cell suspensions from mice, from six-time points after injury and uninjured control lungs with four replicate mice per time point on average. Single-cell transcriptomes from about 1000 cells per individual mouse were carried out using the Dropseq workflow, leading to a sample of n = 29,297 cells in the final dataset. The raw dataset is available at NCBI GEO with the accession code GSE141259. Following the original study, we used K = 50 independent components in Step 2 of scDEED to obtain the pre-embedding space prior to applying UMAP in the Seurat R package, RunUMAP(dims = 1:50). Unlike RunTSNE, Seurat requires the user to specify the input dimension for UMAP. The preprocessing code is in Alveolar/data_processing_Alveolar.Rmd, and the preprocessed dataset is in Alveolar/Seurat_v3.RData.
We ran scDEED with the candidate n.neighbors values 5, 6, …, 9, 10, 15, 20, …, 45, 50, 80, 160, …, 240, and 320; and candidate min.dist values 0.0125, 0.05, 0.1, 0.2, …, 0.7, and 0.8. The running time was 1.37 h for marginal optimization of n.neighbors, 38.59 min for marginal optimization of min.dist, and 12.42 h for joint optimization (see “Computing environment”).
Samusik
Cells were gathered from bone marrow samples, and cell surface markers were used for CyTOF analysis. Data were normalized and annotated with clusters and the hand-gated populations. Doublets and neutrophils were removed. The original dataset is available at [https://figshare.com/s/9c3a0136f12b97f1dadd]14. The final dataset has n = 841,644 cells. Following the original study, we used p = 38 genetic markers as the pre-embedding space (optional Step 2 was omitted) before applying UMAP using the R command RunUMAP(features = feature_list), where feature_list refers to the 38 markers. The preprocessing code is in Samusik/data_processing_samusik01.Rmd, and the preprocessed dataset is in Samusik/samusik01_seurat.Rdata.
We ran scDEED with the same candidate n.neighbors and min.dist values as for the Alveolar dataset. The running time was 12.43 h for marginal optimization of n.neighbors, 4.89 h for marginal optimization of min.dist, and 3.77 days for joint optimization (see “Computing environment”).
Human PBMC
This dataset was gathered by the Broad Institute to compare seven single cell/single nucleus sequencing methods38. The original study manually annotated cells based on canonical cell markers. Here, we focused on three sequencing methods (inDrops, DropSeq, and SeqWell) and four common cell types Cytotoxic T cell, CD4 + T cell, CD14+ Monocyte, and B cell. This resulted in n = 5858 cells for inDrops, n = 5801 cells for DropSeq, and n = 3626 cells for SeqWell. The entire dataset is available as pbmcsca.SeuratData in the R package SeuratData (also available at NCBI GEO with the accession number GSE132044). The subset of data we analyzed is in Across_Techniques/Seurat.Rdata. We used K = 50 PCs in Step 2 of scDEED. The 2D-embedding space was obtained using RunUMAP(dims = 1:50) and RunTSNE(dims = 1:5).
We ran scDEED with the candidate perplexity values 5, 10, …., 135, and 140; n.neighbors values 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 80, 160, and 240; and min.dist values 0.0125, 0.05, 0.1, 0.3, 0.5, 0.7, and 0.9.
Marrow
This dataset was used in the EMBEDR paper25 and is a subset of a single-cell transcriptome of Mus. musculus42. Cells were harvested from mice and sorted with fluorescent-activated cell sorting (FACS). Sequencing was done using the Smart-seq2 protocol with Illumina sequencing. The original dataset is available as Marrow/Marrow_counts at
[https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733]. The original dataset contained n = 5037 cells. Following the preprocessing notebook available at EMBEDR’s GitHub (Marrow/Marrow preprocessing.ipynb), we obtained n = 4821 cells (Marrow/Marrow_processed.csv). This differs from EMBEDR’s reported n = 4771 cells after the preprocessing (in the EMBEDR publication). However, the EMBEDR’s authors’ code indicated that all n = 5037 cells were used for analysis.
Despite this discrepancy, our preprocessed n = 4821 cells with 17,303 genes replicated the EMBEDR results fairly well. For fair comparison, the processed data was used for all analyses. Following the EMBEDR tutorial at
[https://github.com/ejohnson643/EMBEDR/blob/master/projects/Figures/Figure_04v1_GlobalParameterSweep.ipynb], we used K = 50 PCs as the pre-embedding space prior to EMBEDR optimization. For scDEED optimization, we used K = 16 PCs (chosen from an elbow plot) in Step 2 of scDEED. The low dimensional space and visualizations were obtained using the default Seurat R command RunTSNE(dims = 1:5).
We ran scDEED with the candidate values 10, 15, 20, 30, 40, 50, 60, 80, 100, 150, 200, 300, 350, 500, 600, 800, 1000, 1300, 1700, 2200, 2900, and 3700 to mimic the Keff values in the original Fig. 4 in the EMBEDR paper25. The running time was 10 min (see “Computing environment”).
DG
This dentate gyrus dataset was measured to elucidate the gyrus cell lineage. The 10x Genomics processed data used in the tutorial of the R package Velocyto (version 0.6) was analyzed (Velocyto/10×43.1_loom). The tutorial is available at
Cells were annotated using the Louvain clustering at the default resolution (0.8) based on the marker genes from the original paper (Velocyto/data_annotated.Rds). The dataset consists of n = 3396 cells and 92,135 features across the spliced and unspliced assays and is accessible at
[http://pklab.med.harvard.edu/velocyto/DG1/10X43_1.loom]. For scDEED optimization, we used K = 12 PCs in Step 2 of scDEED and obtained the low dimensional space using the default Seurat R command RunTSNE(dims = 1:5). Final visualization used the command RunTSNE(dims = 1:12).
We ran scDEED with the candidate perplexity values 20, 50, …, 380, 410, 450, 500, 600, 700, and 800.
Simulated data
The 20 simulated datasets (Simulated_Data/Simulated_data_1.Rds, …, Simulated_Data/Simulated_data_20.Rds) were generated by scDesign343, which was trained on a built-in dataset of mouse small intestinal epithelial cells of the R package scDesign248 (GEO accession code GSE9233249). To increase the distances between three cell types (Enterocyte. Progenitor, TA.Early, and Stem), we independently permuted each cell type’s gene expression mean values (every gene has a mean parameter value in each cell type) across all genes. To ensure that gene expression did not largely deviate from the specified cluster means, the 100 largest dispersion parameters were divided by 150. In total, we had 10,000 genes and n = 7217 cells. For scDEED optimization, we used K = 12 PCs in Step 2 of scDEED. To obtain the 2D-embedding space, we used RunUMAP(dims = 1:12) and the default RunTSNE(dims = 1:5). For EMBEDR, we also used K = 12 PCs for its optimization.
To compare t-SNE and UMAP fairly, we calculated the final 2D embeddings, which were used for KNN and KNC comparison, all using K = 12 PCs as the pre-embedding space.
We ran scDEED with the candidate perplexity values 20, 50, …, 380, 410, 450, 500, …, 750, and 800 (default settings in scDEED).
RNA velocity
RNA velocity was performed using Velocyto (version 0.6) with default settings using the tutorial available at:
Comparison with EMBEDR and DynamicViz
EMBEDR is available as a Python package. Hyperparameter sweeps were performed following the available tutorials with default settings, including the option to use all available processors. Applying EMBEDR to the Hydra dataset, we used the suggested 25 data embeddings with 15 null embeddings. Applying EMBEDR to the 20 simulated datasets, we used 5 data embeddings with ten null embeddings to save computational time. Since EMBEDR requires the user to provide a list of candidate hyperparameter parameters, we used the default lists of perplexity and n.neighbors values in scDEED. EMBEDR does not sweep over min.dist, so for a fair comparison, we fixed min.dist at 0.1 (default EMBEDR setting) when using scDEED.
EMBEDR categorizes cells as well-embedded or noisy. For consistency in terminology between EMBEDR and scDEED, we considered well-embedded cells to have trustworthy embeddings and noisy cells to have dubious embeddings. Specifically, we defined dubious cell embeddings to be the cells with EMBEDR p values above 0.1 based on the EMBEDR paper25, which considers all cells with p values > 0.1 to have similar levels of noise.
DynamicViz is also available as a Python package. Parameter sweeps are not built in but can be iterated. We were able to successfully use this package for t-SNE, yet for UMAP there were some errors. Due to this and the conceptual difference in the definition of dubious cell embeddings between DynamicViz and scDEED, we decided to omit DynamicViz from our analysis.
Statistics and reproducibility
No statistical method was used to predetermine the sample size. For reproducibility, please refer to the Zenodo deposit ([https://zenodo.org/record/7216361#.ZDNgd-zMLJ8]) for all the code used to generate figures, as well as the processed datasets.
Versions of R packages
Seurat version 4.3.0.1: all the t-SNE and UMAP analyses except the EMBEDR analysis.
SeuratData version 0.2.2: for the dataset “pmbcsca.SeuratData”.
doParallel version 1.0.15; foreach version 1.5.0: for parallel computing and looping.
ggsci version 2.9: for plotting.
Rogue version 2.0.0: for assessing the purity of a cell cluster.
distances version 0.1.8: fast computation for pairwise distances between vectors.
velocyto.R version 0.6: RNA velocity analysis.
scDesign3 version 0.99.6 for data simulation.
Other packages:
Rfast version 1.9.9; VGAM version 1.1.3; pracma version 2.2.9; ggplot2 version 3.3.2; SeuratWrappers version 0.3.0; aricode version 1.0.2
Computing environment
All algorithms and code were executed on an iMac with 3.6 GHz Intel Core i9 processor, 64GB memory, and Mojave 10.14 system. For the data analysis performed in this paper, six cores were used.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.