Authors: Langmead B, Hicks SC, Baker DN, Dyjack N, Braverman
DOI: 10.1101/2021.03.24.436859
Keywords: Divergence (statistics), Cluster analysis, Count data, Bhattacharyya distance, Reservoir sampling, Distance measures, Pattern recognition, Dimensionality reduction, Euclidean distance, Computer science, Artificial intelligence
Abstract: Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
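The abstract combines k-means++ center finding with weighted reservoir sampling. As a rough illustration of how the two ideas fit together, here is a minimal pure-Python sketch. It is not minicore's vectorized algorithm or its API: the function names (`weighted_reservoir_pick`, `kmeanspp_seed`, `squared_euclidean`) are hypothetical, the reservoir step uses the standard exponential-key trick as an assumed stand-in, and squared Euclidean distance is used only for simplicity.

```python
import math
import random


def weighted_reservoir_pick(weights, rng):
    """Return one index sampled with probability proportional to its weight.

    Single-pass weighted reservoir sampling with exponential keys: each item
    gets key -log(u) / w, and the smallest key wins. Zero-weight items are
    never selected. Returns -1 if no item has positive weight.
    """
    best_index, best_key = -1, math.inf
    for i, w in enumerate(weights):
        if w <= 0.0:
            continue
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is finite
        key = -math.log(u) / w
        if key < best_key:
            best_index, best_key = i, key
    return best_index


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, seed=0):
    """Pick up to k initial center indices using k-means++ style seeding."""
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    centers = [first]
    # cost[i] is the distance from point i to its nearest chosen center.
    cost = [squared_euclidean(p, points[first]) for p in points]
    while len(centers) < k:
        nxt = weighted_reservoir_pick(cost, rng)
        if nxt < 0:  # every point coincides with a chosen center
            break
        centers.append(nxt)
        for i, p in enumerate(points):
            cost[i] = min(cost[i], squared_euclidean(p, points[nxt]))
    return centers
```

For example, `kmeanspp_seed([(0.0, 0.0), (0.0, 0.0), (5.0, 5.0), (10.0, 0.0)], 3)` returns three indices that refer to three distinct points, because a point that already coincides with a chosen center has zero cost and can never be drawn again. Passing a different dissimilarity in place of `squared_euclidean` would be the natural place to experiment with other measures in this sketch.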