minicore: Fast scRNA-seq clustering with various distances

Authors: Langmead B, Hicks SC, Baker DN, Dyjack N, Braverman

DOI: 10.1101/2021.03.24.436859

Keywords: Divergence (statistics), Cluster analysis, Count data, Bhattacharyya distance, Reservoir sampling, Distance measures, Pattern recognition, Dimensionality reduction, Euclidean distance, Computer science, Artificial intelligence

Abstract: Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
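
The abstract names two building blocks: k-means++ center finding driven by weighted sampling, and distance measures that apply directly to probability distributions. The sketch below illustrates how these pieces could fit together in plain Python. It is not minicore's implementation or API. All function names are hypothetical, the sampler is the textbook Efraimidis-Spirakis key method with a reservoir of size one (not the vectorized variant the paper describes), and the small additive prior in `normalize` is an assumed stand-in for the paper's handling of priors.

```python
import math
import random


def normalize(counts, prior=1e-9):
    """Turn a count vector into a distribution; the tiny prior (an assumption) avoids log(0)."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance: negative log of the Bhattacharyya coefficient."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(min(bc, 1.0))  # clamp guards against rounding above 1


def weighted_sample_index(weights, rng):
    """Pick index i with probability w_i / sum(w) in a single pass.

    Each item gets the key log(u) / w with u ~ Uniform(0, 1]; the largest key wins.
    Returns None when no weight is positive.
    """
    best_index, best_key = None, -math.inf
    for i, w in enumerate(weights):
        if w <= 0:
            continue  # zero-weight items can never be selected
        u = 1.0 - rng.random()  # shift to (0, 1] so log(u) is finite
        key = math.log(u) / w
        if key > best_key:
            best_index, best_key = i, key
    return best_index


def kmeanspp_centers(points, k, distance, seed=0):
    """k-means++-style seeding: sample each new center in proportion to its current cost."""
    rng = random.Random(seed)
    centers = [rng.randrange(len(points))]
    costs = [distance(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        idx = weighted_sample_index(costs, rng)
        if idx is None:
            break  # every point already coincides with a chosen center
        centers.append(idx)
        # Each point keeps the cost to its nearest chosen center.
        costs = [min(c, distance(p, points[idx])) for p, c in zip(points, costs)]
    return centers


# Toy data: four "cells" with counts over four "genes".
cells = [normalize(c) for c in [[9, 1, 0, 0], [8, 2, 0, 0], [0, 0, 7, 3], [0, 1, 6, 3]]]
centers = kmeanspp_centers(cells, k=2, distance=jensen_shannon_divergence)
```

Passing the distance as a parameter means the same seeding loop works with `kl_divergence` or `bhattacharyya_distance`. For Euclidean k-means++, the cost passed in would be the squared Euclidean distance.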
