作者: Ioannis Mitliagkas , Constantine Caramanis , Prateek Jain
DOI:
关键词:
摘要: In this paper, we consider a streaming one-pass-over-the-data model for Principal Component Analysis (PCA). The input, in case, is stream ofp-dimensional vectors, and the output collection of k, p-dimensional principal components that span best approximating subspace. Consequently, minimum memory requirement such problems O(kp). Yet standard PCA algorithm requires us to form empirical covariance matrix, typically dense p hence requiring O(p 2 ) memory. Although there exist several incremental algorithms require O(kp) memory, our understanding, these methods currently do not have known nite-sample performance bounds. That is, high-dimensional setting where number samples dimensionality scale together, no provably correct algorithm. This paper considers simple but important problem. We give what knowledge, rst only makes single pass over data, whose matches batch up logarithmic factors.