Random indexing of text samples for latent semantic analysis

Authors: Jan Kristoferson, Pentti Kanerva, Anders Holst

DOI:

Keywords:

Abstract: Latent Semantic Analysis (LSA) is a method of computing high-dimensional semantic vectors, or context vectors, for words from their co-occurrence statistics. An experiment by Landauer & Dumais (1997) covers a vocabulary of 60,000 words (unique letter strings delimited by non-letter characters) and 30,000 contexts (text samples, "documents," of about 150 words each). The data are first collected into a words-by-contexts matrix, each row representing a word of the vocabulary and each column a text sample, so that a matrix entry gives the frequency of a given word in a given sample. The frequencies are normalized, and the matrix is transformed with Singular-Value Decomposition (SVD), reducing its original 30,000 document dimensions to a much smaller number of latent dimensions, 300 proving to be optimal. Thus each word is represented by a 300-dimensional vector.

The point of all this is to capture the meaning of words. To demonstrate that the vectors do so, the authors use a synonym test called TOEFL (for "Test Of English as a Foreign Language"). For each test word, four alternatives are given, and the "contestant" is asked to find the one that is most synonymous. Choosing at random would yield 25% correct. However, when the semantic vector of the test word is compared with those of the alternatives, it correlates most highly with the correct alternative in 64% of the cases. Since such a result is not obtained with the 30,000-dimensional vectors before SVD, the authors argue that SVD's reorganization of the information somehow corresponds to human psychology.

We have worked with distributed representations and with models of brainlike representation of items (Kanerva, 1994; Sjödin, 1999). In the poster we report the use of such representations to reduce the dimensionality of the words-by-contexts matrix. The idea can be explained by contrast with the procedure above. Assume that each text sample has a 30,000-bit index vector with a single 1 marking the sample's place in the list of samples (i.e., the n-th bit of the n-th sample's index vector is 1); the index vector is unitary, or local. A word's row of the matrix is then gotten by the following procedure: every time the word w occurs in a sample, the sample's index vector is added to the row for w. This procedure for accumulating the words-by-contexts matrix does not require the index vectors to be unitary. In Random Indexing, a text sample's index vector is "small" by comparison: it has several randomly placed -1s and 1s, with the rest 0s (e.g., four each of -1s and 1s, eight non-0s in 1,800 dimensions, instead of one non-0 in 30,000 as above). Thus we would accumulate the same data into a 1,800-column words-by-contexts matrix, a word space of 1,800 dimensions instead of 30,000.

Our results have been verified with different data, the ten-million-word "TASA" corpus, consisting of a 79,000-word vocabulary (when words are truncated after the 8th character) and 37,600 text samples. The frequencies were accumulated into a 79,000 x 1,800 matrix, which was normalized by thresholding the entries into 0s and 1s. The unnormalized 1,800-dimensional vectors gave 35-44% correct on the TOEFL test and the normalized ones 48-51% correct; these correspond to Dumais' 36% with normalized 30,000-dimensional vectors before SVD. The matrix can be processed further, for example with SVD, as in LSA, except that everything is smaller. Mathematically, the 30,000- and 37,600-dimensional index vectors are orthogonal, whereas the 1,800-dimensional ones are only nearly orthogonal. They seem to work just as well, and in addition they are more "brainlike" and less affected by the number of text samples (1,800-dimensional index vectors cover a wide-ranging number of samples). We have also used such indexing based on narrow context windows, getting 62-70% correct. We conclude that the results are good: Random Indexing deserves to be studied and understood fully.

Acknowledgments. This research was supported by Japan's Ministry of International Trade and Industry (MITI) under the Real World Computing Partnership (RWCP) program. The TASA corpus was made available to us by courtesy of Professor Thomas Landauer, University of Colorado.

References
Kanerva, P. (1994). The Spatter Code for encoding concepts at many levels. In M. Marinaro & P. G. Morasso (eds.), ICANN '94: Proceedings of the International Conference on Artificial Neural Networks (Sorrento, Italy), vol. 1, pp. 226-229. London: Springer-Verlag.
Kanerva, P. (1999). Stochastic pattern computing. In RWC 2000 Symposium (Report TR-99-002, pp. 271-276). Tsukuba-city, Japan: Real World Computing Partnership.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.
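The accumulation procedure described in the abstract lends itself to a short illustration. Below is a minimal Python sketch, not the authors' code: each text sample gets a sparse random index vector (here four +1s and four -1s among 1,800 dimensions, as in the abstract's example), and a word's context vector is the sum of the index vectors of the samples it occurs in. The function names, the toy corpus, and the use of cosine similarity in place of the paper's correlation measure are illustrative assumptions.

```python
import numpy as np


def make_index_vector(dim=1800, num_plus=4, num_minus=4, rng=None):
    """Sparse ternary index vector: a few randomly placed +1s and -1s, the rest 0s."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.zeros(dim)
    positions = rng.choice(dim, size=num_plus + num_minus, replace=False)
    v[positions[:num_plus]] = 1.0
    v[positions[num_plus:]] = -1.0
    return v


def random_indexing(samples, dim=1800, seed=0):
    """Every time a word occurs in a text sample, add the sample's index
    vector to the word's context vector."""
    rng = np.random.default_rng(seed)
    context = {}
    for sample in samples:
        index_vector = make_index_vector(dim=dim, rng=rng)  # one index vector per sample
        for word in sample.lower().split():
            context.setdefault(word, np.zeros(dim))
            context[word] += index_vector
    return context


def cosine(a, b):
    """Cosine similarity (the paper compares vectors by correlation, similar in spirit)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


if __name__ == "__main__":
    # Toy corpus: words that share documents end up with overlapping context vectors.
    samples = [
        "the doctor examined the patient",
        "the doctor and the physician examined the patient",
        "the physician treated the patient at the clinic",
        "the boat sailed across the lake",
        "the boat and the ship crossed the sea",
    ]
    vectors = random_indexing(samples, dim=1800)

    # TOEFL-style question: which alternative is closest in meaning to the test word?
    test_word, alternatives = "doctor", ["physician", "boat", "sea"]
    best = max(alternatives, key=lambda w: cosine(vectors[test_word], vectors[w]))
    print("closest to", test_word, "->", best)  # expected: physician
```

With unitary (one-hot) index vectors the same loop reproduces the ordinary words-by-contexts counts; the sparse random vectors merely fold the 30,000 document columns into 1,800 nearly orthogonal directions, which is the point made in the abstract.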
