Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus

Authors: Mikio Yamamoto, Kenneth W. Church

DOI: 10.1162/089120101300346787

Keywords:

Abstract: Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer n-grams. Suffix arrays (Manber and Myers 1990) were first introduced to compute the frequency and location of a substring (n-gram) in a sequence (corpus) of length N. To compute frequencies over all N(N + 1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. In this way, a prohibitive computation over substrings is reduced to a manageable computation over classes. This paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora: an English corpus of 50 million words of the Wall Street Journal and a Japanese corpus of 216 million characters of the Mainichi Shimbun. The second half of the paper uses these frequencies to find "interesting" substrings. Lexicographers have been interested in n-grams with high mutual information (MI), where the joint term frequency is higher than what would be expected by chance, assuming that the parts of the n-gram combine independently. Residual inverse document frequency (RIDF) compares document frequency to another model of chance, where terms with a particular term frequency are distributed randomly throughout the collection. MI tends to pick out phrases with noncompositional semantics (which often violate the independence assumption), whereas RIDF tends to highlight technical terminology, names, and good keywords for information retrieval (terms that exhibit nonrandom distributions over documents). The combination of the two is better than either by itself in a Japanese word extraction task.
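As a rough illustration of the frequency-counting and RIDF ideas summarized above, the sketch below is a minimal example, not the authors' published code: it builds a suffix array naively, reads off the term frequency of an arbitrary substring as the width of its block of suffixes, and computes residual IDF as the gap between observed IDF and the IDF predicted by a Poisson (random-occurrence) model of chance. The toy corpus and the document counts in the usage lines are made-up values for illustration only.

```python
import math

def build_suffix_array(text):
    # Naive O(N^2 log N) construction; adequate for illustration, not for
    # the 50M-word / 216M-character corpora discussed in the paper.
    return sorted(range(len(text)), key=lambda i: text[i:])

def term_frequency(text, sa, sub):
    # All suffixes that start with `sub` form one contiguous block in the
    # suffix array; tf is the width of that block (two binary searches).
    n, m = len(sa), len(sub)
    lo, hi = 0, n
    while lo < hi:                      # first suffix whose prefix >= sub
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < sub:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = n
    while lo < hi:                      # first suffix whose prefix > sub
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= sub:
            lo = mid + 1
        else:
            hi = mid
    return lo - left

def residual_idf(tf, df, num_docs):
    # RIDF = observed IDF minus the IDF a Poisson model would predict for a
    # term with this total frequency distributed randomly over documents.
    idf = -math.log2(df / num_docs)
    poisson_idf = -math.log2(1.0 - math.exp(-tf / num_docs))
    return idf - poisson_idf

# Illustrative usage (character-level toy corpus; hypothetical tf/df values).
corpus = "to be or not to be"
sa = build_suffix_array(corpus)
print(term_frequency(corpus, sa, "to be"))                   # 2
print(round(residual_idf(tf=100, df=5, num_docs=1000), 3))   # bursty term -> high RIDF
```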

References (18)
Masaaki Nagata. Automatic Extraction of New Words from Japanese Texts Using Generalized Forward-Backward Search. Empirical Methods in Natural Language Processing (1996).
Ricardo A. Baeza-Yates, Gaston H. Gonnet, Tim Snider. New Indices for Text: PAT Trees and PAT Arrays. In Information Retrieval, pp. 66-82 (1992).
Patrick Hanks, Kenneth Ward Church. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, vol. 16, pp. 22-29 (1990). DOI: 10.5555/89086.89095
Kenji Kita, Yasuhiko Kato, Takashi Omoto, Yoneo Yano. A Comparative Study of Automatic Extraction of Collocations from Corpora: Mutual Information vs. Cost Criteria. Journal of Natural Language Processing (自然言語処理), vol. 1, pp. 21-33 (1994). DOI: 10.5715/JNLP.1.21
Eugene Charniak. Statistical Language Learning (1994).
Makoto Nagao, Shinsuke Mori. A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese. Proceedings of the 15th Conference on Computational Linguistics, vol. 1, pp. 611-615 (1994). DOI: 10.3115/991886.991994
Douglas B. Paul, Janet M. Baker. The Design for the Wall Street Journal-based CSR Corpus. Proceedings of the Workshop on Speech and Natural Language (HLT '91), pp. 357-362 (1992). DOI: 10.3115/1075527.1075614
Joy Hendry, Masayoshi Shibatani. The Languages of Japan (1990).