Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus

Authors: Mikio Yamamoto, Kenneth W. Church

DOI: 10.1162/089120101300346787

Keywords:

Abstract: Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer n-grams. Suffix arrays (Manber and Myers 1990) were first introduced to compute the frequency and location of a substring (n-gram) in a sequence (corpus) of length N. To compute frequencies over all N(N + 1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. In this way, a prohibitive computation over substrings is reduced to a manageable computation over classes. This paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora: an English corpus of 50 million words of the Wall Street Journal and a Japanese corpus of 216 million characters of the Mainichi Shimbun. The second half of the paper uses these frequencies to find "interesting" substrings. Lexicographers have been interested in n-grams with high mutual information (MI), where the joint term frequency is higher than what would be expected by chance, assuming that the parts of the n-gram combine independently. Residual inverse document frequency (RIDF) compares document frequency to another model of chance, where terms with a particular term frequency are distributed randomly throughout the collection. MI tends to pick out phrases with noncompositional semantics (which often violate the independence assumption), whereas RIDF tends to highlight technical terminology, names, and good keywords for information retrieval (terms that exhibit nonrandom distributions over documents). The combination of the two is better than either by itself in a Japanese word extraction task.
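As a rough illustration of the frequency-counting and RIDF ideas summarized above, the sketch below is a minimal example, not the authors' published code: it builds a suffix array naively, reads off the term frequency of an arbitrary substring as the width of its block of suffixes, and computes residual IDF as the gap between observed IDF and the IDF predicted by a Poisson (random-occurrence) model of chance. The toy corpus and the document counts in the usage lines are made-up values for illustration only.

```python
import math

def build_suffix_array(text):
    # Naive O(N^2 log N) construction; adequate for illustration, not for
    # the 50M-word / 216M-character corpora discussed in the paper.
    return sorted(range(len(text)), key=lambda i: text[i:])

def term_frequency(text, sa, sub):
    # All suffixes that start with `sub` form one contiguous block in the
    # suffix array; tf is the width of that block (two binary searches).
    n, m = len(sa), len(sub)
    lo, hi = 0, n
    while lo < hi:                      # first suffix whose prefix >= sub
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < sub:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = n
    while lo < hi:                      # first suffix whose prefix > sub
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= sub:
            lo = mid + 1
        else:
            hi = mid
    return lo - left

def residual_idf(tf, df, num_docs):
    # RIDF = observed IDF minus the IDF a Poisson model would predict for a
    # term with this total frequency distributed randomly over documents.
    idf = -math.log2(df / num_docs)
    poisson_idf = -math.log2(1.0 - math.exp(-tf / num_docs))
    return idf - poisson_idf

# Illustrative usage (character-level toy corpus; hypothetical tf/df values).
corpus = "to be or not to be"
sa = build_suffix_array(corpus)
print(term_frequency(corpus, sa, "to be"))                   # 2
print(round(residual_idf(tf=100, df=5, num_docs=1000), 3))   # bursty term -> high RIDF
```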

References (18)
Masaaki Nagata. Automatic Extraction of New Words from Japanese Texts Using Generalized Forward-Backward Search. Empirical Methods in Natural Language Processing (1996).
Ricardo A. Baeza-Yates, Gaston H. Gonnet, Tim Snider. New Indices for Text: PAT Trees and PAT Arrays. In Information Retrieval, pp. 66-82 (1992).
Patrick Hanks, Kenneth Ward Church. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, vol. 16, pp. 22-29 (1990). DOI: 10.5555/89086.89095
Kenji Kita, Yasuhiko Kato, Takashi Omoto, Yoneo Yano. A Comparative Study of Automatic Extraction of Collocations from Corpora: Mutual Information vs. Cost Criteria. Journal of Natural Language Processing (自然言語処理), vol. 1, pp. 21-33 (1994). DOI: 10.5715/JNLP.1.21
Eugene Charniak. Statistical Language Learning (1994).
Makoto Nagao, Shinsuke Mori. A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese. Proceedings of the 15th Conference on Computational Linguistics, vol. 1, pp. 611-615 (1994). DOI: 10.3115/991886.991994
Douglas B. Paul, Janet M. Baker. The Design for the Wall Street Journal-based CSR Corpus. Proceedings of the Workshop on Speech and Natural Language (HLT '91), pp. 357-362 (1992). DOI: 10.3115/1075527.1075614
Joy Hendry, Masayoshi Shibatani. The Languages of Japan (1990).