Authors: Mikio Yamamoto, Kenneth W. Church
DOI: 10.1162/089120101300346787
Keywords:
Abstract: Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer n-grams. Suffix arrays (Manber and Myers 1990) were first introduced to compute the frequency and location of a substring (n-gram) in a sequence (corpus) of length N. To compute frequencies over all N(N + 1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. In this way, a prohibitive computation over substrings is reduced to a manageable computation over classes. This paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora, an English corpus of 50 million words of Wall Street Journal and a Japanese corpus of 216 million characters of Mainichi Shimbun. The second half of the paper uses these frequencies to find "interesting" substrings. Lexicographers have been interested in n-grams with high mutual information (MI), where the joint term frequency is higher than what would be expected by chance, assuming that the parts of the n-gram combine independently. Residual inverse document frequency (RIDF) compares document frequency to another model of chance where terms with a particular term frequency are distributed randomly throughout the collection. MI tends to pick out phrases with noncompositional semantics (which often violate the independence assumption), whereas RIDF tends to highlight technical terminology, names, and good keywords for information retrieval (terms that exhibit nonrandom distributions over documents). The combination of the two is better than either by itself in a Japanese word extraction task.
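The two core ideas in the abstract can be illustrated with a minimal sketch, not the authors' implementation: a suffix array (sorted suffix start positions) lets the frequency of any substring be found with binary search over sorted suffixes, and RIDF compares the observed IDF of a term with the IDF expected if its tf occurrences were scattered over D documents at random (a Poisson model). Function names and the naive O(N² log N) construction here are illustrative assumptions; the paper's actual algorithms are more sophisticated and group substrings into equivalence classes.

```python
import math

def suffix_array(text):
    # Naive construction (illustrative only): sort all suffix start
    # positions lexicographically by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])

def substring_freq(text, sa, pat):
    # Frequency of `pat` = size of the contiguous block of suffixes
    # in the sorted order that start with `pat`, found by binary search.
    n = len(pat)
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound of the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] < pat:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    lo, hi = start, len(sa)
    while lo < hi:                      # upper bound of the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] <= pat:
            lo = mid + 1
        else:
            hi = mid
    return lo - start

def ridf(tf, df, D):
    # Residual IDF: observed IDF minus the IDF expected under a Poisson
    # model in which tf occurrences fall randomly into D documents
    # (so a document is empty of the term with probability exp(-tf/D)).
    observed_idf = -math.log2(df / D)
    expected_idf = -math.log2(1 - math.exp(-tf / D))
    return observed_idf - expected_idf
```

A term that is bursty (concentrated in few documents relative to its tf) gets a large positive RIDF, while a term spread randomly gets a value near zero; e.g. `substring_freq("banana", suffix_array("banana"), "ana")` returns 2.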