System and method for identifying the language of written text having a plurality of different length n-gram profiles

Author: Miguel Cardoso de Campos

Abstract: A window of letters is identified within a text sample input. If the window contains matches to reference letter sequences (RLS) contained in multiple sets of n-gram language profiles (profiles), then the longest match is kept and scored for each language. Scoring is based on frequency parameters of the matched RLS in the profiles. The window is incrementally shifted through the text sample input, with matching and scoring done for each window. At the end of the text sample input, the language having the highest cumulative score is identified as the text sample's language. Scoring may be improved by restricting longer RLS to full words, using two passes where the second pass disregards languages that are not near the highest score during the first pass, favoring complete words in the scoring, and increasing the score for an RLS that does not frequently appear in many languages. Profiles may be enhanced by removing some RLS if they do not meet a predefined threshold or a variable threshold.
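The windowed longest-match scoring described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the toy profiles, their weights, the maximum n-gram length, and the names `PROFILES`, `MAX_N`, and `identify_language` are all assumptions made for illustration. The optional refinements (two passes, full-word restrictions, rarity bonuses, profile pruning) are left out.

```python
from collections import defaultdict

# Hypothetical toy profiles: language -> {reference letter sequence: weight}.
# Real profiles would be derived from corpus frequencies; these values are made up.
PROFILES = {
    "en": {"th": 0.9, "the": 1.5, "ing": 1.2, "and": 1.1, "he": 0.5},
    "de": {"ch": 0.9, "sch": 1.4, "der": 1.3, "und": 1.2, "ei": 0.6},
}
MAX_N = 3  # assumed length of the longest n-gram in any profile


def identify_language(text, profiles=PROFILES, max_n=MAX_N):
    """Return (best_language, scores) using per-window longest-match scoring."""
    scores = defaultdict(float)
    text = text.lower()
    # Shift the window one letter at a time through the sample.
    for start in range(len(text)):
        window = text[start:start + max_n]
        for lang, profile in profiles.items():
            # Try the longest candidate first and keep only the first hit,
            # so each language scores at most one match per window.
            for n in range(len(window), 0, -1):
                weight = profile.get(window[:n])
                if weight is not None:
                    scores[lang] += weight
                    break
    if not scores:
        return None, {}
    # The language with the highest cumulative score wins.
    best = max(scores, key=scores.get)
    return best, dict(scores)


if __name__ == "__main__":
    print(identify_language("the thing and the king")[0])
    print(identify_language("der Schuh und der Schal")[0])
```

Keeping only the longest match per window avoids counting a short sequence such as "th" a second time when the longer "the" has already matched at the same position.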
