Identifying a property of a document

Authors: Stewart Yang, Xin Liu

DOI:

Keywords:

Abstract: Methods, systems, and apparatus, including computer program products, for identifying properties of an electronic document. In one aspect, a sequence of bytes representing text in a document is received. A plurality of byte-n-grams are identified from the sequence of bytes. For multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the encodings is identified. A respective encoding score for each of the encodings is determined. A most likely encoding is identified based on the highest encoding score among the encoding scores. In another aspect, a sequence of characters, having an encoding, is received. The sequence of characters is segmented into features, each feature corresponding to two or more characters. The languages of the text are determined from the features and a language model.
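
The two aspects in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the likelihoods or scores are computed, so everything specific here is an assumption made for illustration: the models are relative n-gram frequencies estimated from tiny toy samples, the score is a sum of log-likelihoods, unseen n-grams receive a fixed floor probability, and features are overlapping character bigrams. All function and variable names (`ngrams`, `train_model`, `score`, `most_likely`) are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Return all overlapping n-grams of a bytes or str sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def train_model(sample, n):
    """Estimate n-gram probabilities as relative frequencies in a sample.

    Assumption: a real system would use large corpora and smoothing.
    """
    counts = Counter(ngrams(sample, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def score(seq, model, n, floor=1e-6):
    """Sum the log-likelihoods of the n-grams in seq under one model.

    N-grams missing from the model get the assumed probability `floor`.
    """
    return sum(math.log(model.get(gram, floor)) for gram in ngrams(seq, n))


def most_likely(seq, models, n):
    """Score seq against every model and return (best label, all scores)."""
    scores = {label: score(seq, model, n) for label, model in models.items()}
    return max(scores, key=scores.get), scores


# Aspect 1: pick the most likely encoding of a byte sequence.
# One byte-bigram model per candidate encoding, trained on a toy sample.
TRAINING_TEXT = "日本語のテキストをバイト列として符号化します。"
encoding_models = {
    enc: train_model(TRAINING_TEXT.encode(enc), 2)
    for enc in ("utf-8", "shift_jis", "euc_jp")
}
unknown_bytes = "テキストを符号化".encode("shift_jis")
detected_encoding, encoding_scores = most_likely(unknown_bytes, encoding_models, 2)

# Aspect 2: segment decoded characters into two-character features
# and score them against one character-bigram model per language.
language_models = {
    "english": train_model(
        "the quick brown fox jumps over the lazy dog and then the cat", 2),
    "german": train_model(
        "der schnelle braune fuchs springt über den faulen hund und die katze", 2),
}
detected_language, language_scores = most_likely(
    "the dog and the fox", language_models, 2)

print(detected_encoding, detected_language)
```

Summing log-likelihoods rather than multiplying raw probabilities avoids numeric underflow on long inputs, and the floor keeps a single unseen n-gram from driving a score to negative infinity. Both are common choices, not ones stated in the abstract.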
