Character-based Language Model

Author: Vít Baisa

DOI:

Keywords: suffix array, LCP, trie, character-based, random text generator, corpus

Abstract: Language modelling, and other natural language processing tasks, are usually based on words. I present here a more general yet simpler approach using much smaller units of text data: a character-based language model (CBLM). In this paper I describe the underlying data structure of the model and evaluate it with standard measures (entropy, perplexity). As a proof of concept, an extrinsic evaluation is carried out with a random sentence generator based on this model.

Keywords: suffix array, LCP, trie, character-based, random text generator, corpus

Introduction: Current approaches rely almost entirely on words. To work with words, the input needs to be tokenized, which can be quite tricky for some languages. Tokenization causes errors which are propagated to the following processing steps. But even if tokenization were 100% reliable, another problem emerges: word-based models treat similar words as completely unrelated. Consider the two words platypus and platypuses. The former is contained in the latter, yet they will be treated completely independently. This issue can be partially sorted out by using factored models [1], where lemmas and morphological information (here singular vs. plural number of the same lemma) are treated simultaneously with word forms. In most systems, n-grams (usually 3–4) are modelled by a Markov chain of the corresponding order, so only a finite, fixed number of previous words is taken into account. I propose a model that tackles the above-mentioned problems. Tokenization is removed from the model-building process, since the model uses sequences of characters (or bytes) as data. Words (byte sequences) which share a prefix (bytes) are stored in one place.
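The abstract mentions evaluating the model with the standard measures of entropy and perplexity. For a character-based model these are naturally computed per character; the formulas below are the usual definitions, not taken verbatim from the paper:

```latex
H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p(c_i \mid c_1, \dots, c_{i-1}),
\qquad
\mathrm{PPL} = 2^{H}
```

where c_1, ..., c_N are the characters of the test text and p is the probability the model assigns to each character given its preceding context.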
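The introduction describes building the model from raw character (or byte) sequences, with sequences that share a prefix stored in one place, and using it to generate random sentences. As a rough illustration of the idea only (not the paper's actual suffix-array/trie implementation), here is a minimal character n-gram model in Python; the class name `CharNgramModel` and the flat dict-of-dicts storage are my own assumptions, standing in for the compact prefix-sharing structure the paper builds:

```python
import random
from collections import defaultdict


class CharNgramModel:
    """Toy character n-gram model (illustrative stand-in for CBLM).

    Counts which characters follow each (order-1)-character context.
    A real implementation would share prefixes via a trie or suffix
    array; a plain dict keeps the sketch short.
    """

    def __init__(self, order=4):
        self.order = order
        # context string -> {next character -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        pad = "\x02" * (self.order - 1)  # start-of-text padding
        text = pad + text
        for i in range(len(text) - self.order + 1):
            ctx = text[i:i + self.order - 1]
            nxt = text[i + self.order - 1]
            self.counts[ctx][nxt] += 1

    def generate(self, length=60, seed=0):
        """Sample characters one at a time from the counted distributions."""
        rng = random.Random(seed)
        ctx = "\x02" * (self.order - 1)
        out = []
        for _ in range(length):
            dist = self.counts.get(ctx)
            if not dist:
                break  # unseen context: stop generating
            chars, weights = zip(*dist.items())
            c = rng.choices(chars, weights=weights)[0]
            out.append(c)
            ctx = ctx[1:] + c  # slide the context window
        return "".join(out)


# Tiny usage demo: train on a short string, then sample from it.
model = CharNgramModel(order=2)
model.train("abab")
sample = model.generate(5)
```

Because no tokenization step is involved, the same code works unchanged on any language or byte stream, which is the key practical point the introduction makes.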

References (8)
R. Iyer, M. Ostendorf, M. Meteer, Analyzing and predicting language model improvements. IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 254–261 (1997), 10.1109/ASRU.1997.659013
C. E. Shannon, Warren Weaver, A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379–423 (1948), 10.1002/J.1538-7305.1948.TB01338.X
Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraçlar, Andreas Stolcke, Morph-based speech recognition and modeling of out-of-vocabulary words across languages. ACM Transactions on Speech and Language Processing, vol. 5, pp. 1–29 (2007), 10.1145/1322391.1322394
Jeff A. Bilmes, Katrin Kirchhoff, Factored language models and generalized parallel backoff. Proceedings of HLT-NAACL 2003, short papers (NAACL '03), pp. 4–6 (2003), 10.3115/1073483.1073485
T. R. Niesler, P. C. Woodland, A variable-length category-based n-gram language model. International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 164–167 (1996), 10.1109/ICASSP.1996.540316
Vincent J. Della Pietra, Jennifer C. Lai, Stephen A. Della Pietra, Robert L. Mercer, Peter F. Brown, An estimate of an upper bound for the entropy of English. Computational Linguistics, vol. 18, pp. 31–40 (1992), 10.5555/146680.146685
Thorsten Brants, Tomas Mikolov, Tony Robinson, Ciprian Chelba, Qi Ge, Phillipp Koehn, Mike Schuster, One billion word benchmark for measuring progress in statistical language modeling. Conference of the International Speech Communication Association, pp. 2635–2639 (2014)
Vít Suchomel, Recent Czech Web Corpora. RASLAN, pp. 77–83 (2012)