Abstract. Language modelling and other natural language processing tasks are usually based on words. I present here a more general yet simpler approach using much smaller units of text data: a character-based language model (CBLM). In this paper I describe the underlying data structure of the model and evaluate it with standard measures (entropy, perplexity). As a proof of concept and an extrinsic evaluation, a random sentence generator based on this model is presented.

Keywords: suffix array, LCP, trie, character-based language model, random text generator, corpus

Introduction

Current approaches are almost utterly based on words. To work with words, the input needs to be tokenized, which might be quite tricky for some languages. Tokenization causes errors which are propagated to the following processing steps. But even if tokenization were 100% reliable, another problem emerges: word-based models treat similar words as completely unrelated. Consider the two words platypus and platypuses. The former is contained in the latter, yet they will be treated completely independently. This issue can be partially sorted out by using factored models [1], where lemmas and morphological information (here singular vs. plural number of the same lemma) are treated simultaneously with word forms. In most systems, n-grams (usually 3–4) are used in a Markov chain of the corresponding order: only a finite, fixed number of previous words is taken into account. I propose a model which tackles the above-mentioned problems. Tokenization is removed from the process of model building, since the model uses sequences of characters (or bytes) as data. Words (byte sequences) which share a prefix are stored in one place.
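To illustrate what it means for byte sequences sharing a prefix to be stored in one place, here is a minimal character-trie sketch. This is only an illustration under assumptions: the paper's actual structure builds on a suffix array with LCP information (see the keywords), and the TrieNode and insert names below are hypothetical, not taken from the paper.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # one edge per next character (byte)
        self.count = 0      # how many stored sequences pass through this node

def insert(root, word):
    """Store word character by character; shared prefixes reuse existing nodes."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    return node

root = TrieNode()
insert(root, "platypus")
insert(root, "platypuses")  # reuses all eight nodes of "platypus"

# The common prefix "platypus" is stored only once:
node = root
for ch in "platypus":
    node = node.children[ch]
print(node.count)  # prints 2: both words share this path
```

In such a structure the platypus/platypuses problem from above disappears at the storage level: the shared prefix is a single path, so statistics collected on it are naturally shared by both word forms.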