Author: A. A. Krizhanovsky
DOI:
Keywords: Database index, Computer science, Inverted index, Parsing, Lexicon, Markup language, Lemmatisation, Database design, Information retrieval, Index (publishing), Search engine indexing, German
Abstract: With the fantastic growth of Internet usage, information search in documents of a special type called a "wiki page", written in a simple markup language, has become an important problem. This paper describes a software architectural model for indexing wiki texts in three languages (Russian, English, and German) and the interaction between the software components (GATE, Lemmatizer, and Synarcher). The inverted file index database was designed using the visual tool DBDesigner. The rules for parsing Wikipedia texts are illustrated by examples. Two index databases, of the Russian Wikipedia (RW) and the Simple English Wikipedia (SEW), were built and compared. The size of RW is an order of magnitude higher than that of SEW (in number of words and lexemes), though the growth rate of the number of pages in SEW was found to be 14% higher than in the Russian Wikipedia, and the rate of acquisition of new words in the SEW lexicon was 7% higher, during a period of five months (from September 2007 to February 2008). Zipf's law was tested with both Wikipedias. The entire source code of the indexing software and the generated index databases are freely available under the GPL (GNU General Public License).