Index wiki database: design and experiments

Author: A. A. Krizhanovsky

DOI:

Keywords: Database index, Computer science, Inverted index, Parsing, Lexicon, Markup language, Lemmatisation, Database design, Information retrieval, Index (publishing), Search engine indexing, German

Abstract: With the fantastic growth of Internet usage, information search in documents of a special type called a "wiki page", written using a simple markup language, has become an important problem. This paper describes a software architectural model for indexing wiki texts in three languages (Russian, English, and German) and the interaction between the software components (GATE, Lemmatizer, and Synarcher). The inverted file index database was designed using the visual tool DBDesigner. The rules for parsing Wikipedia texts are illustrated by examples. Two index databases, of Russian Wikipedia (RW) and Simple English Wikipedia (SEW), were built and compared. The size of RW is an order of magnitude higher than that of SEW (number of words, lexemes), though the growth rate of the number of pages in SEW was found to be 14% higher than in Russian, and the rate of acquisition of new words in the SEW lexicon was 7% higher during a period of five months (from September 2007 to February 2008). Zipf's law was tested with both the Russian and Simple English Wikipedias. The entire source code of the indexing software and the generated index databases are freely available under the GPL (GNU General Public License).
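
The pipeline summarized in the abstract (strip wiki markup, lemmatize, store an inverted file index, then inspect word frequencies for Zipf's law) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the two markup rules, and the lowercase-only `lemmatize` stand-in are hypothetical and are not the paper's parsing rules, its GATE/Lemmatizer components, or its database schema.

```python
from collections import Counter, defaultdict
import re


def strip_wiki_markup(text):
    """Very rough wiki-markup cleanup (hypothetical rules, not the paper's)."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop the quote runs used for bold and italic text.
    text = re.sub(r"'{2,}", "", text)
    return text


def lemmatize(word):
    """Stand-in for a real lemmatizer (assumption): lowercases only."""
    return word.lower()


def build_index(pages):
    """Map each lemma to {page title: number of occurrences on that page}."""
    index = defaultdict(Counter)
    for title, text in pages.items():
        for word in re.findall(r"\w+", strip_wiki_markup(text)):
            index[lemmatize(word)][title] += 1
    return index


def rank_frequency(index):
    """Return (rank, lemma, corpus frequency) tuples, most frequent first."""
    totals = Counter({lemma: sum(c.values()) for lemma, c in index.items()})
    return [(r, lemma, f) for r, (lemma, f) in enumerate(totals.most_common(), 1)]


# Toy corpus (made up for this example).
pages = {
    "Cat": "The '''cat''' is a [[mammal]]. The cat hunts [[mouse|mice]].",
    "Dog": "The '''dog''' is a [[mammal]] too.",
}
index = build_index(pages)
```

Looking up `index["mammal"]` returns the pages that contain the lemma together with per-page counts, which is the core query an inverted index answers. If Zipf's law holds for a corpus, the frequencies returned by `rank_frequency` fall off roughly in proportion to 1/rank, so a log-log plot of frequency against rank is close to a straight line; the toy corpus above is far too small to show this.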
