YAWN: A Semantically Annotated Wikipedia XML Corpus

Authors: Gjergji Kasneci, Fabian M. Suchanek, Ralf Schenkel

DOI:

Keywords:

Abstract: The paper presents YAWN, a system to convert the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms to annotate pages and links with concepts from the WordNet thesaurus. This annotation process exploits categorical information in Wikipedia, which is a high-quality, manually assigned source of information, extracts additional information from lists, and utilizes the invocations of templates with named parameters. We give examples of how such annotations can be exploited for high-precision queries.
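The abstract mentions using template invocations with named parameters as a source of tags. As a rough illustration only, the Python sketch below turns one flat template invocation into an XML element whose child tags are the parameter names. It is not YAWN's actual algorithm, which this record does not describe: the function names `template_to_xml` and `to_tag` are hypothetical, nested templates and links inside values are not handled, and no WordNet mapping is attempted.

```python
import re
import xml.etree.ElementTree as ET


def to_tag(label: str) -> str:
    """Normalize a template or parameter name into a simple XML tag name."""
    tag = re.sub(r"\W+", "_", label.strip().lower()).strip("_")
    # XML names may not start with a digit, so prefix such names (assumed convention).
    if not tag or not tag[0].isalpha():
        tag = "t_" + tag
    return tag


def template_to_xml(wikitext: str) -> ET.Element:
    """Convert one flat {{Name | key = value | ...}} invocation into an XML element.

    Assumption: the invocation contains no nested templates or piped links.
    """
    match = re.fullmatch(r"\s*\{\{(.*)\}\}\s*", wikitext, flags=re.DOTALL)
    if match is None:
        raise ValueError("not a template invocation")
    name, *params = [part.strip() for part in match.group(1).split("|")]
    root = ET.Element(to_tag(name))
    for param in params:
        key, sep, value = param.partition("=")
        if not sep:
            continue  # positional parameter: there is no name to use as a tag
        child = ET.SubElement(root, to_tag(key))
        child.text = value.strip()
    return root


if __name__ == "__main__":
    example = "{{Infobox Person | name = Albert Einstein | birth_place = Ulm }}"
    print(ET.tostring(template_to_xml(example), encoding="unicode"))
```

Named parameters become self-explaining element names here, which is the property the abstract highlights; a real converter would additionally need a full wikitext parser.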

References (23)
A. Souzis. Building a semantic wiki. IEEE Intelligent Systems, vol. 20, pp. 87-91 (2005). DOI: 10.1109/MIS.2005.83
Arvind Arasu, Hector Garcia-Molina. Extracting structured data from Web pages. International Conference on Management of Data, pp. 337-348 (2003). DOI: 10.1145/872757.872799
Ralf Schenkel, Anja Theobald, Gerhard Weikum. Semantic Similarity Search on Semistructured Data with the XXL Search Engine. Information Retrieval, vol. 8, pp. 521-545 (2005). DOI: 10.1007/S10791-005-0746-3
Eugene Agichtein. Scaling Information Extraction to Large Document Collections. IEEE Data(base) Engineering Bulletin, vol. 28, pp. 3-10 (2005)
Mounia Lalmas, Stefan M. Rüger, Anastasios Tombros, Theodora Tsikrika, Alexei Yavlinsky, Ralf Schenkel, Martin Theobald, Andy MacFarlane. Structural Feedback for Keyword-Based XML Retrieval. pp. 326-337 (2006)
Gerhard Weikum, Jens Graupmann, Ralf Schenkel. The SphereSearch engine for unified ranked retrieval of heterogeneous XML and web documents. Very Large Data Bases, pp. 529-540 (2005)
Pavel Brazdil, Gerhard Weikum, George Tsatsaronis, Michalis Vazirgiannis, Luís Torgo, Rui Camacho, Martin Theobald, Alípio Jorge, Gama Joao, Dimitrios Mavroeidis. Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification. pp. 181-192 (2005)
B. Fazzinga, S. Flesca, A. Tagarelli. Learning Robust Web Wrappers. Lecture Notes in Computer Science, pp. 736-745 (2005). DOI: 10.1007/11546924_72
Sihem Amer-Yahia, SungRan Cho, Divesh Srivastava. Tree Pattern Relaxation. Extending Database Technology, pp. 496-513 (2002). DOI: 10.1007/3-540-45876-X_32
Andrew Trotman, Börkur Sigurbjörnsson. Narrowed Extended XPath I (NEXI). INEX'04: Proceedings of the Third International Conference on Initiative for the Evaluation of XML Retrieval, pp. 16-40 (2004). DOI: 10.1007/11424550_2