Indexing Heterogeneous XML for Full-Text Search

Author: Miro Lehtonen

Abstract: XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content, ranging all the way from records of data fields to narrative full-texts, Information Retrieval methods are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from the data. As a result, we are able both to reduce the size of the index by 5-6% and to improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities that are not paid much attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text, including titles, captions, and references.
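The two ideas in the abstract, separating full-text from data by relating character content to tags and detecting emphasised inline phrases, can be illustrated with a minimal Python sketch. This is not the author's method: the ratio of text characters to elements, the `min_ratio` threshold of 20, the function names, and the sample document are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def text_chars(elem):
    """Count non-whitespace characters in all text beneath elem."""
    return sum(len("".join(t.split())) for t in elem.itertext())


def element_count(elem):
    """Count elem itself plus all of its descendant elements."""
    return sum(1 for _ in elem.iter())


def looks_like_full_text(elem, min_ratio=20.0):
    """Hypothetical heuristic: much text per tag suggests narrative content."""
    return text_chars(elem) / element_count(elem) >= min_ratio


def inline_phrases(elem):
    """Yield the text of elements that sit inside mixed content."""
    for parent in elem.iter():
        children = list(parent)
        # Mixed content: the parent has direct text next to its child elements.
        mixed = bool((parent.text or "").strip()) or any(
            (c.tail or "").strip() for c in children
        )
        if not mixed:
            continue
        for child in children:
            phrase = "".join(child.itertext()).strip()
            if phrase:
                yield phrase


# Assumed sample: one data-like record and one narrative paragraph.
SAMPLE = """<doc>
  <record><id>17</id><year>2006</year><lang>en</lang></record>
  <section><title>Indexing</title>
    <p>Authors often tag the content they want to <em>emphasise</em> with a typeface that stands out.</p>
  </section>
</doc>"""

root = ET.fromstring(SAMPLE)
record = root.find("record")
para = root.find("section/p")

print(looks_like_full_text(record))  # data-like fragment
print(looks_like_full_text(para))    # narrative fragment
print(list(inline_phrases(root)))    # candidate phrases for extra weight
```

In an indexer built along these lines, fragments that pass the ratio test would be sent to the full-text index, and the detected inline phrases would be stored with a higher term weight. A real system would also need to tell emphasis apart from other inline-level markup, which the abstract notes is easy to confuse.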
