ONDUX

Authors: Eli Cortez, Altigran S. da Silva, Marcos André Gonçalves, Edleno S. de Moura

DOI: 10.1145/1807167.1807254

Abstract: Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g. postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, we rely on very effective matching strategies instead of explicit learning strategies. The effectiveness of this matching strategy is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores the sequencing and positioning of attribute values, learned directly on demand from test data with no previous human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report on textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
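
The matching idea in the abstract, associating segments of an input string with attributes by comparing them against pre-existing data, can be illustrated with a small sketch. This is not the ONDUX algorithm: the knowledge base, the function names (`build_term_index`, `label_token`, `segment`), and the relative-frequency score are all assumptions made for illustration, and the reinforcement step based on sequencing and positioning is not modeled.

```python
from collections import Counter

# Hypothetical knowledge base: attribute name -> known values taken from
# pre-existing data. The contents are invented for this illustration.
KNOWLEDGE_BASE = {
    "street": ["Main Street", "Oak Avenue"],
    "city": ["Springfield", "Portland"],
    "state": ["Oregon", "Illinois"],
}


def build_term_index(kb):
    """Count how often each lowercased term occurs in each attribute's values."""
    index = {}
    for attr, values in kb.items():
        counts = Counter()
        for value in values:
            counts.update(value.lower().split())
        index[attr] = counts
    return index


def label_token(token, index):
    """Return the attribute whose known values use this term most often
    (relative frequency), or None when no attribute has seen the term."""
    best_attr, best_score = None, 0.0
    term = token.lower()
    for attr, counts in index.items():
        total = sum(counts.values())
        score = counts[term] / total if total else 0.0
        if score > best_score:
            best_attr, best_score = attr, score
    return best_attr


def segment(text, kb):
    """Label each token by matching, then merge adjacent tokens that
    received the same label into one segment."""
    index = build_term_index(kb)
    segments = []
    for token in text.replace(",", " ").split():
        attr = label_token(token, index)
        if segments and segments[-1][0] == attr:
            segments[-1][1].append(token)
        else:
            segments.append((attr, [token]))
    return [(attr, " ".join(tokens)) for attr, tokens in segments]


result = segment("12 Oak Avenue, Portland, Oregon", KNOWLEDGE_BASE)
for attr, value in result:
    print(attr, "->", value)
```

In this sketch the token `12` stays unlabeled because it never occurs in the toy knowledge base. Matching alone cannot place such a token; the abstract assigns the disambiguation of this kind of case to the reinforcement step, which this sketch leaves out.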
