Improving recall of regular expressions for information extraction

Authors: Karin Murthy, Deepak P., Prasad M. Deshpande

DOI: 10.1007/978-3-642-35063-4_33

Abstract: Learning or writing regular expressions to identify instances of a specific concept within text documents with high precision and recall is challenging. It is relatively easy to improve the precision of an initial regular expression by identifying the false positives it covers and tweaking the expression to avoid them. However, modifying the expression to improve recall is difficult, since false negatives can only be identified by manually analyzing all documents, in the absence of any tools to identify the missing instances. We focus on partially automating the discovery of missing instances by soliciting minimal user feedback. We present a technique to identify good generalizations of a regular expression that have improved recall while retaining high precision. We empirically demonstrate the effectiveness of the proposed technique as compared to existing methods, and show results for a variety of tasks such as identification of dates, phone numbers, product names, and course numbers on real-world datasets.
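
To make the precision/recall trade-off in the abstract concrete, the sketch below scores an initial expression and a hand-written generalization of it against a small labeled text. It is not the authors' technique: the toy text, the gold labels, the two patterns, and the helper `precision_recall` are all assumptions made for illustration, and the generalization is chosen by hand rather than discovered with user feedback.

```python
import re

def precision_recall(pattern, text, gold):
    """Return (precision, recall) of a regex's matches against gold strings."""
    found = {m.group(0) for m in re.finditer(pattern, text)}
    true_pos = len(found & gold)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Toy document and hand-labeled phone numbers (illustrative, not from the paper).
TEXT = "Call 555-1234 or 555 9876; fax 555.4321. Order 9991234 shipped."
GOLD = {"555-1234", "555 9876", "555.4321"}

# Initial expression: accepts only a hyphen between the digit groups.
INITIAL = r"\b\d{3}-\d{4}\b"
# Hypothetical generalization: also accepts a dot or a space as separator.
GENERALIZED = r"\b\d{3}[-. ]\d{4}\b"

for name, pattern in [("initial", INITIAL), ("generalized", GENERALIZED)]:
    p, r = precision_recall(pattern, TEXT, GOLD)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

On this toy input the initial expression misses two of the three labeled numbers, while the generalized one covers all three without matching the unlabeled order number. Finding such false negatives by hand in a large corpus is the costly step the abstract describes.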
