Web Text Corpus for Natural Language Processing

Authors: Vinci Liu, James R. Curran

Keywords: HTML, Web page, Computer science, Natural language processing, Information extraction, Site map, Question answering, Artificial intelligence, Text corpus, Information retrieval, Search engine indexing, Full text search

Abstract: Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For thesaurus extraction, it achieved similar overall results to a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.
