Web Text Corpus for Natural Language Processing

DOI:

Keywords: HTML, Web page, Computer science, Natural language processing, Information extraction, Site map, Question answering, Artificial intelligence, Text corpus, Information retrieval, Search engine indexing, Full text search

Abstract: Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For thesaurus extraction, it achieved similar overall results to a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.

PDF: aclanthology.org

References (17)

Martin Volk, Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In: Corpus Linguistics, Lancaster, 2001 (2001), 10.5167/UZH-20269

Andrew R. Golding, A Bayesian Hybrid Method for Context-sensitive Spelling Correction. Meeting of the Association for Computational Linguistics (1996)

Steve Lawrence, C. Lee Giles, Accessibility of information on the web. Nature, vol. 400, pp. 107-109 (1999), 10.1038/21987

Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, Marc Najork, On near-uniform URL sampling. The Web Conference, vol. 33, pp. 295-308 (2000), 10.1016/S1389-1286(00)00055-4

Mirella Lapata, Frank Keller, Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, vol. 2, pp. 3- (2005), 10.1145/1075389.1075392

C. L. A. Clarke, G. V. Cormack, M. Laszlo, T. R. Lynam, E. L. Terra, The impact of corpus size on question answering performance. Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '02), pp. 369-370 (2002), 10.1145/564376.564448

Natalia N. Modjeska, Katja Markert, Malvina Nissim, Using the web in machine learning for other-anaphora resolution. Proceedings of the 2003 conference on Empirical methods in natural language processing, pp. 176-183 (2003), 10.3115/1119355.1119378

Egidio Terra, C. L. A. Clarke, Frequency estimates for statistical word similarity measures. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), pp. 165-172 (2003), 10.3115/1073445.1073477

James Richard Curran, From Distributional to Semantic Similarity. University of Edinburgh, College of Science and Engineering, School of Informatics (2004)

Andrew R. Golding, Dan Roth, A Winnow-Based Approach to Context-Sensitive Spelling Correction. Machine Learning, vol. 34, pp. 107-130 (1999), 10.1023/A:1007545901558
