Authors: Vinci Liu, James R. Curran
DOI:
Keywords: HTML, Web page, Computer science, Natural language processing, Information extraction, Site map, Question answering, Artificial intelligence, Text corpus, Information retrieval, Search engine indexing, Full text search
Abstract: Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a corpus by downloading web pages, producing a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction, results using this corpus are better than those obtained via a search engine. For thesaurus extraction, it achieved overall results similar to a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger corpora.