Using web data for linguistic purposes

Authors:

DOI: 10.1163/9789401203791_003

Keywords:

Abstract: The world wide web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette 2003). A growing body of studies has shown that simple algorithms using web-based evidence are successful at many linguistic tasks, often outperforming sophisticated methods based on smaller but more controlled data sources (cf. Turney 2001; Keller and Lapata 2003). Most current internet-based linguistic studies access the web through a commercial search engine. For example, some researchers rely on frequency estimates (number of hits) reported by search engines (e.g. Turney 2001). Others use a search engine to find relevant pages, and then retrieve the pages to build a corpus (e.g. Ghani and Mladenic 2001; Baroni and Bernardini 2004). In this study, we first survey the state of the art, discussing the advantages and limits of various approaches, and in particular the inherent limitations of depending on a commercial search engine as a data source. We then focus on what we believe to be some of the core issues of doing web-based linguistics. Some of these issues concern the quality and nature of the data we can obtain from the internet (What languages, genres and styles are represented on the web?), others pertain to data extraction, encoding and preservation (How can we ensure data stability? How can web data be marked up and categorized? How can we identify duplicate pages and near duplicates?), and others yet concern quantitative aspects (Which statistical quantities can be reliably estimated from web data, and how much web data do we need? What are the possible pitfalls due to the massive presence of duplicates and mixed-language pages?). All points are illustrated through concrete examples from English, German and Italian web corpora.

References (15)
Silvia Bernardini, Marco Baroni. BootCaT: Bootstrapping corpora and terms from the web. Language Resources and Evaluation, pp. 1313-1316 (2004).
Corpora and language learners. John Benjamins Publishing Company (2004). DOI: 10.1075/SCL.17
Anke Lüdeling, Stefan Evert. The emergence of productive non-medical -itis. Humboldt-Universität zu Berlin, Philosophische Fakultät II (2004). DOI: 10.18452/13446
Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. European Conference on Machine Learning, pp. 491-502 (2001). DOI: 10.1007/3-540-44795-4_42
Rayid Ghani, Rosie Jones, Dunja Mladenić. Mining the web to create minority language corpora. Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM'01), pp. 279-286 (2001). DOI: 10.1145/502585.502633
Adam Kilgarriff, Gregory Grefenstette. Introduction to the special issue on the web as corpus. Computational Linguistics, vol. 29, pp. 333-347 (2003). DOI: 10.1162/089120103322711569
Serge Sharoff. Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics, vol. 11, pp. 435-462 (2006). DOI: 10.1075/IJCL.11.4.05SHA
Frank Keller, Mirella Lapata. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, vol. 29, pp. 459-484 (2003). DOI: 10.1162/089120103322711604