Using web data for linguistic purposes

DOI: 10.1163/9789401203791_003

Abstract: The world wide web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette 2003). A growing body of studies has shown that simple algorithms using web-based evidence are successful at many linguistic tasks, often outperforming sophisticated methods based on smaller but more controlled data sources (cf. Turney 2001; Keller and Lapata 2003). Most current internet-based linguistic studies access the web through a commercial search engine. For example, some researchers rely on frequency estimates (number of hits) reported by search engines (e.g. Turney 2001). Others use a search engine to find relevant pages, and then retrieve the pages to build a corpus (e.g. Ghani and Mladenic 2001; Baroni and Bernardini 2004). In this study, we first survey the state of the art, discussing the advantages and limits of various approaches, and in particular the inherent limitations of depending on a commercial search engine as a data source. We then focus on what we believe to be some of the core issues of doing web-based linguistics. Some of these issues concern the quality and nature of the data we can obtain from the internet (What languages, genres and styles are represented on the web?), others pertain to data extraction, encoding and preservation (How can we ensure data stability? How can web data be marked up and categorized? How can we identify duplicate pages and near duplicates?), and others yet concern quantitative aspects (Which statistical quantities can be reliably estimated from web data, and how much web data do we need? What are the possible pitfalls due to the massive presence of duplicates and mixed-language pages?). All points are illustrated through concrete examples from English, German and Italian web corpora.
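To make the near-duplicate question raised in the abstract concrete, here is a minimal sketch of one common approach: comparing pages by the overlap of their word n-grams ("shingles") using Jaccard similarity. This is an illustration only, not the method used in the study; the function names, the shingle size `n=3`, and the `threshold=0.8` cutoff are all assumptions chosen for the example.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(pages, threshold=0.8, n=3):
    """Return (name1, name2, score) for page pairs at or above the threshold.

    `pages` maps a page name to its plain text. All pairs are compared,
    which is quadratic and only suitable for small collections.
    """
    names = sorted(pages)
    sets = {name: shingles(pages[name], n) for name in names}
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            score = jaccard(sets[x], sets[y])
            if score >= threshold:
                pairs.append((x, y, score))
    return pairs
```

For a web-scale corpus, an all-pairs comparison like this would be too slow; it is meant only to show what "near duplicate" can mean operationally.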

References

Silvia Bernardini, Marco Baroni. BootCaT: Bootstrapping corpora and terms from the web. Language Resources and Evaluation, pp. 1313-1316 (2004).

Corpora and Language Learners. John Benjamins Publishing Company (2004). DOI: 10.1075/SCL.17

Anke Lüdeling, Stefan Evert. The emergence of productive non-medical -itis. Humboldt-Universität zu Berlin, Philosophische Fakultät II (2004). DOI: 10.18452/13446

Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. European Conference on Machine Learning, pp. 491-502 (2001). DOI: 10.1007/3-540-44795-4_42

Rayid Ghani, Rosie Jones, Dunja Mladenić. Mining the web to create minority language corpora. Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM'01), pp. 279-286 (2001). DOI: 10.1145/502585.502633

Adam Kilgarriff, Gregory Grefenstette. Introduction to the special issue on the web as corpus. Computational Linguistics, vol. 29, pp. 333-347 (2003). DOI: 10.1162/089120103322711569

Serge Sharoff. Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics, vol. 11, pp. 435-462 (2006). DOI: 10.1075/IJCL.11.4.05SHA

W. Detmar Meurers. On the use of electronic corpora for theoretical linguistics: Case studies from the syntax of German. Lingua, vol. 115, pp. 1619-1639 (2005). DOI: 10.1016/J.LINGUA.2004.07.007

Andy Way, Nano Gough. wEBMT: Developing and validating an example-based machine translation system using the world wide web. Computational Linguistics, vol. 29, pp. 421-457 (2003). DOI: 10.1162/089120103322711596

Frank Keller, Mirella Lapata. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, vol. 29, pp. 459-484 (2003). DOI: 10.1162/089120103322711604
