Abstract: The world wide web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette 2003). A growing body of studies has shown that simple algorithms using web-based evidence are successful at many linguistic tasks, often outperforming sophisticated methods based on smaller but more controlled sources (cf. Turney 2001; Keller and Lapata). Most current internet-based studies access the web through a commercial search engine. For example, some researchers rely on frequency estimates (number of hits) reported by search engines (e.g. 2001). Others use a search engine to find relevant pages, then retrieve the pages to build a corpus (Ghani and Mladenic; Baroni and Bernardini 2004). In this study, we first survey the state of the art, discussing the advantages and limits of various approaches, in particular the inherent limitations of depending on a commercial search engine as a data source. We focus on what we believe to be the core issues for doing linguistics with web data. Some of these concern the quality and nature of the data we can obtain from the internet (What languages, genres and styles are represented on the web?), others pertain to data extraction, encoding and preservation (How can we ensure data stability? How can web data be marked up and categorized? How can we identify duplicate pages and near duplicates?), and yet others concern quantitative aspects (Which statistical quantities can be reliably estimated from web data, and how much data do we need? What are the possible pitfalls due to the massive presence of duplicates and mixed-language pages?). All points are illustrated with concrete examples from English, German and Italian corpora.