Crowdsourcing corpus cleaning for language learning resource development

Authors: Tanara Zingano Kuhn, Peter Dekker, Branislava Šandrih, Rina Zviel-Girshin, Špela Arhar

Abstract: Web corpora are valuable sources for the development of language learning material. However, the data may contain inappropriate or even offensive language, thus requiring data checking and filtering before pedagogical use. One automatically created language learning resource based on pre-checked corpora is Sketch Engine for Language Learning (SkELL; https://skell.sketchengine.co.uk/), where texts containing so-called PARSNIPs (Politics-Alcohol-Racism-Sex-Narcotics-Isms-Pork) (Kilgarriff et al., 2014) have been excluded from the corpora. In previous projects, filtering of such sensitive words (and, by extension, content) was done automatically using predefined seedwords. However, this approach removes a great deal of data in a somewhat uncontrolled way, while many inappropriate sentences still remain unidentified. We propose a crowdsourcing approach to clean up corpora for developing SkELL for Portuguese, Dutch and Serbian. Here, we use samples of web corpora from the Sketch Engine corpus management system (Kilgarriff et al., 2004) as a basis, but our approach could be applied to any web corpus. We present sentences to a crowd, consisting of native speakers of those languages, through the PYBOSSA (https://pybossa.com) platform. The sentences are selected from a sample corpus and consist of potentially good and "bad" (inappropriate) sentences. The inappropriate sentences are included as ground truth for analysis. A feature from the Sketch Engine that we use in our approach is the Good Dictionary Examples (GDEX) function (Kilgarriff et al., 2008), which ranks the concordances in the …
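The seedword-based filtering that the abstract criticizes can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the seedword list and sentences here are invented placeholders, and the blunt drop-the-whole-sentence behavior is exactly what makes the approach "uncontrolled": any match discards the sentence, regardless of context.

```python
import re

# Placeholder seedwords standing in for a PARSNIP-style sensitive-word list.
SEEDWORDS = {"alcohol", "narcotics"}

def keep_sentence(sentence: str) -> bool:
    """Return True if the sentence contains none of the seedwords."""
    tokens = {t.lower() for t in re.findall(r"\w+", sentence)}
    return tokens.isdisjoint(SEEDWORDS)

corpus = [
    "The museum opens at nine every morning.",
    "He gave up alcohol for the whole month.",  # dropped, though arguably harmless
]
cleaned = [s for s in corpus if keep_sentence(s)]
```

Note how the second sentence is removed even though its context is innocuous, while a genuinely offensive sentence that happens to avoid every seedword would pass through, which motivates the human-in-the-loop crowdsourcing alternative the abstract proposes.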
