The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia

Author: Oliver Ferschke

DOI:

Keywords:

Abstract: Over the past decade, the paradigm of the World Wide Web has shifted from static web pages towards participatory and collaborative content production. The main properties of this user generated content are a low publication threshold and little or no editorial control. While this has improved the variety and timeliness of the available information, it causes an even higher variance in quality than the already heterogeneous traditional content. Wikipedia is a prime example of a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative creation paradigm. Even though recent studies have confirmed that the overall quality of Wikipedia is high, there is still a wide gap that must be bridged before Wikipedia reaches the state of a reliable, citable source. A key prerequisite for reaching this goal is a quality management strategy that can cope both with the massive scale of Wikipedia and with its almost anarchic nature. This includes an efficient communication platform for work coordination among the collaborators as well as techniques for monitoring quality problems across the encyclopedia. This dissertation shows how natural language processing approaches can be used to assist information quality management on this scale. In the first part of the thesis, we establish the theoretical foundations of our work. We introduce the relatively new concept of open online collaboration with a particular focus on collaborative writing, and proceed with a detailed discussion of the role of Wikipedia as an encyclopedia, a community, a platform, and a resource for language technology applications. We then present the three main contributions of this thesis. Even though there have been previous attempts to adapt existing information quality frameworks to Wikipedia, no model has yet incorporated writing quality as a central factor. Since Wikipedia is not only a repository of mere facts but rather consists of full text articles, the quality of writing in these articles has to be taken into consideration when judging article quality. As the first contribution, we therefore define a comprehensive article quality model that aims to consolidate the quality criteria defined in multiple Wikipedia guidelines and policies into a single model. The model comprises 23 quality dimensions segmented into the four layers of intrinsic quality, contextual quality, writing quality, and organizational quality. As the second contribution, we present an approach for automatically identifying quality flaws in Wikipedia articles.
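The layered structure of the quality model described above can be sketched in a few lines of code. This is a minimal illustration only: the four layer names come from the abstract, while the dimension names and scores used below are invented placeholders, not the model's actual 23 dimensions.

```python
from dataclasses import dataclass

@dataclass
class QualityDimension:
    name: str    # illustrative placeholder, not an actual model dimension
    layer: str   # one of the four layers named in the quality model
    score: float # hypothetical normalized rating in [0, 1]

# The four layers of the quality model, as named in the abstract.
LAYERS = ("intrinsic", "contextual", "writing", "organizational")

def layer_scores(dimensions):
    """Average the per-dimension scores within each layer.

    Layers with no rated dimensions map to None.
    """
    buckets = {layer: [] for layer in LAYERS}
    for dim in dimensions:
        buckets[dim.layer].append(dim.score)
    return {layer: (sum(v) / len(v) if v else None)
            for layer, v in buckets.items()}

# Illustrative usage with made-up dimensions and ratings:
dims = [
    QualityDimension("accuracy", "intrinsic", 0.9),
    QualityDimension("neutrality", "intrinsic", 0.7),
    QualityDimension("readability", "writing", 0.6),
]
print(layer_scores(dims))
```

Aggregating per layer rather than over all 23 dimensions at once keeps the four quality perspectives separable when judging an article.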
Even though the general idea of quality flaw detection has been introduced in previous work, we dissect the task and find that it is inherently prone to a topic bias, which results in unrealistically high cross-validated evaluation scores that do not reflect a classifier's real performance on real world data. We solve this problem with a novel data sampling approach based on the article revision history that is able to avoid this bias. It furthermore allows us not only to identify flawed articles but also reliable counterexamples that do not exhibit the respective flaws. For detecting flaws in unseen articles, we present FlawFinder, a modular system for supervised classification. We evaluate it on a corpus of articles with neutrality and style flaws and confirm our initial hypothesis that unbiased classifiers tend to score lower than biased ones, while their scores more closely resemble their actual performance in the wild. As the third contribution, we present an approach for segmenting and tagging the user contributions on article Talk pages in order to improve the work coordination among Wikipedians. These unstructured discussion pages are not easy to navigate, and valuable information is likely to get lost over time in the discussion archives. By identifying the quality problems that have been discussed and the solutions that have been proposed, we can help users make informed decisions in the future. Our contribution in this area is threefold: (i) We describe an algorithm for segmenting the dialog on Talk pages using the revision history. In contrast to related work, which mainly relies on the rudimentary page markup, it can reliably extract meta data, such as the identity of a user, and can moreover handle discontinuous turns. (ii) We introduce a scheme for annotating the turns in article discussions with dialog act labels that capture the coordination efforts of article improvement. The labels reflect the types of criticism expressed in a turn, such as missing information or inappropriate language, as well as any actions proposed for solving the problems. (iii) Based on this scheme, we create two manually annotated corpora of discussions extracted from the Simple English Wikipedia (SEWD) and the English Wikipedia (EWD). We evaluate how well classification approaches can learn to assign the dialog act labels and achieve a cross-validated performance of F1 = 0.82 on the SEWD corpus, while we obtain an average performance of F1 = 0.78 on the larger and more complex EWD corpus.
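The revision-history-based sampling idea described above can be illustrated with a small sketch: if a cleanup-tagged revision of an article serves as a flawed example and a later, untagged revision of the same article serves as its counterexample, both classes share the same topic, so a classifier cannot succeed by learning topic vocabulary alone. The record format and helper below are hypothetical simplifications for illustration, not the thesis's actual pipeline.

```python
# Hypothetical revision records for ONE article:
# (revision_id, has_flaw_template, text).
# A revision carrying a cleanup template is treated as a flawed example;
# a later revision of the SAME article where the template was removed is
# taken as a reliable counterexample, keeping topics balanced across classes.

def sample_pairs(history):
    """Return (flawed_text, fixed_text) pairs from one article's history."""
    pairs = []
    flagged = None
    for rev_id, has_flaw, text in history:
        if has_flaw and flagged is None:
            flagged = text                  # first revision tagged with the flaw
        elif flagged is not None and not has_flaw:
            pairs.append((flagged, text))   # template removed -> counterexample
            flagged = None
    return pairs

# Illustrative usage with a made-up four-revision history:
history = [
    (1, False, "stub"),
    (2, True,  "biased wording ..."),
    (3, True,  "biased wording, minor edit"),
    (4, False, "neutral rewrite"),
]
print(sample_pairs(history))  # one (flawed, fixed) pair
```

Because each positive and negative example comes from the same article, an evaluation on such pairs measures sensitivity to the flaw itself rather than to topic cues, which is why the unbiased scores reported above are lower but more realistic.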
