Abstract: With the emergence of search engines and crowdsourcing websites, machine learning practitioners are faced with datasets that are labeled by a large, heterogeneous set of teachers. These datasets test the limits of our existing theory, which largely assumes that data is sampled i.i.d. from a fixed distribution. In many cases, the number of teachers actually scales with the number of examples, with each teacher providing just a handful of labels, precluding any statistically reliable assessment of an individual teacher's quality. In this paper, we study the problem of pruning low-quality teachers in a crowd, in order to improve the label quality of the training set. Despite the hurdles mentioned above, we show that this is in fact achievable with a simple and efficient algorithm, which does not require that each example be repeatedly labeled by multiple teachers. We provide a theoretical analysis of our algorithm and back our findings with empirical evidence.