Authors: Nicolas Papernot, Mohit Iyyer, Ankur P. Parikh, Gaurav Singh Tomar, Kalpesh Krishna
DOI:
Keywords:
Abstract: We study the problem of model extraction in natural language processing, in which an adversary with only query access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary and the victim fine-tune a large pretrained language model such as BERT (Devlin et al., 2019), we show that the adversary does not need any real training data to successfully mount the attack. In fact, the attacker need not even use grammatical or semantically meaningful queries: random sequences of words coupled with task-specific heuristics form effective queries for extraction on a diverse set of NLP tasks, including natural language inference and question answering. Our work thus highlights an exploit made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim. Finally, we study two defense strategies against extraction, membership classification and API watermarking, which, while successful against some adversaries, can also be circumvented by more clever ones.
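To make the query strategy described in the abstract concrete, here is a minimal sketch of how nonsensical random-word queries with a task-specific heuristic might be constructed for a question-answering victim API. All names here (the vocabulary, make_qa_query, victim_api) are illustrative assumptions, not the authors' actual implementation; the paper samples words from a real corpus vocabulary rather than the toy list used below.

```python
import random

# Hypothetical toy vocabulary; in practice one would sample from a large
# wordlist (e.g., tokens of a public corpus). This list is an assumption.
VOCAB = ["sesame", "street", "model", "guitar", "river", "quickly",
         "seven", "because", "window", "theory"]

def random_sequence(length: int) -> list[str]:
    """Sample a (likely ungrammatical) sequence of random words."""
    return random.choices(VOCAB, k=length)

def make_qa_query(paragraph_len: int = 40, question_len: int = 8) -> dict:
    """Task-specific heuristic for QA: pair a random 'paragraph' with a
    random 'question' that starts with a wh-word and ends with '?'."""
    paragraph = " ".join(random_sequence(paragraph_len))
    wh_word = random.choice(["what", "who", "when", "where", "why", "how"])
    question = " ".join([wh_word] + random_sequence(question_len)) + "?"
    return {"paragraph": paragraph, "question": question}

if __name__ == "__main__":
    # Extraction loop sketch: send queries to the victim, record its
    # answers, then fine-tune a local BERT copy on the (query, answer)
    # pairs. The victim_api call below is a hypothetical placeholder.
    extraction_dataset = []
    for _ in range(5):
        q = make_qa_query()
        print(q["question"])
        # answer = victim_api(q)               # hypothetical API call
        # extraction_dataset.append((q, answer))
```

The key point the sketch illustrates is that the queries need no semantic content: the attacker relies on the victim's outputs, not on meaningful inputs, to supervise the local copy.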