One billion word benchmark for measuring progress in statistical language modeling.

Authors: Thorsten Brants, Tomas Mikolov, Tony Robinson, Ciprian Chelba, Qi Ge

DOI:

Keywords:

Abstract: We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show the performance of several well-known types of language models, with the best results achieved by a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves a perplexity of 67.6; a combination of techniques leads to a 35% reduction in perplexity, or a 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available the log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
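As a rough sanity check on the figures quoted above (this arithmetic is an illustration, not part of the paper's abstract), cross-entropy in bits is the base-2 logarithm of perplexity, so a 35% perplexity reduction corresponds to roughly a 10% cross-entropy reduction:

\[
H = \log_2(\mathrm{PPL}), \qquad
H_{\text{baseline}} = \log_2 67.6 \approx 6.08 \ \text{bits}, \qquad
H_{\text{combined}} \approx \log_2(0.65 \times 67.6) \approx \log_2 43.9 \approx 5.46 \ \text{bits},
\]

a drop of about 0.62 bits, i.e. roughly 10% relative, consistent with the 35% perplexity reduction over the Kneser-Ney 5-gram baseline.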
