Authors: Isabel Moreno-Sánchez, Francesc Font-Clos, Álvaro Corral
DOI: 10.1371/JOURNAL.PONE.0147073
Keywords: Computer science, Artificial intelligence, Zipf's law, Point (typography), Natural language processing, Cumulative distribution function, Statistical significance, Probability distribution, Probability density function, Random variable, Monte Carlo method
Abstract: Despite being a paradigm of quantitative linguistics, Zipf’s law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So, we can summarize the current support of Zipf’s law in texts as anecdotal. We try to solve these issues by studying three different versions of Zipf’s law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf’s law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value), and with only one free parameter (the exponent).
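
To make the idea of a pure power law in the complementary cumulative distribution function (CCDF) of word frequencies concrete, here is a minimal Python sketch. It is not the authors' code. It assumes the discrete model S(n) = P(frequency >= n) = n^(-alpha) for n = 1, 2, ..., which gives P(n) = n^(-alpha) - (n + 1)^(-alpha), and it estimates the single free exponent by maximum likelihood with a golden-section search. The tokenizer, the function names, the search interval for alpha, and the assumption that the likelihood has a single maximum inside that interval are all illustrative choices. The goodness-of-fit test mentioned in the abstract is not implemented.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return the absolute frequency of each distinct word in the text.

    The tokenizer (lowercase runs of letters and apostrophes) is a crude
    illustrative choice.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return list(Counter(words).values())


def empirical_ccdf(freqs):
    """Map each observed frequency n to the fraction of word types with frequency >= n."""
    total = len(freqs)
    counts = Counter(freqs)
    ccdf = {}
    remaining = total
    for n in sorted(counts):
        ccdf[n] = remaining / total
        remaining -= counts[n]
    return ccdf


def fit_exponent(freqs, lo=0.01, hi=5.0, tol=1e-6):
    """Maximum-likelihood estimate of alpha for the assumed model S(n) = n**-alpha.

    Uses golden-section search on the negative log-likelihood, assuming it
    has a single minimum in [lo, hi] (hypothetical default bounds).
    """
    counts = Counter(freqs)

    def neg_log_likelihood(alpha):
        return -sum(
            c * math.log(n ** -alpha - (n + 1) ** -alpha)
            for n, c in counts.items()
        )

    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if neg_log_likelihood(c) < neg_log_likelihood(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2
```

For a text held in a string `text`, `fit_exponent(word_frequencies(text))` returns the estimated exponent, and `empirical_ccdf(word_frequencies(text))` gives the empirical values to compare against n^(-alpha). Deciding whether the fit is acceptable would additionally need a goodness-of-fit test, for example a Monte Carlo procedure as the keywords suggest, which this sketch leaves out.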