Large-Scale Analysis of Zipf’s Law in English Texts

Authors: Isabel Moreno-Sánchez, Francesc Font-Clos, Álvaro Corral

DOI: 10.1371/JOURNAL.PONE.0147073

Keywords: Computer science, Artificial intelligence, Zipf's law, Point (typography), Natural language processing, Cumulative distribution function, Statistical significance, Probability distribution, Probability density function, Random variable, Monte Carlo method

Abstract: Despite being a paradigm of quantitative linguistics, Zipf’s law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So, we can summarize the current support of Zipf’s law in texts as anecdotal. We try to solve these issues by studying three different versions of Zipf’s law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf’s law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value), and with only one free parameter (the exponent).
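As an illustration of the kind of fit the abstract describes, the following is a minimal sketch, not the authors' procedure. It assumes the complementary cumulative distribution of word frequencies has the form S(n) = n^(-gamma) for n = 1, 2, ..., so that the probability of frequency n is S(n) - S(n + 1), and it estimates the single exponent by maximum likelihood. The parametrization, the function names (`word_frequencies`, `log_likelihood`, `fit_exponent`), the tokenizer, and the search interval are all assumptions made for this example; the paper's exponent convention may differ, and the goodness-of-fit test is not shown.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return the absolute frequency of each distinct word in the text."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return list(Counter(words).values())


def log_likelihood(gamma, freqs):
    """Log-likelihood of the frequencies under S(n) = n**(-gamma), n >= 1.

    The probability mass is f(n) = S(n) - S(n + 1), written here as
    n**(-gamma) * (1 - (n / (n + 1))**gamma) for numerical stability.
    """
    counts = Counter(freqs)  # how many words share each frequency n
    return sum(
        k * (-gamma * math.log(n) + math.log1p(-((n / (n + 1)) ** gamma)))
        for n, k in counts.items()
    )


def fit_exponent(freqs, lo=0.01, hi=10.0, tol=1e-8):
    """Maximum-likelihood estimate of gamma by golden-section search.

    The log-likelihood is concave in gamma, so a bracketing search on the
    assumed interval [lo, hi] converges to the maximum within that interval.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, freqs) > log_likelihood(d, freqs):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2
```

A typical call would be `fit_exponent(word_frequencies(text))` for the full text of one book. Judging whether the fitted law is acceptable at the 0.05 level would additionally require a goodness-of-fit test, such as a Monte Carlo test on simulated samples, which this sketch leaves out.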

References (65)
Damián H. Zanette, Statistical Patterns in Written Language. arXiv: Computation and Language (2014).
Francesc Font-Clos, Álvaro Corral, Log-Log Convexity of Type-Token Growth in Zipf's Systems. Physical Review Letters, vol. 114, p. 238701 (2015). DOI: 10.1103/PHYSREVLETT.114.238701
Pamela Morris, Yudi Pawitan, In all likelihood: statistical modelling and inference using likelihood. The Mathematical Gazette, vol. 86, pp. 375-376 (2002). DOI: 10.2307/3621915
Jake Ryland Williams, James P. Bagrow, Christopher M. Danforth, Peter Sheridan Dodds, Text mixing shapes the anatomy of rank-frequency distributions. Physical Review E, vol. 91, p. 052811 (2015). DOI: 10.1103/PHYSREVE.91.052811
Álvaro Corral, Gemma Boleda, Ramon Ferrer-i-Cancho, Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts. PLOS ONE, vol. 10, pp. 1-23 (2015). DOI: 10.1371/JOURNAL.PONE.0129031
Ramon Ferrer-i-Cancho, Anna Deluca, Alvaro Corral, A practical recipe to fit discrete power-law distributions. arXiv: Applications (2012).