Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports.

Authors: Juan C Quiroz, Liliana Laranjo, Catalin Tufanaru, Ahmet Baki Kocaballi, Dana Rezazadegan

DOI: 10.1016/J.IJMEDINF.2020.104324

Keywords: Probability distribution, Power law, Exponential function, Mathematics, Word lists by frequency, Zipf's law, Bayesian probability, Log-normal distribution, Statistics, Prior probability

Abstract: Background: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions. Objective: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language where word frequency follows a discrete power-law distribution. Method: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data. Result: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, with the truncated power-law providing superior fits over a pure power-law. Conclusion: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
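
The pipeline described in the Method section (tokenize, count token frequency, fit a power law, compare against an alternative distribution) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the regular-expression tokenizer, the choice of `xmin = 1`, and the use of continuous maximum-likelihood approximations for discrete counts are all assumptions made for this example. The lognormal fit is also not renormalised to `x >= xmin`, so the log-likelihood difference is only a rough indicator.

```python
import math
import re
from collections import Counter


def token_frequencies(reports):
    """Split each report into lowercase word tokens and count them.

    The tokenizer is an assumption: the abstract does not say how
    the reports were split into tokens.
    """
    counts = Counter()
    for text in reports:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts


def fit_power_law(freqs, xmin=1.0):
    """Continuous-approximation MLE of a power-law exponent for x >= xmin."""
    xs = [x for x in freqs if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in xs)
    if log_sum == 0:
        raise ValueError("all values equal xmin; the exponent is undefined")
    alpha = 1.0 + len(xs) / log_sum
    # log p(x) = log((alpha - 1) / xmin) - alpha * log(x / xmin)
    loglik = sum(
        math.log((alpha - 1.0) / xmin) - alpha * math.log(x / xmin) for x in xs
    )
    return alpha, loglik


def fit_lognormal(freqs, xmin=1.0):
    """MLE of lognormal parameters (not renormalised to x >= xmin)."""
    logs = [math.log(x) for x in freqs if x >= xmin]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    if sigma == 0:
        raise ValueError("all values are identical; sigma is zero")
    loglik = sum(
        -v - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (v - mu) ** 2 / (2 * sigma ** 2)
        for v in logs
    )
    return mu, sigma, loglik


def compare_fits(freqs, xmin=1.0):
    """Return fitted parameters and the power-law minus lognormal log-likelihood."""
    alpha, ll_power = fit_power_law(freqs, xmin)
    mu, sigma, ll_lognormal = fit_lognormal(freqs, xmin)
    return {
        "alpha": alpha,
        "mu": mu,
        "sigma": sigma,
        # Positive favours the power law, negative favours the lognormal.
        "loglik_difference": ll_power - ll_lognormal,
    }


# Toy input standing in for discharge reports; not MIMIC-III data.
toy_reports = [
    "The patient was discharged. The patient is stable.",
    "Patient stable",
]
toy_counts = token_frequencies(toy_reports)
toy_result = compare_fits(list(toy_counts.values()))
```

For a real analysis, a dedicated package such as the poweRlaw R package would be a more appropriate tool, since it handles discrete data, the choice of `xmin`, and significance testing of the likelihood ratio.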
