Authors: Juan C Quiroz, Liliana Laranjo, Catalin Tufanaru, Ahmet Baki Kocaballi, Dana Rezazadegan
DOI: 10.1016/J.IJMEDINF.2020.104324
Keywords: Probability distribution, Power law, Exponential function, Mathematics, Word lists by frequency, Zipf's law, Bayesian probability, Log-normal distribution, Statistics, Prior probability
Abstract: Background: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions. Objective: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed property of language where word frequency follows a discrete power-law distribution. Method: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data. Result: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian by having the truncated power-law provide superior fits over a pure power-law. Conclusion: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
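The first two steps of the method (splitting reports into tokens and counting token frequency) can be illustrated with a minimal sketch. This is not the authors' pipeline: the tokenizer, the function names `tokenize`, `rank_frequencies`, and `loglog_slope`, and the toy report strings are all assumptions made for illustration. The least-squares slope on a log-log rank-frequency table is only a rough Zipf diagnostic and does not replace the distribution fitting and comparison described in the abstract.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_frequencies(documents):
    """Return token counts sorted from most to least frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return sorted(counts.values(), reverse=True)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two frequencies; a slope near -1 is what Zipf's law predicts.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy stand-ins for discharge reports (invented text, not MIMIC-III data).
reports = [
    "Patient admitted with chest pain. Patient stable on discharge.",
    "Patient denies pain. Follow up with primary care.",
]
freqs = rank_frequencies(reports)
print(freqs[:5], round(loglog_slope(freqs), 2))
```

A slope close to -1 on real data would be consistent with Zipf's law, but log-log regression is known to give biased exponent estimates, so a likelihood-based fit with explicit comparison against alternatives such as the lognormal is the more reliable check.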