That Simple Device Already Used by Gauss

作者: Peter Grünwald , Jorma Rissanen

DOI:

关键词:

摘要: From November 1998 until September 1999, Jorma Rissanen and I met on a regular basis. Here recall some of our stimulating conversations the work that we did together. This work, based almost exclusively single page [12], was left unfinished has never been published, but it indirectly had profound impact my career. 1 Meet first in 1998. just obtained Ph.D. Amsterdam started postdoc at Stanford University. These were exciting times: height dot-com boom, right middle it. Since thesis all about MDL Principle, suggested could meet person during stay California. replied he would like to. delighted, honored also bit worried, since forewarned not your “usual” kind scientist... San Francisco, Category Theory Statistics found that, while one for polite small talk, having lots beer with circle academic friends (for most scientists seems to be other way around). He didn’t talk much, what said invariably point, direct frank. During meeting, when told him lived Francisco seemed me so much nicer than Palo Alto, his immediate reply “I don’t Francisco”. Later day, talked science general, quite surprised learn years IBM, serious student category theory – even published professor Sweden [15]. went say easier statistics, field which made such major contributions. When asked why, “because statistics is nonsense. It exceedingly hard teach yourself nonsense!”. Vintage Jorma: blunt sharp same time. Some fear this, others, me, find Jorma’s conversation delightfully refreshing. is, fact, notorious strong opinions sub-fields mathematics once spent lot time studying complex function theory, exclaimed (not entirely seriously, believe) “you’ve wasted youth!” as harsh own others vividly saying “how have stupid” central proof [13] more difficult needed. understand another his, [11], himself only understands time, “when better days”. am referring perhaps well-known theorem: result an extension information inequality, but, from statistical point view, can interpreted — put it, now less modestly correctly “a grand Cramer-Rao theorem.” Soccer Impounded Cars Our meeting well, visit again. In end, visited every 3-4 weeks year still IBM Almaden, Jose, 40 minutes further down highway. usually arrived 10 morning stayed whole day. Between 12 2 however, often leave play soccer group friends, something three times week. At 67, very good it: youth seriously considered becoming professional player. decided pursue instead visits, doesn’t know whether choice. out soccer, used lunch IBM’s luxurious cafeteria, paid by card. On occasion, car broken down, Nemo, probably arrange new little $ 150: Nemo dealer who took over impounded cars police if they hadn’t picked up year. fascinated: brilliant scientist always speaks mind, played world-class level counts dealers among friends. Rather actually working joint publication, both pursuing related distinct ideas, discussing latest progress meetings. So on? Prediction Coding learned Principle monograph “Stochastic Complexity Statistical Inquiry” “little green book,” eloquently puts forward main ideas underlying Principle. Chapter 2, notes different research communities understanding concept “model”. model family probability distributions, example, Gaussian or normal family. fields pattern recognition machine learning, deterministic hypotheses predictors. For may try relationship between variable X, taking values set X , Y Y, considering “model” F really functions f : → Y. then fit data (x1, y1), . (xn, yn), using, least squares criterion; concrete examples are given below. claims perspective, there no real distinction probabilistic type model: viewed defining codes equivalently, description methods data. inference should proceed selecting (set methods) leads shortest codelength A Simple Device Already Used Gauss How, then, associate models codes? this obvious: Kraft inequality tells us distribution P exists uniquely decodable code outcomes x, x (essentially) equal − log p(x), p being mass The indicates particular reasonable want [3, 3]. But predictors f? According Jorma, map them follows. We each conditional pf defined (y | x) affine (linear constant offset) loss L(y, f(x)) makes predicting y x. use lengths His remarks worth quoting full [[12], 18; mathematical notation slightly adjusted, material square brackets emphasis added myself]: ...The two views however reconciled simple device already distributions bearing name. any desired distance measure L(yi, ŷi), [and predictor under consideration] define density (yi xi) = Ke−L(yi,f(xi)), (1) where K chosen becomes proper range yi. Taking product these, ∏n i=1 xi), observed set, gives sequences [...] [For example] let consist binary sequence y1, yn. With ŷi past observations, ŷi) 0 prediction correct, i.e., yi ŷi, else 1. criterion goodness number mispredictions sequence. Picking base exponential get (1 K/2, depending predicted symbol is. either case, (0 1−P makesK 2/3. mistaken predictions differs quantity ∑ i p(yi seen expression terms corresponds according code] ...As defines mean variance 1/2. induced x), its negative logarithm, n

参考文章(14)
Nicolo Cesa-Bianchi, Gabor Lugosi, Prediction, learning, and games ,(2006)
J. Rissanen, B. Wyman, Duals of input/output maps Lecture Notes in Computer Science. pp. 204- 208 ,(1975) , 10.1007/3-540-07142-3_83
Jorma Rissanen, Stochastic Complexity and Modeling Annals of Statistics. ,vol. 14, pp. 1080- 1100 ,(1986) , 10.1214/AOS/1176350051
M. D. LEE, Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review. ,vol. 15, pp. 1- 15 ,(2008) , 10.3758/PBR.15.1.1
J. Ross Quinlan, Ronald L. Rivest, Inferring decision trees using the minimum description length principle Information & Computation. ,vol. 80, pp. 227- 248 ,(1989) , 10.1016/0890-5401(89)90010-2
Peter Grünwald, Viewing all models as “probabilistic” conference on learning theory. pp. 171- 182 ,(1999) , 10.1145/307400.307436
J.J. Rissanen, Fisher information and stochastic complexity IEEE Transactions on Information Theory. ,vol. 42, pp. 40- 47 ,(1996) , 10.1109/18.481776
Peter Grünwald, John Langford, Suboptimal behavior of Bayes and MDL in classification under misspecification Machine Learning. ,vol. 66, pp. 119- 149 ,(2007) , 10.1007/S10994-007-0716-7
N. Littlestone, M.K. Warmuth, The weighted majority algorithm Information & Computation. ,vol. 108, pp. 212- 261 ,(1994) , 10.1006/INCO.1994.1009