A comparison of algorithms for maximum entropy parameter estimation

作者: Robert Malouf

DOI: 10.3115/1118853.1118871

关键词: Gradient descentPrinciple of maximum entropyVariable (computer science)Estimation theoryAlgorithmConjugate gradient methodComputer scienceGeneralized iterative scalingFree parameterMathematical optimizationMetric (mathematics)

摘要: Conditional maximum entropy (ME) models provide a general purpose machine learning technique which has been successfully applied to fields as diverse computer vision and econometrics, is used for wide variety of classification problems in natural language processing. However, the flexibility ME not without cost. While parameter estimation conceptually straightforward, practice typical tasks are very large, may well contain many thousands free parameters. In this paper, we consider number algorithms estimating parameters models, including iterative scaling, gradient ascent, conjugate gradient, variable metric methods. Sur-prisingly, standardly scaling perform quite poorly comparison others, all test problems, limited-memory algorithm outperformed other choices.

参考文章(25)
Alpino: Wide-coverage computational analysis of Dutch computational linguistics in the netherlands. pp. 45- 59 ,(2001) , 10.1163/9789004333901_004
Satish Balay, William D. Gropp, Lois Curfman McInnes, Barry F. Smith, Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries Modern Software Tools for Scientific Computing. pp. 163- 202 ,(1997) , 10.1007/978-1-4612-1986-6_8
Steven P. Abney, Stochastic attribute-value grammars Computational Linguistics. ,vol. 23, pp. 597- 618 ,(1997) , 10.5555/972791.972800
John D. Lafferty, Bernhard Suhm, Cluster Expansions and Iterative Scaling for Maximum Entropy Language Models Maximum Entropy and Bayesian Methods. pp. 195- 202 ,(1996) , 10.1007/978-94-011-5430-7_23
L. L. Campbell, Equivalence of Gauss's Principle and Minimum Discrimination Information Estimation of Probabilities Annals of Mathematical Statistics. ,vol. 41, pp. 1011- 1015 ,(1970) , 10.1214/AOMS/1177696977
Miles Osborne, Using maximum entropy for sentence extraction Proceedings of the ACL-02 Workshop on Automatic Summarization -. pp. 1- 8 ,(2002) , 10.3115/1118162.1118163
J. N. Darroch, D. Ratcliff, Generalized Iterative Scaling for Log-Linear Models Annals of Mathematical Statistics. ,vol. 43, pp. 1470- 1480 ,(1972) , 10.1214/AOMS/1177692379
W. Edwards Deming, Frederick F. Stephan, On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals are Known Annals of Mathematical Statistics. ,vol. 11, pp. 427- 444 ,(1940) , 10.1214/AOMS/1177731829
E. T. Jaynes, Information Theory and Statistical Mechanics Physical Review. ,vol. 106, pp. 620- 630 ,(1957) , 10.1103/PHYSREV.106.620