HPS: High precision stemmer

Authors: Tomáš Brychcín, Miloslav Konopík

DOI: 10.1016/J.IPM.2014.08.006

Abstract: Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. The stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities of solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above-mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (the words that are not present in the training set). Moreover, it was discovered that our approach demands very little text data for training when compared with competing unsupervised algorithms.
