Authors: Tomáš Brychcín, Miloslav Konopík
DOI: 10.1016/J.IPM.2014.08.006
Keywords:
Abstract: Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, which exploits the lexical and semantic information of words, is used to prepare large-scale training data for the second-stage algorithm. The second-stage algorithm uses a maximum entropy classifier. The stemming-specific features help the classifier decide when and how to stem a particular word. In our research, we have pursued the goal of creating a multi-purpose stemming tool. Its design opens up possibilities of solving non-traditional tasks such as approximating lemmas or improving language modeling. However, we still aim at very good results in the traditional task of information retrieval. The conducted tests reveal exceptional performance in all the above mentioned tasks. Our stemming method is compared with three state-of-the-art statistical algorithms and one rule-based algorithm. We used corpora in the Czech, Slovak, Polish, Hungarian, Spanish and English languages. In the tests, our algorithm excels in stemming previously unseen words (the words that are not present in the training set). Moreover, it was discovered that our approach demands very little text data for training when compared with competing unsupervised algorithms.
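The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. Stage 1 here is a simple prefix-grouping heuristic that stands in for the paper's clustering (it ignores semantic information entirely), and stage 2 is a small multinomial logistic regression, which is one standard form of a maximum entropy classifier. All names (`stage1_pairs`, `features`, `train_maxent`, `stem`, `VOCAB`), the suffix features, and the hyperparameters are assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy vocabulary; the paper uses large corpora instead.
VOCAB = ["walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking"]


def stage1_pairs(vocab, min_stem=3):
    """Stage 1 (stand-in): label each word with how many trailing
    characters to strip, using the shortest vocabulary word that is a
    prefix of it as the stem. A crude proxy for clustering."""
    pairs = []
    for word in vocab:
        prefixes = [v for v in vocab
                    if len(v) >= min_stem and word.startswith(v)]
        stem_form = min(prefixes, key=len) if prefixes else word
        pairs.append((word, len(word) - len(stem_form)))
    return pairs


def features(word):
    """Assumed stemming-specific features: a bias plus word-final n-grams."""
    return ["bias"] + [f"suf{n}={word[-n:]}"
                       for n in (1, 2, 3) if len(word) >= n]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(feats, classes, weights):
    return [sum(weights[(f, c)] for f in feats) for c in classes]


def train_maxent(pairs, epochs=200, lr=0.5):
    """Stage 2: fit a maximum entropy (softmax) classifier by plain
    stochastic gradient ascent on the log-likelihood."""
    classes = sorted({k for _, k in pairs})
    weights = defaultdict(float)
    for _ in range(epochs):
        for word, k in pairs:
            feats = features(word)
            probs = softmax(class_scores(feats, classes, weights))
            for c, p in zip(classes, probs):
                grad = (1.0 if c == k else 0.0) - p
                for f in feats:
                    weights[(f, c)] += lr * grad
    return classes, weights


def stem(word, classes, weights):
    """Predict how many characters to strip, keeping at least one."""
    scores = class_scores(features(word), classes, weights)
    k = classes[scores.index(max(scores))]
    k = min(k, len(word) - 1)
    return word[:len(word) - k]


if __name__ == "__main__":
    classes, weights = train_maxent(stage1_pairs(VOCAB))
    for unseen in ["jumping", "jumped"]:
        print(unseen, "->", stem(unseen, classes, weights))
```

The point of the split is that the classifier only sees surface features of a word, so it can be applied to words that never appeared in the stage 1 output, which is the unseen-word setting the abstract highlights.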