Cross-language text classification

Authors: J. Scott Olsson, Douglas W. Oard, Jan Hajič

DOI: 10.1145/1076034.1076170

Abstract: Our goal in cross-language text classification (CLTC) is to use English training data to classify Czech documents (although the concepts presented here are applicable to any language pair). CLTC is an off-line problem, and the authors are unaware of any previous work in this area. CLTC is motivated by both the non-availability of Czech training data (the case, presently, in our dataset) and the possibility of leveraging different topic distributions in different languages to improve overall classification for information retrieval. Consider, for example, that English speakers tend to contribute more to some topics than their Czech counterparts (e.g., to discuss London more than Prague), so that, having only training data in English, we may expect to do poorly at identifying topics like Prague. Czech speakers, on the other hand, often talk about Prague, [...] data, we might [...] detecting Prague [...] speakers; this is exactly the sort of thesaurus label which information seekers are most interested in, because it is rare. Accordingly, while a lack of Czech training data presently necessitates CLTC, we would have no reason to warrant the method's abandonment if such data were suddenly to become available. Our dataset is a collection of manually transcribed, spontaneous, conversational speech in English and Czech. English transcripts have human-assigned labels from a hierarchical thesaurus of approximately 40,000 labels. Presently, labeled Czech data is not available for classifier training. The thesaurus hierarchy may be divided into two principal branches, containing 1) concept labels (e.g., education) and 2) precoordinated place-date labels (e.g., Germany, 1914 – 1918).

References (2)
Guido Minnen, John Carroll, Darren Pearce. Applied morphological processing of English. Natural Language Engineering, vol. 7, pp. 207-223 (2001). DOI: 10.1017/S1351324901002728
Martin Cmejrek, Jan Curín. Automatic Translation Lexicon Extraction from Czech-English Parallel Texts. The Prague Bulletin of Mathematical Linguistics, vol. 71, pp. 47-58 (1999).