Abstract: Our goal in cross-language text classification (CLTC) is to use English training data to classify Czech documents (although the concepts presented here are applicable to any language pair). CLTC is an off-line problem, and the authors are unaware of any previous work in this area. CLTC is motivated both by the non-availability of Czech training data (the case, presently, with our dataset) and by the possibility of leveraging the different topic distributions of different languages to improve overall classification for information retrieval. Consider, for example, that English speakers tend to contribute more to some topics than their Czech counterparts (e.g., to discuss London more than Prague), so that, having only English training data, we may expect to do poorly at identifying topics like Prague. Czech speakers, on the other hand, often talk about Prague, so that, with Czech training data, we might do well at detecting Prague in English speakers; this is exactly the sort of thesaurus label in which information seekers are most interested, because it is rare. Accordingly, while a lack of Czech training data presently necessitates CLTC, there would be no reason to abandon the method if such data were suddenly to become available. Our dataset is a collection of manually transcribed, spontaneous, conversational speech in English and Czech. The English transcripts have human-assigned labels from a hierarchical thesaurus of approximately 40,000 labels. Presently, labeled Czech transcripts are not available for classifier training. The thesaurus hierarchy may be divided into two principal branches, containing 1) concept labels (e.g., education) and 2) precoordinated place-date labels (e.g., Germany, 1914–1918).