Author: Fredric C. Gey
Keywords: Logistic regression, Data mining, Statistical hypothesis testing, Cosine similarity, Inference, Relevance (information retrieval), tf–idf, Statistics, Document retrieval, Multinomial logistic regression, Vector space model, Computer science, Probabilistic logic, Weighting
Abstract: This research evaluates a model for probabilistic text and document retrieval; the model utilizes the technique of logistic regression to obtain equations which rank documents by probability of relevance as a function of document and query properties. Since the model infers probability of relevance from statistical clues present in the texts of documents and queries, we call it logistic inference. By transforming the distribution of each statistical clue into its standardized distribution (one with mean μ = 0 and standard deviation σ = 1), the method allows one to apply logistic coefficients derived from a training collection to other document collections, with little loss of predictive power. The model is applied to three well-known information retrieval test collections, and the results are compared directly to the particular vector space model of retrieval which uses term-frequency/inverse-document-frequency (tfidf) weighting and the cosine similarity measure. In the comparison, the logistic inference method performs significantly better than (in two collections) or equally well as (in the third collection) the tfidf/cosine vector space model. The differences in performances of the two models were subjected to statistical tests to see if the differences are statistically significant or could have occurred by chance.
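
The two steps the abstract describes, standardizing each statistical clue and then ranking documents by a logistic estimate of relevance probability, can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the clue values, the intercept, and the coefficients are hypothetical placeholders, and the sketch assumes that standardization is done per clue across the documents scored for one query.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Map raw clue values to z-scores (mean 0, standard deviation 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant clue carries no ranking information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def relevance_probability(z_clues, intercept, coefficients):
    """Logistic model: P(relevant) = 1 / (1 + exp(-(b0 + sum(b_i * z_i))))."""
    log_odds = intercept + sum(b * z for b, z in zip(coefficients, z_clues))
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank_documents(raw_clues, intercept, coefficients):
    """Rank documents by estimated probability of relevance.

    raw_clues maps a document id to its list of raw clue values for one query.
    """
    doc_ids = list(raw_clues)
    # Standardize each clue (column) across the documents being scored.
    columns = list(zip(*(raw_clues[d] for d in doc_ids)))
    z_columns = [standardize(list(col)) for col in columns]
    scored = []
    for i, doc_id in enumerate(doc_ids):
        z = [col[i] for col in z_columns]
        scored.append((doc_id, relevance_probability(z, intercept, coefficients)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical clue values and coefficients, chosen only for illustration.
raw = {"d1": [3.0, 0.2], "d2": [1.0, 0.9], "d3": [2.0, 0.4]}
ranking = rank_documents(raw, intercept=-1.0, coefficients=[1.5, 0.5])
for doc_id, prob in ranking:
    print(doc_id, round(prob, 3))
```

Because the coefficients act on z-scores rather than raw values, the same coefficient vector can in principle be reused on a collection whose raw clue distributions differ from the training collection, which is the transfer property the abstract claims.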