A machine-learning approach to discovering company home pages

Authors: Wojciech Gryc, Prem Melville, Richard D. Lawrence

DOI: 10.1109/DEST.2010.5610621

Abstract: For many marketing and business applications, it is necessary to know the home page of a company specified only by its name. If we require the home page for a small number of big companies, this task is readily accomplished via the use of Internet search engines or access to domain registration lists. However, if the entities of interest are small companies, these approaches can lead to mismatches, particularly if a company lacks a home page. We address this problem using a supervised machine-learning approach in which we train a binary classification model. We classify potential website matches for each company name based on a set of explanatory features extracted from the content of each candidate website. Our approach is related to web-based intelligence in two ways: (1) we build training data for our learning algorithms through crowdsourcing tools and illustrate their use for research, and (2) the success of our model allows one to easily use corporate home pages as data inputs into other research projects. Through the successful use of crowdsourcing, our approach is able to identify the correct website, or recognize that a valid website does not exist, with an accuracy that is 57% better than simply taking the highest ranked search engine result as the correct match.
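
The classification step described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' method: the abstract does not name the features or the learner, so the sketch uses two hypothetical features (similarity between the company name and the domain, and whether the name appears in the page text), a small hand-written logistic regression, and made-up companies and URLs standing in for crowdsourced labels.

```python
import math
from difflib import SequenceMatcher
from urllib.parse import urlparse


def features(company, url, page_text):
    """Hypothetical explanatory features for one (company, candidate site) pair."""
    name = company.lower()
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    domain = host.split(".")[0]
    compact = "".join(ch for ch in name if ch.isalnum())
    return [
        1.0,  # bias term
        SequenceMatcher(None, compact, domain).ratio(),  # name vs. domain similarity
        1.0 if name in page_text.lower() else 0.0,  # full name appears on the page
    ]


def predict(weights, x):
    """Logistic model: estimated probability that the candidate is the home page."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent."""
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def pick_home_page(weights, company, candidates, threshold=0.5):
    """Return the best-scoring URL, or None if no candidate reaches the threshold."""
    best_url, best_p = None, threshold
    for url, text in candidates:
        p = predict(weights, features(company, url, text))
        if p >= best_p:
            best_url, best_p = url, p
    return best_url


# Toy labeled pairs (invented; 1 = correct home page, 0 = mismatch).
labeled = [
    ("Acme Plumbing", "http://www.acmeplumbing.com", "Welcome to Acme Plumbing", 1),
    ("Acme Plumbing", "http://www.yellowdirectory.com", "Listings: Acme Plumbing, Bob's Pipes", 0),
    ("Blue Door Cafe", "http://bluedoorcafe.net", "Blue Door Cafe menu and hours", 1),
    ("Blue Door Cafe", "http://www.cityreviews.org", "Reviews of local restaurants", 0),
]
rows = [features(c, u, t) for c, u, t, _ in labeled]
labels = [y for *_, y in labeled]
weights = train(rows, labels)

candidates = [
    ("http://www.cityzip.com", "Postal codes for every city"),
    ("http://www.greenleafbooks.com", "Green Leaf Books: new and used titles"),
]
print(pick_home_page(weights, "Green Leaf Books", candidates))
```

Scoring every candidate and returning `None` when none reaches the threshold is one way to cover both outcomes the abstract mentions: identifying the correct website, or recognizing that no valid website exists. A real system would need a far larger labeled set and richer features than this toy setup.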
