Exploring the power of heterogeneous information sources

Authors: Jing Gao, Jiawei Han

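Abstract: The big data challenge is one unique opportunity for both data mining and database research and engineering. A vast ocean of data is collected from trillions of connected devices in real time on a daily basis, and useful knowledge is usually buried in data of multiple genres, from different sources, in different formats, and with different types of representation. Many interesting patterns cannot be extracted from a single data collection, but have to be discovered from the integrative analysis of all heterogeneous data sources available. Although many algorithms have been developed to analyze multiple information sources, real applications continuously pose new challenges: data can be gigantic, noisy, unreliable, dynamically evolving, highly imbalanced, and heterogeneous. Meanwhile, users provide limited feedback, have growing privacy concerns, and ask for actionable knowledge. In this thesis, we propose to explore the power of multiple heterogeneous information sources in such challenging learning scenarios. There are two perspectives on exploring the correlations among multiple information sources: explore their similarities (consensus combination), or their differences (inconsistency detection). In consensus combination, we focus on the task of classification with multiple information sources. Multiple information sources for the same set of objects can provide complementary predictive powers, and by combining their expertise, the prediction accuracy is significantly improved. However, the major challenge is that it is hard to obtain sufficient and reliable labeled data for effective training, because they require the efforts of experienced human annotators. In some data sources, we may only have a large amount of unlabeled data. Although such unlabeled information does not directly generate label predictions, it provides useful constraints on the classification task. Therefore, we first propose a graph-based consensus maximization framework to combine multiple supervised and unsupervised models obtained from all the available information sources. We further demonstrate the benefits of model combination in two specific learning scenarios. In transfer learning, we propose an effective model combination framework to transfer knowledge from multiple sources to a target domain with no labeled data. We also demonstrate the robustness of model combination on dynamically evolving data. On the other hand, when unexpected disagreement is encountered across diverse information sources, this might raise a red flag and require in-depth investigation. Another line of my thesis research is to explore differences among multiple information sources to find anomalies. We first propose a spectral method to detect objects performing inconsistently across multiple heterogeneous information sources as a new type of anomaly. Traditional anomaly detection methods discover anomalies based on the degree of deviation from normal objects in one data source, whereas the proposed approach detects anomalies according to the degree of inconsistencies across multiple sources. The principle of inconsistency detection can benefit many applications, and in particular, we show how this principle can help identify anomalies in information networks and distributed systems. We propose probabilistic models to detect anomalies in a social community by comparing link and node information, and to detect system problems from connected machines in distributed systems by modeling correlations among machines. In this thesis, we go beyond the scope of traditional ensemble learning to address challenges faced by many applications with multiple data sources. With the proposed consensus maximization framework, labeled data are no longer a requirement for successful multi-source classification; instead, the use of existing labeling experts is maximized by integrating knowledge from relevant domains and unlabeled information sources. The proposed concept of inconsistency detection across multiple data sources opens up a new direction of anomaly detection. The detected anomalies, which cannot be found by traditional anomaly detection techniques, provide new insights into the application area. The algorithms we developed have been proved useful in many areas, including social network analysis, cyber-security, and business intelligence, and have the potential of being applied to many other areas, such as healthcare, bioinformatics, and energy efficiency. As both the amount of data and the number of sources in our world are exploding, there are still great opportunities as well as numerous research challenges for the inference of actionable knowledge from multiple heterogeneous sources of massive data collections.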
