Statistical Analysis of Spherical Data: Clustering, Feature Selection and Applications

Author: Ola Amayri

DOI:

Keywords: Statistical model, Feature selection, Cluster analysis, Probabilistic logic, Mixture model, Data modeling, Machine learning, Knowledge extraction, Data mining, Artificial intelligence, Computer science, Bayesian inference

Abstract: In light of interdisciplinary applications, the data to be studied and analyzed have witnessed a growth in volume and a change in their intrinsic structure and type. In other words, in practice the diversity of resources generating objects has imposed several challenges for the decision maker to determine informative data in terms of time, model capability, scalability and knowledge discovery. Thus, it is highly desirable to be able to extract patterns of interest that support data management. Clustering, among other machine learning approaches, is an important data engineering technique that empowers the automatic discovery of clusters of similar objects and the consequent assignment of new unseen objects to appropriate clusters. In this context, the majority of current research does not completely address the true structure and nature of the data for the particular application at hand. In contrast to most previous research, our proposed work focuses on the modeling and classification of spherical data, which are naturally generated in many data mining and knowledge discovery applications. Thus, in this thesis we propose several estimation and feature selection frameworks based on the Langevin distribution, which are devoted to spherical patterns in offline and online settings.

In this thesis, we first formulate a unified probabilistic framework, where we build probabilistic kernels based on the Fisher score and information divergences from finite Langevin mixtures for the Support Vector Machine. We are motivated by the fact that the blending of generative and discriminative approaches has prevailed by exploring and adopting the distinct characteristics of each approach toward constructing a complementary system combining the best of both. Due to the high demand to construct compact and accurate statistical models that are automatically adjustable to dynamic changes, we next propose probabilistic frameworks for high-dimensional spherical data modeling based on finite Langevin mixtures that allow simultaneous clustering and feature selection in offline and online settings. To this end, we adopted finite mixture models, which have long relied heavily on deterministic learning approaches such as maximum likelihood estimation. Despite their successful utilization in a wide spectrum of areas, these approaches have several drawbacks, as we will discuss in this thesis. An alternative approach is the adoption of Bayesian inference, which naturally addresses data uncertainty while ensuring good generalization. To address this issue, we also propose a Bayesian approach for finite Langevin mixture model estimation and selection. When data change dynamically and grow drastically, a finite mixture is not always a feasible solution. In contrast with previous approaches, which suppose an unknown finite number of mixture components, we finally propose a nonparametric Bayesian approach which assumes an infinite number of components. We further enhance our model by simultaneously detecting informative features in the process of clustering. Through extensive empirical experiments, we demonstrate the merits of the proposed learning frameworks on diverse high-dimensional datasets and challenging real-world applications.
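
As a rough illustration of what "spherical data" and the Langevin distribution mean in practice, the sketch below L2-normalizes feature vectors onto the unit sphere and evaluates the log-density of a single Langevin component (commonly also called the von Mises-Fisher distribution). It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names `to_unit_sphere` and `vmf_log_density` are hypothetical, the inputs are assumed to contain no all-zero rows, and `kappa` is assumed to be strictly positive.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_v


def to_unit_sphere(X):
    """L2-normalize each row so every sample lies on the unit sphere.

    Assumes no row is the zero vector (illustrative helper, not from the thesis).
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms


def vmf_log_density(X, mu, kappa):
    """Log-density of a Langevin (von Mises-Fisher) component at each row of X.

    f(x | mu, kappa) = C_d(kappa) * exp(kappa * mu^T x), with
    C_d(kappa) = kappa^(d/2 - 1) / ((2*pi)^(d/2) * I_{d/2 - 1}(kappa)).
    Assumes rows of X and mu are unit vectors and kappa > 0.
    """
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)
    d = X.shape[1]
    nu = d / 2.0 - 1.0
    # ive(nu, kappa) = I_nu(kappa) * exp(-kappa), so add kappa back in log space.
    log_bessel = np.log(ive(nu, kappa)) + kappa
    log_norm = nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel
    return log_norm + kappa * (X @ mu)


if __name__ == "__main__":
    X = to_unit_sphere([[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]])
    mu = np.array([0.0, 0.0, 1.0])  # mean direction
    print(vmf_log_density(X, mu, kappa=2.0))
```

A finite Langevin mixture, as described in the abstract, would combine several such components with mixing weights; the estimation and feature selection procedures themselves are not sketched here.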
