Middleware for data mining applications on clusters and grids

Authors: Leonid Glimcher, Ruoming Jin, Gagan Agrawal

DOI: 10.1016/J.JPDC.2007.06.007

Keywords: Grid; Middleware (distributed applications); Information extraction; Data warehouse; Distributed computing; Data stream mining; Node (computer science); Middleware; Computer science; Data mining; Database; Transaction processing; Data retrieval

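Abstract: This paper gives an overview of two middleware systems that have been developed over the last 6 years to address the challenges involved in developing parallel and distributed implementations of data mining algorithms. FREERIDE (FRamework for Rapid Implementation of Data mining Engines) focuses on data mining in a cluster environment. FREERIDE is based on the observation that parallel versions of several well-known data mining techniques share a relatively similar structure, and can be parallelized by dividing the data instances (or records or transactions) among the nodes. The computation on each node involves reading the data instances in an arbitrary order, processing each data instance, and performing a local reduction. The reduction involves only commutative and associative operations, which means the result is independent of the order in which the data instances are processed. After the local reduction on each node, a global reduction is performed. This similarity in structure can be exploited by the middleware system to execute data mining tasks efficiently in parallel, starting from a relatively high-level specification of the technique. To enable processing of data sets stored in remote data repositories, we have extended FREERIDE into FREERIDE-G (FRamework for Rapid Implementation of Data mining Engines in Grid). FREERIDE-G supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. The added functionality in FREERIDE-G aims at abstracting the details of remote data retrieval, movements, and caching from application developers.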
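
The processing structure described in the abstract (partition the instances among nodes, reduce locally, then reduce globally) can be illustrated with a minimal single-process sketch. It is not the FREERIDE API: the function names `partition`, `local_reduction`, `global_reduction`, and `mine` are hypothetical, item-frequency counting is chosen only as a simple stand-in for a data mining task, and the "nodes" are simulated as list slices rather than real cluster processes.

```python
from collections import Counter
from functools import reduce
import random


def partition(instances, num_nodes):
    """Divide the data instances among the (simulated) nodes, round-robin."""
    return [instances[i::num_nodes] for i in range(num_nodes)]


def local_reduction(instances):
    """Process one node's instances, in whatever order they arrive."""
    acc = Counter()  # the per-node reduction object (assumed representation)
    for transaction in instances:
        for item in set(transaction):
            acc[item] += 1  # integer addition is commutative and associative
    return acc


def global_reduction(local_results):
    """Combine the per-node reduction objects into a single result."""
    return reduce(lambda a, b: a + b, local_results, Counter())


def mine(instances, num_nodes):
    """Run local reductions on every partition, then one global reduction."""
    return global_reduction(local_reduction(p) for p in partition(instances, num_nodes))


transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["c"]]
result = mine(transactions, num_nodes=2)

# Because the reduction is commutative and associative, changing the
# processing order or the number of nodes should not change the result.
shuffled = transactions[:]
random.Random(0).shuffle(shuffled)
assert mine(shuffled, num_nodes=3) == result
```

The final assertion checks the order-independence property the abstract relies on: the same counts come out regardless of how the instances are ordered or how many partitions are used.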
