Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems

Authors: Feng Yan, Olatunji Ruwase, Yuxiong He, Trishul Chilimbi

DOI: 10.1145/2783258.2783270

Keywords: State space, Scalability, Deep learning, Computer science, Provisioning, Benchmark (computing), Subject-matter expert, Image (mathematics), Artificial intelligence, Machine learning, Artificial neural network

Abstract: Big deep neural network (DNN) models trained on large amounts of data have recently achieved the best accuracy on hard tasks, such as image and speech recognition. Training these DNNs using a cluster of commodity machines is a promising approach since training is time consuming and compute-intensive. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples, with a global parameter server maintaining shared weights across these replicas. The correct choice for model partitioning and overall system provisioning is highly dependent on the DNN and the distributed hardware characteristics. These decisions currently require significant domain expertise and time-consuming empirical state space exploration. This paper develops performance models that quantify the impact of these partitioning and provisioning decisions on overall distributed system performance and scalability. We also use the performance models to build a scalability optimizer that efficiently determines the optimal system configuration that minimizes DNN training time. We evaluate our performance models and scalability optimizer using a state-of-the-art distributed DNN training framework on two benchmark applications. The results show that our performance models estimate training time with high estimation accuracy, and that our scalability optimizer correctly chooses the best configurations, minimizing the training time of distributed DNNs.
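To make the abstract's idea concrete, the sketch below shows the general shape of an analytical performance model plus a brute-force scalability optimizer over (model-parallelism, data-parallelism) configurations. The cost formulas, constants, `Config`, `step_time`, `epoch_time`, and `best_config` are all illustrative assumptions for this sketch, not the paper's actual model or API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    model_parts: int  # machines a single model replica is partitioned across
    replicas: int     # number of model replicas training in parallel


def step_time(cfg: Config, flops=1e12, machine_flops=1e11,
              weights_bytes=4e8, bandwidth=1e9, sync_overhead=0.05):
    """Estimated seconds per minibatch step (toy cost model)."""
    # Compute shrinks as the model is partitioned across more machines,
    # but model parallelism adds per-partition synchronization overhead.
    compute = flops / (machine_flops * cfg.model_parts)
    compute += sync_overhead * (cfg.model_parts - 1)
    # Each replica pushes/pulls its weight shard to the parameter server;
    # contention grows with the number of replicas sharing the server.
    comm = cfg.replicas * weights_bytes / (cfg.model_parts * bandwidth)
    return compute + comm


def epoch_time(cfg: Config, examples=1_000_000, batch=256):
    """Estimated seconds per epoch: replicas split the data set."""
    steps = examples / (batch * cfg.replicas)
    return steps * step_time(cfg)


def best_config(max_machines=64):
    """Exhaustively search configurations that fit the machine budget."""
    candidates = [Config(p, r)
                  for p, r in product([1, 2, 4, 8], [1, 2, 4, 8, 16])
                  if p * r <= max_machines]
    return min(candidates, key=epoch_time)
```

The point of the search is that neither partitioning axis is free: more model partitions cut compute time but add synchronization cost, and more replicas cut steps per epoch but add parameter-server traffic, so the optimum depends on the hardware constants fed to the model.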

References (35)
Léon Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. Proceedings of COMPSTAT'2010, pp. 177–186 (2010). DOI: 10.1007/978-3-7908-2604-3_16
Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, Karthik Kalyanaraman. Project Adam: Building an Efficient and Scalable Deep Learning Training System. Operating Systems Design and Implementation (OSDI), pp. 571–582 (2014). DOI: 10.5555/2685048.2685094
James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, Eric Xing. Solving the Straggler Problem with Bounded Staleness. Hot Topics in Operating Systems (HotOS), p. 22 (2013).
Danilo P. Mandic, Jonathon Chambers. Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. John Wiley & Sons, Inc. (2001).
Gurmeet Singh, Carl Kesselman, Ewa Deelman. A Provisioning Model and Its Comparison with Best-Effort for Performance-Cost Optimization in Grids. High Performance Distributed Computing (HPDC), pp. 117–126 (2007). DOI: 10.1145/1272366.1272382
P. Jogalekar, M. Woodside. Evaluating the Scalability of Distributed Systems. IEEE Transactions on Parallel and Distributed Systems, vol. 11, pp. 589–603 (2000). DOI: 10.1109/71.862209
D. H. Hubel, T. N. Wiesel. Receptive Fields of Single Neurones in the Cat's Striate Cortex. The Journal of Physiology, vol. 148, pp. 574–591 (1959). DOI: 10.1113/JPHYSIOL.1959.SP006308
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. Computer Vision and Pattern Recognition (CVPR), pp. 248–255 (2009). DOI: 10.1109/CVPR.2009.5206848
Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, vol. 86, pp. 2278–2324 (1998). DOI: 10.1109/5.726791