Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems

Authors: Feng Yan, Olatunji Ruwase, Yuxiong He, Trishul Chilimbi

DOI: 10.1145/2783258.2783270

Keywords: State space, Scalability, Deep learning, Computer science, Provisioning, Benchmark (computing), Subject-matter expert, Image (mathematics), Artificial intelligence, Machine learning, Artificial neural network

Abstract: Big deep neural network (DNN) models trained on large amounts of data have recently achieved the best accuracy on hard tasks, such as image and speech recognition. Training these DNNs using a cluster of commodity machines is a promising approach since training is time consuming and compute-intensive. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples, with a global parameter server maintaining shared weights across these replicas. The correct choice for model partitioning and overall system provisioning is highly dependent on the DNN and the distributed hardware characteristics. These decisions currently require significant domain expertise and time-consuming empirical state space exploration. This paper develops performance models that quantify the impact of these partitioning and provisioning decisions on overall distributed system performance and scalability. We also use the performance models to build a scalability optimizer that efficiently determines the optimal system configuration that minimizes DNN training time. We evaluate our performance models and scalability optimizer using a state-of-the-art distributed DNN training framework on two benchmark applications. The results show that our performance models estimate training time with high estimation accuracy, and that our scalability optimizer correctly chooses the best configurations, minimizing the training time of distributed DNNs.
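To make the abstract's idea concrete, the sketch below shows the general shape of an analytical performance model plus a brute-force scalability optimizer over (model-parallelism, data-parallelism) configurations. The cost formulas, constants, `Config`, `step_time`, `epoch_time`, and `best_config` are all illustrative assumptions for this sketch, not the paper's actual model or API.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    model_parts: int  # machines a single model replica is partitioned across
    replicas: int     # number of model replicas training in parallel


def step_time(cfg: Config, flops=1e12, machine_flops=1e11,
              weights_bytes=4e8, bandwidth=1e9, sync_overhead=0.05):
    """Estimated seconds per minibatch step (toy cost model)."""
    # Compute shrinks as the model is partitioned across more machines,
    # but model parallelism adds per-partition synchronization overhead.
    compute = flops / (machine_flops * cfg.model_parts)
    compute += sync_overhead * (cfg.model_parts - 1)
    # Each replica pushes/pulls its weight shard to the parameter server;
    # contention grows with the number of replicas sharing the server.
    comm = cfg.replicas * weights_bytes / (cfg.model_parts * bandwidth)
    return compute + comm


def epoch_time(cfg: Config, examples=1_000_000, batch=256):
    """Estimated seconds per epoch: replicas split the data set."""
    steps = examples / (batch * cfg.replicas)
    return steps * step_time(cfg)


def best_config(max_machines=64):
    """Exhaustively search configurations that fit the machine budget."""
    candidates = [Config(p, r)
                  for p, r in product([1, 2, 4, 8], [1, 2, 4, 8, 16])
                  if p * r <= max_machines]
    return min(candidates, key=epoch_time)
```

The point of the search is that neither partitioning axis is free: more model partitions cut compute time but add synchronization cost, and more replicas cut steps per epoch but add parameter-server traffic, so the optimum depends on the hardware constants fed to the model.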

References (35)
Léon Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. Proceedings of COMPSTAT'2010, pp. 177–186 (2010). DOI: 10.1007/978-3-7908-2604-3_16
Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, Karthik Kalyanaraman. Project Adam: Building an Efficient and Scalable Deep Learning Training System. Operating Systems Design and Implementation (OSDI), pp. 571–582 (2014). DOI: 10.5555/2685048.2685094
James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, Eric Xing. Solving the Straggler Problem with Bounded Staleness. Hot Topics in Operating Systems (HotOS), p. 22 (2013).
Danilo P. Mandic, Jonathon Chambers. Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. John Wiley & Sons, Inc. (2001).
Gurmeet Singh, Carl Kesselman, Ewa Deelman. A Provisioning Model and Its Comparison with Best-Effort for Performance-Cost Optimization in Grids. High Performance Distributed Computing (HPDC), pp. 117–126 (2007). DOI: 10.1145/1272366.1272382
P. Jogalekar, M. Woodside. Evaluating the Scalability of Distributed Systems. IEEE Transactions on Parallel and Distributed Systems, vol. 11, pp. 589–603 (2000). DOI: 10.1109/71.862209
D. H. Hubel, T. N. Wiesel. Receptive Fields of Single Neurones in the Cat's Striate Cortex. The Journal of Physiology, vol. 148, pp. 574–591 (1959). DOI: 10.1113/JPHYSIOL.1959.SP006308
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. Computer Vision and Pattern Recognition (CVPR), pp. 248–255 (2009). DOI: 10.1109/CVPR.2009.5206848
Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, vol. 86, pp. 2278–2324 (1998). DOI: 10.1109/5.726791