Towards realizing the potential of malleable jobs

Authors: Abhishek Gupta, Bilge Acun, Osman Sarood, Laxmikant V. Kale

DOI: 10.1109/HIPC.2014.7116905

Abstract: Malleable jobs are those which can dynamically shrink or expand the number of processors on which they are executing at runtime in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time, compared to traditional jobs. To realize these benefits, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs true shrink/expand, eliminating the need for any residual processes, requires little application programmer effort, and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing scheduling decisions. Performance results using Charm++ on the Stampede supercomputer show the efficacy, scalability, and benefits of our approach. Shrinking from 2k to 1k cores takes 16s, while expanding from 1k to 2k cores takes 40s. Also, we demonstrate the utility of malleable jobs in traditional as well as emerging scenarios, e.g., proactive fault tolerance in clouds.

References (22)
Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kalé. Proactive fault tolerance in MPI applications via task migration. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 485-496 (2006). DOI: 10.1007/11945918_47
Dror G. Feitelson, Larry Rudolph. Towards Convergence in Job Schedulers for Parallel Supercomputers. Job Scheduling Strategies for Parallel Processing, pp. 1-26 (1996). DOI: 10.1007/BFB0022284
Jan Hungershofer. On the combined scheduling of malleable and rigid jobs. Symposium on Computer Architecture and High Performance Computing, pp. 206-213 (2004). DOI: 10.1109/SBAC-PAD.2004.27
Su-Hui Chiang, Mary K. Vernon. Dynamic vs. Static Quantum-Based Parallel Processor Allocation. Job Scheduling Strategies for Parallel Processing, pp. 200-223 (1996). DOI: 10.1007/BFB0022295
Eric de Sturler, Milind Bhandarkar, L. V. Kalé. Object-Based Adaptive Load Balancing for MPI Programs (2000).
Milind Bhandarkar, Laxmikant V. Kalé, Eric de Sturler, Jay Hoeflinger. Adaptive Load Balancing for MPI Programs. International Conference on Computational Science, pp. 108-117 (2001). DOI: 10.1007/3-540-45718-6_13
Dror G. Feitelson, Larry Rudolph, Uwe Schwiegelshohn, Kenneth C. Sevcik, Parkson Wong. Theory and Practice in Parallel Job Scheduling. Job Scheduling Strategies for Parallel Processing, pp. 1-34 (1997). DOI: 10.1007/3-540-63574-2_14
Márcia C. Cera, Yiannis Georgiou, Olivier Richard, Nicolas Maillard, Philippe O. A. Navaux. Supporting malleability in parallel architectures with dynamic CPUSETs mapping and dynamic MPI. International Conference of Distributed Computing and Networking, vol. 5935, pp. 242-257 (2010). DOI: 10.1007/978-3-642-11322-2_26
Richard A. Dutton, Weizhen Mao. Online scheduling of malleable parallel jobs. IASTED International Conference on Parallel and Distributed Computing and Systems, pp. 136-141 (2007).