Authors: Abhishek Gupta, Bilge Acun, Osman Sarood, Laxmikant V. Kale
DOI: 10.1109/HIPC.2014.7116905
Keywords:
Abstract: Malleable jobs are those which can dynamically shrink or expand the number of processors on which they are executing at runtime in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time, compared to traditional jobs. To realize these benefits, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs true shrink/expand, eliminating the need for any residual processes, requires little application programmer effort, and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and support asynchronous split-phase scheduling decisions. Performance results using Charm++ on the Stampede supercomputer show the efficacy, scalability, and benefits of our approach. Shrinking from 2k to 1k cores takes 16s, while expanding from 1k to 2k cores takes 40s. Also, we demonstrate the utility of our approach for traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.