Authors: Yanfang Le, Feng Wang, Jiangchuan Liu, Funda Ergun
Keywords:
Abstract: MapReduce has emerged as a powerful tool for distributed and scalable processing of voluminous data. For skewed data input, load balancing among the worker nodes is necessary to minimize the overall finishing time, which, however, can incur massive data movement in the datacenter network. In this paper, we examine for the first time the problem of datacenter-network-aware load balancing in the shuffle subphase of MapReduce. Unlike earlier studies, which generally assume that the network inside a datacenter has negligible delay and infinite capacity, we consider the traffic bottlenecks in real datacenter networks by introducing constraints on the available bandwidth, and we demonstrate that the corresponding problem can be decomposed into two problems, of network flow and load balancing, respectively. We show effective solutions for both of them, which together yield a complete solution towards near-optimal datacenter-network-aware load balancing. A much simpler yet performance-wise comparable greedy algorithm is also developed for fast implementation in practice. The effectiveness of our solutions has been demonstrated on both synthetic and public datasets.
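The abstract does not spell out the greedy algorithm, so the following is only an illustrative sketch of the general idea of bandwidth-constrained greedy load balancing, not the authors' method. It assumes a deliberately simplified model: each reducer has a single inbound capacity figure standing in for the bandwidth constraint, and each intermediate partition has a known size. The names `greedy_assign`, `makespan`, `partition_sizes`, and `link_capacity` are hypothetical.

```python
from typing import Dict, List


def greedy_assign(partition_sizes: List[int],
                  link_capacity: List[int]) -> Dict[int, int]:
    """Assign partitions to reducers, largest partition first.

    Assumption (not from the paper): link_capacity[r] is the total amount
    of data reducer r can receive, a stand-in for a bandwidth constraint.
    Each partition goes to the least-loaded reducer that can still take it.
    """
    load = [0] * len(link_capacity)
    assignment: Dict[int, int] = {}
    # Handle big partitions first so small ones can fill the gaps.
    order = sorted(range(len(partition_sizes)),
                   key=lambda p: partition_sizes[p], reverse=True)
    for p in order:
        size = partition_sizes[p]
        feasible = [r for r in range(len(link_capacity))
                    if load[r] + size <= link_capacity[r]]
        if not feasible:
            raise ValueError(f"no reducer can receive partition {p}")
        target = min(feasible, key=lambda r: load[r])
        load[target] += size
        assignment[p] = target
    return assignment


def makespan(partition_sizes: List[int],
             assignment: Dict[int, int],
             num_reducers: int) -> int:
    """Return the heaviest reducer load, a proxy for finishing time."""
    load = [0] * num_reducers
    for p, r in assignment.items():
        load[r] += partition_sizes[p]
    return max(load)


sizes = [8, 5, 4, 3, 2]
capacity = [12, 12, 6]
plan = greedy_assign(sizes, capacity)
worst = makespan(sizes, plan, len(capacity))
```

In this toy instance the third reducer's tight capacity keeps the largest partition off it, which is the kind of interaction between network limits and load balance that the abstract describes. A real datacenter topology would need per-link constraints rather than one figure per reducer.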