Abstract: MapReduce has emerged as a powerful tool for distributed and scalable processing of voluminous data. Unlike earlier heuristics that address skew only in the late reduce stage or after all data have been seen, we mitigate skew from the moment data are input, assume no prior knowledge of the data distribution, and require no synchronized operations. We show that finding the optimal strategy is a constrained version of the online minimum makespan problem and, in the setting where pairs with identical keys must be scheduled to the same machine, propose an algorithm with a provable 2-competitive ratio. We further suggest a sample-based enhancement that, probabilistically, achieves a 3/2-competitive ratio with bounded error. Re-examining the problem, we found that the datacenter network can become a bottleneck during the shuffle subphase of MapReduce; this can lead to poor overall performance even under a balanced workload and thus needs to be addressed. Earlier studies either assume negligible delay and infinite capacity inside the network, or use hop count as the cost measurement. We instead consider the realistic bandwidth constraints of real-world networks and present an effective solution toward datacenter-network-aware load balancing.
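To make the scheduling constraint concrete, the following is a minimal illustrative sketch (not the paper's algorithm) of a greedy online scheduler that respects the requirement that pairs with identical keys go to the same machine: the first time a key appears, it is bound to the currently least-loaded machine, and all later pairs with that key follow it. The function name, input format `(key, weight)`, and return values are assumptions for illustration only.

```python
def online_key_schedule(pairs, num_machines):
    """Greedy online scheduler under the same-key-same-machine constraint.

    pairs        -- iterable of (key, weight) tuples arriving online
    num_machines -- number of reduce machines

    Returns (key_to_machine, loads): the key-to-machine binding and the
    final per-machine load vector.
    """
    loads = [0] * num_machines
    key_to_machine = {}
    for key, weight in pairs:
        if key not in key_to_machine:
            # Bind a newly seen key to the currently least-loaded machine.
            key_to_machine[key] = min(range(num_machines),
                                      key=loads.__getitem__)
        # All pairs with an already-bound key must follow that binding.
        loads[key_to_machine[key]] += weight
    return key_to_machine, loads


# Example: with two machines, keys "a" and "b" land on different machines,
# and every later pair with key "a" stays on machine 0.
assignment, loads = online_key_schedule(
    [("a", 1), ("b", 1), ("a", 1), ("c", 2)], num_machines=2)
```

Because a key's future weight is unknown when it is bound, a single heavy key can still unbalance the machines; the abstract's 2-competitive guarantee and the sample-based 3/2-competitive refinement concern exactly this worst case.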