Authors: Chien-Sheng Yang, Ramtin Pedarsani, A. Salman Avestimehr
Keywords: Cloud computing, Markov model, Robustness (computer science), Data encoding, Computer science, Computation, Markov chain, High variability, Distributed computing
Abstract: In modern distributed computing systems, unpredictable and unreliable infrastructures result in high variability of resources. Meanwhile, there is significantly increasing demand for timely, event-driven services with deadline constraints. Motivated by measurements over Amazon EC2 clusters, we consider a two-state Markov model for the variability of computation speed in cloud networks. In this model, each worker can be in either a good state or a bad state in terms of computation speed, and the transition between these states is modeled as a Markov chain which is unknown to the scheduler. We then consider a Coded Computing framework, in which the data is possibly encoded and stored at the worker nodes in order to provide robustness against nodes that may be in a bad state. With computation requests submitted to the system with deadlines, our goal is to design the optimal computation-load allocation scheme and the optimal data encoding scheme that maximize the throughput (i.e., the average number of tasks that are accomplished before their deadline). Our main result is the development of a dynamic computation strategy called the Lagrange Estimate-and-Allocate (LEA) strategy, which achieves the optimal throughput. It is shown that, compared to the static allocation strategy, LEA improves the throughput by 1.4x to 17.5x in various scenarios via simulations and by 1.27x to 6.5x in experiments over Amazon EC2 clusters.
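
Illustrative sketch (not the authors' LEA strategy): the two-state Markov worker model described in the abstract can be simulated in a few lines of Python, together with a simple empirical estimate of the transition probabilities. The function names, the parameters p_gb and p_bg, and the assumption that the worker's state is observed directly at every step are all hypothetical simplifications made for this example; the abstract does not specify how the scheduler learns the chain.

```python
import random


def simulate_worker(p_gb, p_bg, steps, seed=0):
    """Simulate one worker's speed state over time (True = good, False = bad).

    p_gb: assumed probability of moving from good to bad in one step.
    p_bg: assumed probability of moving from bad to good in one step.
    """
    rng = random.Random(seed)
    state = True  # hypothetical choice: start in the good state
    states = []
    for _ in range(steps):
        states.append(state)
        if state:
            state = rng.random() >= p_gb  # stay good unless a good->bad flip occurs
        else:
            state = rng.random() < p_bg   # recover with probability p_bg
    return states


def estimate_transitions(states):
    """Return empirical (p_gb, p_bg) from a directly observed state sequence."""
    good_visits = bad_visits = good_to_bad = bad_to_good = 0
    for cur, nxt in zip(states, states[1:]):
        if cur:
            good_visits += 1
            good_to_bad += int(not nxt)
        else:
            bad_visits += 1
            bad_to_good += int(nxt)
    p_gb = good_to_bad / good_visits if good_visits else 0.0
    p_bg = bad_to_good / bad_visits if bad_visits else 0.0
    return p_gb, p_bg


if __name__ == "__main__":
    history = simulate_worker(p_gb=0.2, p_bg=0.5, steps=20000, seed=1)
    print(estimate_transitions(history))
```

A scheduler in the spirit of an estimate-then-allocate approach could use such estimates to predict which workers are likely to be in the good state in the next round; how LEA actually performs estimation and load allocation is defined in the paper itself, not here.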