Workload-driven design and evaluation of large-scale data-centric systems

Authors: Yanpei Chen, Randy H. Katz

Abstract: Large-scale data-centric systems help organizations store, manipulate, and derive value from large volumes of data. They consist of distributed components spread across a scalable number of connected machines, and involve complex software/hardware stacks with multiple semantic layers. These systems solve established problems involving large amounts of data, while catalyzing new, data-driven businesses such as search engines, social networks, and cloud computing and data storage service providers. The complexity, diversity, scale, and rapid evolution of large-scale data-centric systems make it challenging to develop intuition about these systems, gain operational experience, and improve performance. It is an important research problem to develop a method to design and evaluate such systems based on the empirical behavior of the targeted workloads. Using an unprecedented collection of nine industrial workload traces of business-critical large-scale data-centric systems, we develop a workload-driven design and evaluation method for these systems and apply the method to address previously unsolved design problems. Specifically, the dissertation contributes the following:
1. A conceptual framework for breaking down workloads for large-scale data-centric systems into data access patterns, computation patterns, and load arrival patterns.
2. A workload analysis and synthesis method that uses multi-dimensional, non-parametric statistics to extract insights and produce representative behavior.
3. Case studies of workload analysis for industrial deployments of MapReduce and enterprise network storage systems, two examples of large-scale data-centric systems.
4. Case studies of workload-driven design and evaluation of an energy-efficient MapReduce system and Internet datacenter network transport protocol pathologies, two research topics that require workload-specific insights to address.
Overall, the dissertation develops a more objective and systematic understanding of an emerging and important class of computer systems. The work in this dissertation helps further accelerate the adoption of large-scale data-centric systems to solve real-life problems relevant to business, science, and day-to-day consumers.
