Authors: Saurabh Bagchi, Tanzima Zerin Islam
DOI:
Keywords:
Abstract: By leveraging enormous computational capabilities, scientists today are able to make significant progress in solving problems ranging from finding a cure for cancer to using fusion to solve the world's clean energy crisis. The number of components in extreme-scale computing environments is growing exponentially. As the failure rate of each component starts factoring in, the reliability of the overall system decreases proportionately. Hence, in spite of having such capabilities, these groundbreaking simulations may never run to completion. The only way to ensure their timely completion is to make these systems reliable, so that no failure can hinder the progress of science.

On such systems, long-running scientific applications periodically store their execution state in checkpoint files on stable storage, and recover from a failure by restarting from the last saved checkpoint file. Resilient high-throughput and high-performance systems enable scientists to simulate problems at granularities finer than ever thought possible. Unfortunately, this explosion in computational capabilities also generates large amounts of application state. As a result, today's checkpointing systems crumble under the increased volume of checkpoint data. Additionally, network and I/O bandwidth are not growing nearly as fast as compute cycles. These two factors have caused scalability challenges for checkpointing systems. The focus of this thesis is to develop scalable checkpointing systems for two different execution environments: grids and clusters.

In a grid environment, machine owners voluntarily share their idle CPU cycles with other users of the system, as long as the performance degradation of host processes remains below a certain threshold. The challenge in such an environment is to ensure end-to-end application performance given the high rate of machine unavailability and guest-job eviction. Today's grid checkpointing systems often use expensive, dedicated checkpoint servers. In this thesis, we present a checkpointing system, FALCON, which uses the available disk resources of grid machines as shared checkpoint repositories. However, an unavailable storage host can lead to loss of checkpoint data. Therefore, we model the failures of storage hosts and predict the availability of checkpoint repositories. Experiments on the production grid DiaGrid show that FALCON improves the execution time of benchmark applications that write gigabytes of checkpoint data by between 11% and 44% compared to the widely used Condor checkpointing solutions.

In a high-performance computing (HPC) environment, applications store their checkpoints in a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. These overheads force large-scale applications to reduce their checkpoint frequency, which means more compute time is lost in the event of a failure. We alleviate this problem by developing MCRENGINE. MCRENGINE aggregates multiple checkpoints with knowledge of their data semantics, gathered through I/O libraries such as HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility by up to 115% over simple concatenation followed by compression. Our evaluation shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% compared to a baseline with no aggregation or compression.

We believe the contributions made in this thesis serve as a good foundation for further research in improving the reliability of large-scale, distributed computing environments.
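The semantics-aware aggregation idea can be illustrated with a small sketch: rather than concatenating whole checkpoint files and compressing the result, datasets that represent the same variable are grouped across checkpoints before compression, so the compressor sees similar data next to similar data. The Python sketch below is illustrative only, assuming HDF5 checkpoints read with h5py and compressed with zlib; the file names and the merge-by-dataset-name heuristic are assumptions for demonstration, not the thesis's actual MCRENGINE implementation.

```python
# Illustrative sketch: semantics-aware aggregation vs. naive concatenation.
# Assumes per-process HDF5 checkpoint files; names below are hypothetical.
import zlib
import h5py
import numpy as np


def naive_concat_compress(paths):
    """Baseline: concatenate raw checkpoint bytes, then compress."""
    blob = b"".join(open(p, "rb").read() for p in paths)
    return zlib.compress(blob)


def semantic_merge_compress(paths):
    """Group datasets with the same name across checkpoints (i.e., the
    same simulation variable) before compressing, so redundancy across
    processes is exposed to the compressor."""
    merged = {}  # dataset name -> list of arrays, one per checkpoint
    for p in paths:
        with h5py.File(p, "r") as f:
            def collect(name, obj):
                if isinstance(obj, h5py.Dataset):
                    merged.setdefault(name, []).append(obj[...])
            f.visititems(collect)
    # Lay out each variable's data contiguously, then compress once.
    blob = b"".join(
        np.concatenate([a.ravel() for a in arrays]).tobytes()
        for _, arrays in sorted(merged.items())
    )
    return zlib.compress(blob)


if __name__ == "__main__":
    ckpts = ["rank0.h5", "rank1.h5", "rank2.h5"]  # hypothetical checkpoints
    print(len(naive_concat_compress(ckpts)), "bytes: concatenate, then compress")
    print(len(semantic_merge_compress(ckpts)), "bytes: merge by variable, then compress")
```

Under the assumption that corresponding variables across processes hold numerically similar values, the merged layout typically compresses better than the concatenated one, which is the intuition behind the reported compressibility gains.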