Authors: Matthew Joseph Sottile, Aroon Nataraj, Allen Malony, Alan Morris, Sameer Shende
DOI:
Keywords:
Abstract: Online, or real-time, application performance monitoring tracks performance characteristics during execution rather than post-mortem. This opens up possibilities that are otherwise unavailable, such as real-time visualization and application performance steering, which can be useful for long-running applications. Two fundamental components of such a performance monitor are the measurement system and the transport system. The former captures performance metrics of individual contexts (processes, threads). The latter enables querying the parallel/distributed state from the different contexts and also allows measurement control. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low-overhead while still providing a useful performance reporting capability. We adapt and combine two existing, mature systems, Tuning and Analysis Utilities (TAU) and Supermon, to address this problem. TAU performs the measurement, while Supermon collects the distributed measurement state. Our experiments show that this novel approach of using a cluster monitor, Supermon, as the transport for online performance data from TAU leads to very low-overhead application monitoring, as well as other benefits unavailable with a traditional transport such as NFS.