Authors: Amir Bahmani, Frank Mueller
DOI: 10.1016/J.JPDC.2017.06.008
Keywords: TRACE (psycholinguistics), Distributed computing, Scalability, Set (abstract data type), Partition coefficient, Computer science, Time complexity, Node (networking), Parallel computing, Cluster analysis
Abstract: Communication traces help developers of high-performance computing (HPC) applications understand and improve their codes. When applications run on large-scale HPC facilities, the scalability of tracing tools becomes a challenge. To address this problem, traces can be clustered into groups of processes that exhibit similar behavior. Instead of collecting trace information for each individual node, it then suffices to collect traces from a small set of representative nodes, namely one per cluster. However, clustering algorithms themselves need to have low overhead, be scalable, and adapt to application characteristics. We devised an adaptive clustering algorithm called ACURDION for tracing the MPI communication of a code with O(log P) time complexity. First, ACURDION identifies the parameters that differ across processes by using a logarithmic algorithm, Adaptive Signature Building. Second, it clusters the processes based on those parameters. Experiments show that just nine nodes/clusters suffice to capture the communication behavior of all nodes for a wide set of HPC benchmark codes while retaining sufficient accuracy of trace events. In summary, ACURDION improves automation over prior approaches.
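Illustration: the two steps named in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each process's communication behavior is summarized as a fixed-length tuple of parameters, it simulates the logarithmic reduction sequentially instead of running it over MPI, and it groups processes centrally in the second step. The function names and the tuple layout (message count, neighbor offset, tag) are hypothetical.

```python
from collections import defaultdict


def differing_parameters(signatures):
    """Find which parameter positions are not identical on all ranks.

    Simulates a binary-tree reduction: in each round, rank r absorbs the
    summary held by rank r + stride, so P ranks need ceil(log2(P)) rounds.
    Returns (list of differing positions, number of rounds).
    """
    if not signatures:
        raise ValueError("need at least one rank")
    # Per-rank summary: one (value, differs_flag) pair per parameter.
    summaries = [[(v, False) for v in sig] for sig in signatures]
    n = len(summaries)
    stride, rounds = 1, 0
    while stride < n:
        for r in range(0, n, 2 * stride):
            partner = r + stride
            if partner < n:
                # A parameter differs if either subtree already saw a
                # difference or the two subtree values disagree.
                summaries[r] = [
                    (a, da or db or a != b)
                    for (a, da), (b, db) in zip(summaries[r], summaries[partner])
                ]
        stride *= 2
        rounds += 1
    differing = [i for i, (_, d) in enumerate(summaries[0]) if d]
    return differing, rounds


def cluster_by_signature(signatures, differing):
    """Group ranks whose values agree on every differing parameter.

    Assumption: done centrally here for clarity; this step is not O(log P).
    """
    clusters = defaultdict(list)
    for rank, sig in enumerate(signatures):
        key = tuple(sig[i] for i in differing)
        clusters[key].append(rank)
    return dict(clusters)


def representatives(clusters):
    """Pick the lowest rank of each cluster as the node to trace."""
    return sorted(min(ranks) for ranks in clusters.values())


# Hypothetical 8-rank run: the two boundary ranks behave differently
# from the six interior ranks.
sigs = [(5, 1, 0)] + [(10, 1, 0)] * 6 + [(5, -1, 0)]
diff, rounds = differing_parameters(sigs)
clusters = cluster_by_signature(sigs, diff)
reps = representatives(clusters)
```

In this toy run only the first two parameters vary, so the eight ranks fall into three clusters and tracing ranks 0, 1, and 7 would stand in for all of them.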