Abstract: The Tera Multithreaded Architecture (MTA) takes a different approach to scalable shared-memory system design: it tolerates latency by providing fast access to multiple threads of execution. The MTA employs a number of radical design ideas: hardware threads (streams) with frequent context switching, full-empty bits on every memory word, a flat memory hierarchy, and deep pipelines. Recent evaluations of the MTA have taken a top-down approach, porting applications and application benchmarks and comparing their absolute performance with that of conventional systems. While useful, these studies do not reveal how the MTA's unique hardware features affect an application. We present a bottom-up evaluation of the MTA based on a suite of microbenchmarks that examine in detail the underlying hardware mechanisms and the cost of runtime system support for multithreading. In particular, we measure memory, network, and instruction latencies; memory bandwidth; the cost of low-level synchronization via full-empty bits; the overhead of stream management; and the effects of software pipelining. These data should provide a foundation for performance modeling on the MTA. We also present results for list ranking on the MTA, an application that has traditionally been difficult to scale on conventional parallel systems.
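
For readers unfamiliar with list ranking, the following is a minimal sketch of the problem, assuming the common pointer-jumping formulation (often attributed to Wyllie). It is not taken from the paper and says nothing about how the MTA implementation is written; the names `list_rank` and `succ` are hypothetical, and the parallel rounds are simulated sequentially. Each node's rank is its distance to the tail of a linked list whose nodes are stored in arbitrary order, which is why the access pattern is irregular and latency-bound.

```python
def list_rank(succ):
    """Return each node's distance to the tail of a linked list.

    succ[i] is the index of node i's successor; the tail points to itself.
    (Hypothetical interface, chosen for illustration only.)
    """
    n = len(succ)
    nxt = list(succ)
    # The tail starts at rank 0; every other node is one hop from its successor.
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # Pointer jumping: each round doubles the distance a pointer spans,
    # so about log2(n) rounds are enough.
    rounds = max(1, (n - 1).bit_length())
    for _ in range(rounds):
        # Build new arrays from the old ones to mimic a synchronous
        # parallel update across all nodes.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


# List order 2 -> 0 -> 3 -> 4 -> 1, with node 1 as the tail.
ranks = list_rank([3, 1, 0, 4, 1])
```

In this sketch every round reads `rank[nxt[i]]` and `nxt[nxt[i]]` at essentially random addresses, the kind of access that a latency-tolerant design is meant to hide.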