Efficient parallelization of the Discrete Wavelet Transform algorithm using memory-oblivious optimizations

Authors: Anastasis Keliris, Vasilis Dimitsas, Olympia Kremmyda, Dimitris Gizopoulos, Michail Maniatakos

DOI: 10.1109/PATMOS.2015.7347583

Abstract: As the rate of single-thread CPU performance improvement per generation has diminished due to lower transistor-speed scaling and energy-related issues, researchers and industry have shifted their interest towards multi-core and many-core architectures for improving performance. Comparisons between optimized applications for parallel architectures have been quantified many times in the literature, but contradictory results have been reported, mainly due to biased methods of evaluating and comparing these architectures. In this paper, we present memory-oblivious optimizations for the widely used Discrete Wavelet Transform (DWT), and provide detailed comparisons of the algorithm on Intel and AMD CPUs, Nvidia GPUs, as well as Intel's Xeon Phi coprocessor. Our results indicate that, compared to the respective non-optimized single-thread implementations, our optimization delivers speedups of up to 17.9×–197.2× for the various architectures examined. Furthermore, compared to the state-of-the-art, the presented CPU and GPU implementations are 2.6× and 1.3× faster, respectively, than the fastest DWT implementations currently available in the literature. No comparison to the state-of-the-art can be made for the Xeon Phi, as, to the best of our knowledge, this is the first study that optimizes DWT for this newfangled architecture.
