Accelerating hierarchical acoustic likelihood computation on graphics processors.

Authors: Miroslav Novak (miroslav@us.ibm.com), Pavel Kveton (pavel.kveton@cz.ibm.com)


Abstract: The paper presents a method for performance improvement of a speech recognition system by moving part of the computation, the acoustic likelihood evaluation, onto the Graphics Processor Unit (GPU). In such a system, the GPU operates as a low-cost yet powerful co-processor for linear algebra operations. The paper compares GPU implementations of two techniques for acoustic likelihood computation: full Gaussian computation over all components, and significantly faster Gaussian selection using hierarchical evaluation. The full computation is an ideal candidate for GPU implementation because of its matrix multiplication nature. The hierarchical Gaussian selection technique is commonly used on a CPU, since it leads to much better pruning of the computation volume. Pruning techniques are generally harder to implement on GPUs; nevertheless, the paper shows that hierarchical Gaussian computation can still be done on the GPU with better performance than the full computation.

Index Terms— Speech recognition, GPU, Parallelization

1. INTRODUCTION

Speech recognition is a CPU-intensive task. Acoustic likelihood evaluation usually represents 30-50% of the computation, which makes it a good target for optimization. A lot of work has been dedicated to increasing the scoring speed on the CPU. One of the most successful methods is Gaussian selection [1, 2], leading to a 3-4x speed-up in comparison with full Gaussian evaluation without noteworthy accuracy loss. Significant speed-up has also been achieved by use of the Streaming SIMD Extension (SSE) instruction set [2, 3].

Graphics processor units (GPUs) represent low-cost computational power that had long been reserved for image/video processing tasks. The introduction of General Purpose computation on GPUs (GPGPU, e.g. CUDA from NVIDIA [4], ATI Stream from AMD [5]) made it feasible to use GPUs for other tasks. Recently, the OpenCL [6] standard for cross-platform parallel programming was developed in cooperation with many industry-leading companies and institutions. Both NVIDIA and AMD provide an OpenCL SDK and drivers for their GPUs.

With the dawn of GPGPU environments, graphics processors have become affordable co-processors for the linear algebra operations of speech recognition. Recent works show contributions mainly in acoustic likelihood evaluation, which is straightforward to implement on the GPU due to its matrix multiplication nature. Dixon et al. [7] report GPU-based evaluation 4-6x faster than on-demand CPU evaluation on a large vocabulary continuous speech recognition (LVCSR) task. Cardinal et al. [3] report likelihood computation 5x faster than an SSE-optimized version and a 35% speed-up on an LVCSR task. Chong et al. [8] implemented a model-optimized likelihood evaluation together with the Viterbi search, showing an approx. 9x speed-up over the CPU, generalized later by You et al. [9] with comparable results. Shi et al. [10] employ the GPU for finding clusters of Gaussians for fMPE discriminative training, achieving a 17x speed-up on this task.

This paper compares the GPU implementation with both plain and SSE-optimized CPU implementations. Although the hierarchical evaluation requires global clusters, which are not very efficient on the GPU, an acceptable combination of GPU and CPU usage was found, yielding about a 2x speed-up in comparison with the full computation on the GPU.

2. SOFTWARE AND HARDWARE ENVIRONMENT

In this work we have used the NVIDIA CUDA architecture [4] with its SIMT (Single Instruction Multiple Thread) execution model. Among several options, we have chosen OpenCL [6] as the programming framework because it is platform-independent. The paper hence uses OpenCL terminology.

An OpenCL program defines a kernel, a code segment executed by work-items over an N-Dimensional range (NDRange); all work-items execute the same code. The NDRange is split into equally-sized work-groups. Work-items of one work-group run together on a streaming multiprocessor, where they can be synchronized and can share a small low-latency local memory (typically 16 kB). There is no communication available between different work-groups except for the termination of all work-items in the NDRange.

In addition to local memory, all work-items have access to global memory (large but high-latency, typically 1 GB). A portion of global memory can be declared constant (read-only, typically 64 kB) and is then cached. Each work-group is split into half-warps (two half-warps form a warp) which are executed together. Optimization of the memory accesses within a half-warp is crucial to achieve the best GPU performance.
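To make the terminology concrete, the following is a minimal OpenCL kernel sketch that scores one feature frame against a set of diagonal-covariance Gaussians; one work-item scores one Gaussian, and the work-group first stages the shared frame into local memory. The kernel name, buffer layout, and the constant DIM are illustrative assumptions, not the authors' implementation.

    #define DIM 40  /* assumed feature dimensionality, for illustration only */

    __kernel void score_gaussians(__global const float *means,    /* [numGauss * DIM] */
                                  __global const float *invVars,  /* [numGauss * DIM] */
                                  __global const float *logConst, /* [numGauss]       */
                                  __global const float *frame,    /* [DIM]            */
                                  __global float *scores,         /* [numGauss]       */
                                  const int numGauss)
    {
        __local float x[DIM];                 /* frame staged in low-latency local memory */

        /* work-items of one work-group cooperatively copy the frame */
        for (int d = get_local_id(0); d < DIM; d += get_local_size(0))
            x[d] = frame[d];
        barrier(CLK_LOCAL_MEM_FENCE);         /* synchronization exists only within a work-group */

        int g = get_global_id(0);             /* this work-item's position in the NDRange */
        if (g >= numGauss)
            return;

        float acc = logConst[g];              /* precomputed log normalization term */
        for (int d = 0; d < DIM; ++d) {
            float diff = x[d] - means[g * DIM + d];
            acc -= 0.5f * diff * diff * invVars[g * DIM + d];
        }
        scores[g] = acc;                      /* per-Gaussian log-likelihood of the frame */
    }

Per frame, each Gaussian thus contributes a dot product over the feature dimensions; batching many frames and components is what turns the full computation into the matrix multiplication mentioned in the introduction.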
Basically, it is recommended to use local or constant memory wherever possible in favor of global memory; both memory spaces, however, have an optimal access strategy on CUDA-enabled chips. If it is followed, the memory accesses of a half-warp are served in parallel, resulting in high throughput; otherwise the access is serialized, with a serious performance impact. Local memory is organized into banks and accesses should avoid bank conflicts; global memory accesses should be coalescent within a half-warp.
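As a sketch of what a coalesced access pattern would mean for the kernel above (again illustrative: the buffers meansT and invVarsT are assumed to hold the model parameters transposed, i.e. dimension-major), the scoring loop can be written so that work-items of a half-warp touch consecutive addresses:

    /* Variant of the scoring loop with transposed parameter buffers:
       for a fixed d, work-items g, g+1, ... of a half-warp read consecutive
       floats, so the global memory reads coalesce into a few wide transactions
       instead of being serialized. */
    float acc = logConst[g];
    for (int d = 0; d < DIM; ++d) {
        float diff = x[d] - meansT[d * numGauss + g];            /* coalesced across the half-warp */
        acc -= 0.5f * diff * diff * invVarsT[d * numGauss + g];
    }
    /* x[d] is the same local-memory address for all work-items in an iteration;
       such a broadcast read is served without bank conflicts. */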

References (8)
Frank Seide, Yu Shi, Frank K. Soong, GPU-accelerated Gaussian clustering for fMPE discriminative training, Conference of the International Speech Communication Association, pp. 944-947, 2008.
Pierre Dumouchel, Gilles Boulianne, Patrick Cardinal, Michel Comeau, GPU accelerated acoustic likelihood computations, Conference of the International Speech Communication Association, pp. 964-967, 2008.
Paul R. Dixon, Tasuku Oonishi, Sadaoki Furui, Fast acoustic computations using graphics processors, International Conference on Acoustics, Speech, and Signal Processing, pp. 4321-4324, 2009, DOI 10.1109/ICASSP.2009.4960585.
George Saon, Geoffrey Zweig, Daniel Povey, Anatomy of an extremely fast LVCSR decoder, Conference of the International Speech Communication Association, pp. 549-552, 2005.
H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, G. Zweig, The IBM 2004 conversational telephony system for rich transcription, International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 205-208, 2005, DOI 10.1109/ICASSP.2005.1415086.
Kisun You, Jike Chong, Youngmin Yi, Ekaterina Gonina, Christopher Hughes, Yen-Kuang Chen, Wonyong Sung, Kurt Keutzer, Parallel scalability in speech recognition, IEEE Signal Processing Magazine, vol. 26, pp. 124-135, 2009, DOI 10.1109/MSP.2009.934124.
Kurt Keutzer, Nadathur Rajagopalan Satish, Jike Chong, Youngmin Yi, Arlo Faria, Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors, 2008.
K.M. Knill, M.J.F. Gales, S.J. Young, Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs, International Conference on Spoken Language Processing, vol. 1, pp. 470-473, 1996, DOI 10.1109/ICSLP.1996.607156.