Accelerating hierarchical acoustic likelihood computation on graphics processors.

Authors: Miroslav Novak (miroslav@us.ibm.com), Pavel Kveton (pavel.kveton@cz.ibm.com)


Abstract: The paper presents a method for performance improvement of a speech recognition system by moving part of the computation, the acoustic likelihood evaluation, onto the Graphics Processor Unit (GPU). In such a system, the GPU operates as a low-cost yet powerful co-processor for linear algebra operations. The paper compares GPU implementations of two techniques for acoustic likelihood computation: full Gaussian computation over all components, and significantly faster Gaussian selection using hierarchical evaluation. The full computation is an ideal candidate for GPU implementation because of its matrix multiplication nature. The hierarchical Gaussian selection technique is commonly used on a CPU, since it leads to much better pruning of the computation volume. Pruning techniques are generally harder to implement on GPUs; nevertheless, the paper shows that hierarchical Gaussian computation can still be done on the GPU with better performance than the full computation.

Index Terms— Speech recognition, GPU, Parallelization

1. INTRODUCTION

Speech recognition is a CPU-intensive task. Acoustic likelihood evaluation usually represents 30-50% of the computation, which makes it a good target for optimization. A lot of work has been dedicated to increasing the scoring speed on the CPU. One of the most successful methods is Gaussian selection [1, 2], leading to a 3-4x speed-up in comparison with full Gaussian evaluation without noteworthy accuracy loss. Significant speed-up has also been achieved by use of the Streaming SIMD Extension (SSE) instruction set [2, 3].

Graphics processor units (GPUs) represent low-cost computational power that had long been reserved for image/video processing tasks. The introduction of General Purpose computation on GPUs (GPGPU, e.g. CUDA from NVIDIA [4], ATI Stream from AMD [5]) made it feasible to use GPUs for other tasks. Recently, the OpenCL [6] standard for cross-platform parallel programming was developed in cooperation with many industry-leading companies and institutions. Both NVIDIA and AMD provide an OpenCL SDK and drivers for their GPUs.

With the dawn of GPGPU environments, graphics processors have become affordable co-processors for the linear algebra operations of speech recognition. Recent works show contributions mainly in acoustic likelihood evaluation, which is straightforward to implement on the GPU due to its matrix multiplication nature. Dixon et al. [7] report GPU-based evaluation 4-6x faster than on-demand CPU evaluation on a large vocabulary continuous speech recognition (LVCSR) task. Cardinal et al. [3] report likelihood computation 5x faster than an SSE-optimized version and a 35% speed-up on an LVCSR task. Chong et al. [8] implemented a model-optimized likelihood evaluation together with the Viterbi search, showing an approx. 9x speed-up over the CPU, generalized later by You et al. [9] with comparable results. Shi et al. [10] employ the GPU for finding clusters of Gaussians for fMPE discriminative training, achieving a 17x speed-up on this task.

This paper compares the GPU implementation with both plain and SSE-optimized CPU implementations. Although the hierarchical evaluation requires global clusters, which are not very efficient on the GPU, an acceptable combination of GPU and CPU usage was found, yielding about a 2x speed-up in comparison with the full computation on the GPU.

2. SOFTWARE AND HARDWARE ENVIRONMENT

In this work we have used the NVIDIA CUDA architecture [4] with its SIMT (Single Instruction Multiple Thread) execution model. Among several options, we have chosen OpenCL [6] as the programming framework because it is platform-independent. The paper hence uses OpenCL terminology.

An OpenCL program defines a kernel, a code segment executed by work-items over an N-Dimensional range (NDRange); all work-items execute the same code. The NDRange is split into equally-sized work-groups. Work-items of one work-group run together on a streaming multiprocessor, where they can be synchronized and can share a small low-latency local memory (typically 16 kB). There is no communication available between different work-groups except for the termination of all work-items in the NDRange.

In addition to local memory, all work-items have access to global memory (large but high-latency, typically 1 GB). A portion of global memory can be declared constant (read-only, typically 64 kB) and is then cached. Each work-group is split into half-warps (two half-warps form a warp) which are executed together. Optimization of the memory accesses within a half-warp is crucial to achieve the best GPU performance.
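To make the terminology concrete, the following is a minimal OpenCL kernel sketch that scores one feature frame against a set of diagonal-covariance Gaussians; one work-item scores one Gaussian, and the work-group first stages the shared frame into local memory. The kernel name, buffer layout, and the constant DIM are illustrative assumptions, not the authors' implementation.

    #define DIM 40  /* assumed feature dimensionality, for illustration only */

    __kernel void score_gaussians(__global const float *means,    /* [numGauss * DIM] */
                                  __global const float *invVars,  /* [numGauss * DIM] */
                                  __global const float *logConst, /* [numGauss]       */
                                  __global const float *frame,    /* [DIM]            */
                                  __global float *scores,         /* [numGauss]       */
                                  const int numGauss)
    {
        __local float x[DIM];                 /* frame staged in low-latency local memory */

        /* work-items of one work-group cooperatively copy the frame */
        for (int d = get_local_id(0); d < DIM; d += get_local_size(0))
            x[d] = frame[d];
        barrier(CLK_LOCAL_MEM_FENCE);         /* synchronization exists only within a work-group */

        int g = get_global_id(0);             /* this work-item's position in the NDRange */
        if (g >= numGauss)
            return;

        float acc = logConst[g];              /* precomputed log normalization term */
        for (int d = 0; d < DIM; ++d) {
            float diff = x[d] - means[g * DIM + d];
            acc -= 0.5f * diff * diff * invVars[g * DIM + d];
        }
        scores[g] = acc;                      /* per-Gaussian log-likelihood of the frame */
    }

Per frame, each Gaussian thus contributes a dot product over the feature dimensions; batching many frames and components is what turns the full computation into the matrix multiplication mentioned in the introduction.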
Basically, it is recommended to use local or constant memory wherever possible in favor of global memory; both memory spaces, however, have an optimal access strategy on CUDA-enabled chips. If it is followed, the memory accesses of a half-warp are served in parallel, resulting in high throughput; otherwise the access is serialized, with a serious performance impact. Local memory is organized into banks and accesses should avoid bank conflicts; global memory accesses should be coalescent within a half-warp.
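As a sketch of what a coalesced access pattern would mean for the kernel above (again illustrative: the buffers meansT and invVarsT are assumed to hold the model parameters transposed, i.e. dimension-major), the scoring loop can be written so that work-items of a half-warp touch consecutive addresses:

    /* Variant of the scoring loop with transposed parameter buffers:
       for a fixed d, work-items g, g+1, ... of a half-warp read consecutive
       floats, so the global memory reads coalesce into a few wide transactions
       instead of being serialized. */
    float acc = logConst[g];
    for (int d = 0; d < DIM; ++d) {
        float diff = x[d] - meansT[d * numGauss + g];            /* coalesced across the half-warp */
        acc -= 0.5f * diff * diff * invVarsT[d * numGauss + g];
    }
    /* x[d] is the same local-memory address for all work-items in an iteration;
       such a broadcast read is served without bank conflicts. */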

References (8)
Frank Seide, Yu Shi, Frank K. Soong, GPU-accelerated Gaussian clustering for fMPE discriminative training, Conference of the International Speech Communication Association, pp. 944-947, 2008.
Pierre Dumouchel, Gilles Boulianne, Patrick Cardinal, Michel Comeau, GPU accelerated acoustic likelihood computations, Conference of the International Speech Communication Association, pp. 964-967, 2008.
Paul R. Dixon, Tasuku Oonishi, Sadaoki Furui, Fast acoustic computations using graphics processors, International Conference on Acoustics, Speech, and Signal Processing, pp. 4321-4324, 2009, DOI 10.1109/ICASSP.2009.4960585.
George Saon, Geoffrey Zweig, Daniel Povey, Anatomy of an extremely fast LVCSR decoder, Conference of the International Speech Communication Association, pp. 549-552, 2005.
H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, G. Zweig, The IBM 2004 conversational telephony system for rich transcription, International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 205-208, 2005, DOI 10.1109/ICASSP.2005.1415086.
Kisun You, Jike Chong, Youngmin Yi, Ekaterina Gonina, Christopher Hughes, Yen-Kuang Chen, Wonyong Sung, Kurt Keutzer, Parallel scalability in speech recognition, IEEE Signal Processing Magazine, vol. 26, pp. 124-135, 2009, DOI 10.1109/MSP.2009.934124.
Kurt Keutzer, Nadathur Rajagopalan Satish, Jike Chong, Youngmin Yi, Arlo Faria, Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors, 2008.
K.M. Knill, M.J.F. Gales, S.J. Young, Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs, International Conference on Spoken Language Processing, vol. 1, pp. 470-473, 1996, DOI 10.1109/ICSLP.1996.607156.