Authors: Miroslav Novak, Pavel Kveton
pavel.kveton@cz.ibm.com, miroslav@us.ibm.com

ABSTRACT

The paper presents a method for performance improvement of a speech recognition system by moving part of the computation, the acoustic likelihood evaluation, onto a Graphics Processor Unit (GPU). In such a system, the GPU operates as a low-cost yet powerful co-processor for linear algebra operations. The paper compares GPU implementations of two techniques for acoustic likelihood computation: full Gaussian evaluation of all components, and a significantly faster Gaussian selection scheme using hierarchical evaluation. The full evaluation is an ideal candidate for GPU implementation because of its matrix multiplication nature. The hierarchical Gaussian selection technique is commonly used on a CPU since it leads to much better pruning of the computation volume. Pruning techniques are generally harder to implement on GPUs; nevertheless, the paper shows that hierarchical Gaussian computation can still be done with better performance than full computation.

Index Terms— Speech recognition, GPU, Parallelization

1. INTRODUCTION

Speech recognition is a CPU-intensive task. Acoustic likelihood evaluation usually represents 30-50% of the computation, which makes it a good candidate for optimization. A lot of work has been dedicated to increasing the scoring speed on the CPU. One of the most successful methods is Gaussian selection [1, 2], leading to a 3-4x speed-up in comparison to full Gaussian evaluation without noteworthy accuracy loss. Significant speed-up has also been achieved by use of the Streaming SIMD Extension (SSE) instruction set [2, 3].

Graphical processor units (GPUs) represent low-cost computational power that had long been reserved for image/video processing tasks. The introduction of General Purpose computations on GPUs (GPGPU, e.g. CUDA from NVIDIA [4], ATI Stream from AMD [5]) made it feasible to use GPUs for other tasks. Recently, the OpenCL [6] standard for cross-platform parallel programming was developed in cooperation by many industry-leading companies and institutions. Both NVIDIA and AMD provide an OpenCL SDK and drivers for their GPUs.

With the dawn of GPGPU environments, graphics processors have become affordable co-processors for the linear algebra operations of speech recognition. Recent works show the contribution of GPUs mainly for acoustic likelihood evaluation, which is straightforward to implement due to its matrix multiplication nature. Dixon et al. [7] report a GPU-based evaluation 4-6x faster than on-demand CPU evaluation on a large vocabulary continuous speech recognition (LVCSR) task. Cardinal et al. [3] report an evaluation 5x faster than an SSE-optimized one, resulting in a 35% speed-up on an LVCSR task. Chong et al. [8] implemented model-optimized evaluation together with the Viterbi search, showing an approx. 9x speed-up over the CPU, generalized later by You et al. [9] with comparable results. Shi et al. [10] employ the GPU for finding clusters of Gaussians in fMPE discriminative training, achieving a 17x speed-up on this task.

This paper compares the GPU implementation with both plain and SSE-optimized CPU implementations. Although the hierarchical Gaussian selection requires global clusters, which are not very efficient to evaluate on the GPU, an acceptable combination of GPU and CPU usage was found, yielding a 2x speed-up in comparison to full evaluation on the GPU.
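The matrix-multiplication nature mentioned above follows from a standard rewriting of the diagonal-covariance Gaussian log-likelihood as a dot product; the derivation below is supplied here for clarity and is not reproduced from the paper:

\log \mathcal{N}(x;\mu,\sigma^2)
  = K - \tfrac{1}{2}\sum_{d=1}^{D} \frac{(x_d-\mu_d)^2}{\sigma_d^2}
  = b + \sum_{d=1}^{D} \frac{\mu_d}{\sigma_d^2}\, x_d
      - \tfrac{1}{2}\sum_{d=1}^{D} \frac{x_d^2}{\sigma_d^2},
\qquad b = K - \tfrac{1}{2}\sum_{d=1}^{D} \frac{\mu_d^2}{\sigma_d^2}.

Hence, with the expanded feature vector \tilde{x} = (x_1,\dots,x_D,\; x_1^2,\dots,x_D^2,\; 1)^\top and the per-Gaussian weight vector w = (\mu_1/\sigma_1^2,\dots,\; -1/(2\sigma_1^2),\dots,\; b)^\top, each log-likelihood is the dot product w^\top\tilde{x}, and evaluating all Gaussians over a batch of frames becomes a single matrix multiplication L = W\tilde{X}.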
2. SOFTWARE AND HARDWARE ENVIRONMENT

In this work we have used GPUs based on the NVIDIA CUDA architecture [4] with its SIMT (Single Instruction Multiple Thread) execution model. Among several options, we have chosen OpenCL [6] as the programming framework because it is platform-independent. The paper hence uses OpenCL terminology.

An OpenCL program is organized around a kernel, a code segment executed over an N-Dimensional range (NDRange) of work-items (all work-items execute the same code). The NDRange is split into equally-sized work-groups. Work-items of one work-group run together on a streaming multiprocessor, where they can be synchronized and can share a small amount of low-latency local memory (typically 16kB). There is no communication available between different work-groups except upon termination of all work-items in the NDRange.

In addition to local memory, all work-items have access to global memory (large but high-latency, typically 1GB). A portion of global memory can be declared as constant memory (read-only, typically 64kB), which is then cached.

Each work-group is split into half-warps (two half-warps form a warp) which are executed together. Optimization of the memory accesses within a half-warp is crucial to achieve the best GPU performance. Basically, it is recommended to use local or constant memory wherever possible in favor of global memory; however, all memory spaces have an optimal access strategy on CUDA-enabled chips: if it is followed, a half-warp accesses memory in parallel, resulting in full throughput. Otherwise, the access is serialized with a serious performance impact. Local memory is organized into banks and accesses should avoid bank conflicts; global memory accesses should be coalesced within a half-warp.
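To make these constraints concrete, here is a minimal OpenCL C sketch of the full-evaluation matrix multiplication described above. It is an illustration, not code from the paper: the kernel name gauss_loglik, the buffer names W, X and L, and the dimension-major layout of W are all assumptions, chosen so that global reads are coalesced and local-memory reads are conflict-free broadcasts.

// Hypothetical sketch (not from the paper): one work-item scores one
// (Gaussian, frame) pair. W holds one row per expanded dimension and one
// column per Gaussian (dimension-major), so consecutive work-items read
// consecutive addresses, i.e. the global reads are coalesced.
__kernel void gauss_loglik(__global const float *W,   // dim x numGauss, dimension-major
                           __global const float *X,   // numFrames x dim, expanded features
                           __global float *L,         // numFrames x numGauss, output scores
                           const int numGauss,
                           const int dim,             // expanded dimension (2D+1)
                           __local float *xloc)       // dim floats of local memory
{
    int g     = get_global_id(0);   // Gaussian index
    int frame = get_global_id(1);   // frame index
    int lid   = get_local_id(0);
    int lsz   = get_local_size(0);

    // Cooperative, coalesced load of one frame vector into local memory.
    for (int d = lid; d < dim; d += lsz)
        xloc[d] = X[frame * dim + d];
    barrier(CLK_LOCAL_MEM_FENCE);   // every work-item now sees the full vector

    if (g < numGauss) {
        float acc = 0.0f;
        for (int d = 0; d < dim; ++d)
            // xloc[d] is a broadcast read (one address, no bank conflict);
            // W[d * numGauss + g] is coalesced across the half-warp.
            acc += W[d * numGauss + g] * xloc[d];
        L[frame * numGauss + g] = acc;
    }
}

The host side would enqueue this with clEnqueueNDRangeKernel over a 2D NDRange such as {numGauss rounded up to a multiple of 64, numFrames} with a work-group size of {64, 1}, passing the local buffer via clSetKernelArg(kernel, 5, dim * sizeof(float), NULL). Work-items whose Gaussian index falls outside the model still take part in the cooperative load and the barrier, which keeps the work-group-wide barrier semantics valid.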