Authors: Hongbin Zha, Zhouchen Lin, Wenwu Ou, Chen Xu, Zhirong Wang
DOI:
Keywords: Quantization (signal processing), Optimization problem, Computer science, Binary code, Algorithm, Recurrent neural network, Feedforward neural network, Contextual image classification, Inference
Abstract: Recurrent neural networks have achieved excellent performance in many applications. However, on portable devices with limited resources, the models are often too large to deploy. For server-side applications with large-scale concurrent requests, inference latency can also be critical given costly computing resources. In this work, we address these problems by quantizing the network, both weights and activations, into multiple binary codes {-1, +1}. We formulate the quantization as an optimization problem …
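
To make the idea of "multiple binary codes" concrete, the following is a minimal sketch of representing a real-valued vector w as a sum of k scaled codes, w ≈ alpha_1 * b_1 + ... + alpha_k * b_k, with every entry of each b_i in {-1, +1}. It uses a simple greedy residual scheme chosen purely for illustration; the abstract above is truncated before it describes how the optimization problem is solved, so this should not be read as the authors' algorithm. The function names (`greedy_multibit_quantize`, `reconstruct`, `squared_error`) and the example vector are hypothetical.

```python
def greedy_multibit_quantize(w, k):
    """Approximate w by sum_i alphas[i] * codes[i], each code entry in {-1, +1}.

    Illustrative greedy scheme (an assumption, not the paper's method):
    at each step, take the sign of the current residual as the binary code
    and the mean absolute residual as its scale, then subtract.
    """
    residual = list(w)
    alphas, codes = [], []
    for _ in range(k):
        code = [1 if r >= 0 else -1 for r in residual]
        # For a fixed sign code, the least-squares optimal scale is mean(|residual|).
        alpha = sum(abs(r) for r in residual) / len(residual)
        alphas.append(alpha)
        codes.append(code)
        residual = [r - alpha * b for r, b in zip(residual, code)]
    return alphas, codes


def reconstruct(alphas, codes):
    """Rebuild the approximation from the scales and binary codes."""
    n = len(codes[0])
    return [sum(a * c[j] for a, c in zip(alphas, codes)) for j in range(n)]


def squared_error(w, w_hat):
    """Sum of squared differences between w and its approximation."""
    return sum((x - y) ** 2 for x, y in zip(w, w_hat))


if __name__ == "__main__":
    w = [0.9, -0.4, 0.1, -1.2]  # hypothetical weight vector
    for bits in (1, 2, 3):
        alphas, codes = greedy_multibit_quantize(w, bits)
        print(bits, squared_error(w, reconstruct(alphas, codes)))
```

In this sketch each added bit can only reduce (or leave unchanged) the squared error, because every step picks the least-squares optimal scale for its sign code. Storing k sign vectors plus k scalars in place of full-precision values is what makes the representation compact.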