Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference

Authors: Yaman Umuroglu, Nhan Tran, Nicholas J. Fraser, Javier M. Duarte, Benjamin Hawks

DOI:

Keywords:

Abstract: Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower latency to higher data throughput and more efficient energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra low latency applications targeting high energy physics use cases. However, the techniques developed here have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques such as regularization, batch normalization, and different pruning schemes on multiple computational or information efficiency metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similarly to, or better than, standard neural architecture optimization techniques in terms of computational efficiency. While the accuracy on the benchmark may be similar, the information content of the network can vary significantly depending on the training configuration.
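Concretely, quantization-aware pruning interleaves the two techniques: the network is trained against fake-quantized (low-precision) weights while an increasing fraction of the smallest-magnitude weights is masked to zero between fine-tuning rounds. The following is a minimal PyTorch sketch of that idea, not the authors' actual training setup; the 6-bit width, the 30% per-round pruning fraction, the layer sizes, and the random stand-in data are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune


class FakeQuantize(torch.autograd.Function):
    """Uniform symmetric fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        # Round to the integer grid, then map back to float ("fake" quantization).
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # STE: treat the rounding as the identity so gradients flow through.
        return grad_out, None


class QuantLinear(nn.Linear):
    """Linear layer that sees low-precision weights during training."""

    def forward(self, x):
        return F.linear(x, FakeQuantize.apply(self.weight, 6), self.bias)


# Assumed toy model and data shapes, purely for illustration.
model = nn.Sequential(QuantLinear(16, 64), nn.ReLU(), QuantLinear(64, 5))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Iterative magnitude pruning interleaved with quantization-aware training:
# each round fine-tunes the quantized network, then zeroes a further 30% of
# the surviving weights; the pruning masks persist across rounds.
for _ in range(4):
    for _ in range(100):  # placeholder fine-tuning loop on random data
        x, y = torch.randn(32, 16), torch.randint(0, 5, (32,))
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    for layer in model:
        if isinstance(layer, QuantLinear):
            prune.l1_unstructured(layer, name="weight", amount=0.3)
```

Because torch.nn.utils.prune reparameterizes weight as weight_orig * mask, repeated calls accumulate masks, which gives the iterative pruning schedule; a real experiment would swap the random batches for the target dataset and add the regularization and batch-normalization variations the abstract describes.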
