Authors: Yaman Umuroglu, Nhan Tran, Nicholas J. Fraser, Javier M. Duarte, Benjamin Hawks
DOI:
Keywords:
Abstract: Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower latency to higher data throughput to more efficient energy consumption. Two popular techniques for reducing computation in neural networks are pruning, the removal of insignificant synapses, and quantization, the reduction of the precision of calculations. In this work, we explore the interplay between pruning and quantization during training for ultra low latency applications targeting high energy physics use cases. However, the techniques developed in this study have potential across many other domains. We study various configurations of pruning during quantization-aware training, which we term \emph{quantization-aware pruning}, and the effect of techniques like regularization, batch normalization, and different pruning schemes on multiple computational or efficiency metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similarly to or better, in terms of computational efficiency, than standard architecture optimization techniques. While the accuracy on the benchmark may be similar, the information content of the network can vary significantly based on the training configuration.
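The core idea in the abstract, applying magnitude-based pruning while training with quantized weights, can be illustrated with a short sketch. The layer sizes, bit width, sparsity schedule, and training loop below are illustrative assumptions and do not reproduce the paper's actual models or tooling; this is a minimal PyTorch example of pruning during quantization-aware training using a straight-through estimator.

```python
# Minimal sketch of quantization-aware pruning: magnitude pruning applied
# during quantization-aware training. All sizes, the bit width, and the
# sparsity schedule are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn


class FakeQuant(torch.autograd.Function):
    """Uniform fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w, bits=6):
        scale = w.abs().max().clamp(min=1e-8) / (2 ** (bits - 1) - 1)
        return torch.round(w / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients through unchanged.
        return grad_output, None


class QATPrunedLinear(nn.Module):
    """Linear layer whose weights are masked (pruned) and fake-quantized."""

    def __init__(self, in_features, out_features, bits=6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.register_buffer("mask", torch.ones_like(self.weight))
        self.bits = bits

    def prune(self, sparsity):
        """Zero out the smallest-magnitude weights up to the target sparsity."""
        k = int(sparsity * self.weight.numel())
        if k > 0:
            threshold = self.weight.abs().flatten().kthvalue(k).values
            self.mask = (self.weight.abs() > threshold).float()

    def forward(self, x):
        w = FakeQuant.apply(self.weight * self.mask, self.bits)
        return nn.functional.linear(x, w, self.bias)


# Toy training loop on random data: gradually increase sparsity while
# training with quantized weights, so surviving weights adapt to low precision.
model = nn.Sequential(QATPrunedLinear(16, 32), nn.ReLU(), QATPrunedLinear(32, 5))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(300):
    x, y = torch.randn(64, 16), torch.randint(0, 5, (64,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:  # raise the pruning target every 50 steps
        for m in model:
            if isinstance(m, QATPrunedLinear):
                m.prune(sparsity=min(0.8, step / 300))
```

The sketch interleaves the two techniques rather than applying them sequentially: pruning decisions are made on weights that are already being trained under quantization, which is the interplay the abstract refers to.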