Authors: Dan Alistarh, Milan Vojnovic, Jerry Z. Li, Ryota Tomioka, Demjan Grubic
DOI:
Keywords:
Abstract: Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always guarantee convergence, and it is not clear whether they can be improved. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes for gradient updates which provides convergence guarantees. QSGD allows the user to smoothly trade off \emph{communication bandwidth} and \emph{convergence time}: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For example, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8x faster than the full-precision variant.
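The abstract describes quantizing gradients so that each node sends only a few bits per coordinate, trading bandwidth for variance. Below is a minimal NumPy sketch of that stochastic-quantization idea: each coordinate is mapped to a small set of levels relative to the vector's norm and rounded up or down at random so the estimate stays unbiased. The function names and interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize_gradient(v, num_levels, rng=None):
    """Stochastically quantize gradient vector v to num_levels magnitude levels.

    Illustrative sketch: magnitudes are normalized by the vector's 2-norm,
    scaled to a grid of num_levels levels, and rounded up with probability
    equal to the fractional part, which keeps the estimate unbiased.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v), 0.0
    scaled = np.abs(v) / norm * num_levels       # magnitudes on the level grid
    lower = np.floor(scaled)
    prob_up = scaled - lower                     # round up w.p. fractional part
    levels = lower + (rng.random(v.shape) < prob_up)
    return np.sign(v) * levels, norm             # small integers + one float

def dequantize_gradient(signed_levels, norm, num_levels):
    """Reconstruct the unbiased gradient estimate from quantized levels."""
    return norm * signed_levels / num_levels

# Usage: more levels (more bits per coordinate) means lower variance.
g = np.random.randn(1000)
q, norm = quantize_gradient(g, num_levels=4)
g_hat = dequantize_gradient(q, norm, num_levels=4)
```

The transmitted message is just the norm plus small signed integers per coordinate, which is where the bandwidth saving comes from; fewer levels mean fewer bits but a noisier (higher-variance) gradient estimate, matching the trade-off stated in the abstract.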