8-bit Transformer Inference and Fine-tuning for Edge Accelerators

Authors: Jeffrey Yu, Kartik Prabhu, Yonatan Urman, Robert M Radway, Eric Han

Abstract: Transformer models achieve state-of-the-art accuracy on natural language processing (NLP) and vision tasks, but demand significant computation and memory resources, which makes it difficult to perform inference and training (fine-tuning) on edge accelerators. Quantization to lower-precision data types is a promising way to reduce computation and memory resources. Prior work has employed 8-bit integer (int8) quantization for Transformer inference, but int8 lacks the precision and range required for training. 8-bit floating-point (FP8) quantization has been used for Transformer training, but prior work only quantizes the inputs to matrix multiplications and leaves the rest of the operations in high precision. This work conducts an in-depth analysis of Transformer inference and fine-tuning at the edge using two 8-bit floating-point data types: FP8 and 8-bit posit (Posit8). Unlike FP8, posit has variable-length exponent …
