Authors: Byung-Gon Chun, Yunseong Lee, Soojeong Kim, Gyeong-In Yu, Joo Seong Jeong
DOI:
Keywords: Machine learning, Titan (supercomputer), Inference, Computer science, Speedup, Transfer of learning, Server, Multi model inference, Artificial intelligence, Set (abstract data type)
Abstract: Standardized DNN models that have been proved to perform well on machine learning tasks are widely used and often adopted as-is to solve downstream tasks, forming the transfer learning paradigm. However, when serving multiple instances of such models from a cluster of GPU servers, existing techniques to improve GPU utilization, such as batching, are inapplicable because the models do not share weights due to fine-tuning. We propose NetFuse, a technique for merging multiple DNN models that share the same architecture but have different weights and different inputs. NetFuse is made possible by replacing operations with more general counterparts that allow a set of weights to be associated with only certain inputs. Experiments on ResNet-50, ResNeXt-50, BERT, and XLNet show that NetFuse can speed up inference time by up to 3.6x on an NVIDIA V100 GPU and 3.0x on a TITAN Xp GPU with 32 model instances, while using only a small additional amount of memory.
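The core idea of merging same-architecture models with different weights can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration for a single linear layer, not the paper's actual implementation: instead of running one matrix multiply per fine-tuned model instance, the per-instance weights are stacked and evaluated with one batched operation, so that each set of weights is applied only to its own inputs.

```python
import numpy as np

# Hypothetical sketch of the fusion idea: N fine-tuned instances of the
# same linear layer, each with its own weights and its own input batch.
rng = np.random.default_rng(0)
n_models, batch, d_in, d_out = 4, 8, 16, 32

# Per-instance weights (fine-tuned copies of the same layer).
weights = [rng.standard_normal((d_in, d_out)) for _ in range(n_models)]
# Each instance serves its own input batch.
inputs = [rng.standard_normal((batch, d_in)) for _ in range(n_models)]

# Unfused baseline: one matmul per model instance (N kernel launches).
unfused = [x @ w for x, w in zip(inputs, weights)]

# Fused version: stack weights and inputs, then run a single batched
# matmul in which weight set n is associated only with input batch n.
W = np.stack(weights)             # (n_models, d_in, d_out)
X = np.stack(inputs)              # (n_models, batch, d_in)
fused = np.einsum("nbi,nio->nbo", X, W)

# The fused result matches the per-instance computation exactly.
assert all(np.allclose(fused[i], unfused[i]) for i in range(n_models))
```

On a GPU, the batched form amortizes launch overhead and keeps the device busy with one large kernel, which is the source of the speedups the abstract reports; the real system generalizes this replacement to other operation types (e.g. convolutions).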