VTA: An Open Hardware-Software Stack for Deep Learning.

Authors: Carlos Guestrin, Arvind Krishnamurthy, Tianqi Chen, Thierry Moreau, Luis Ceze

DOI:

Keywords:

Abstract: Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, a two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA that explicitly orchestrates concurrent compute and memory tasks and (2) a microcode-ISA which implements a wide variety of operators with single-cycle tensor-tensor operations. Next, we present a runtime system equipped with a JIT compiler for flexible code-generation and heterogeneous execution that enables effective use of the VTA architecture. VTA is integrated and open-sourced into Apache TVM, a state-of-the-art deep learning compilation stack that provides flexibility for diverse models and divergent hardware backends. We propose a flow that performs design space exploration to generate a customized hardware architecture and software operator library that can be leveraged by mainstream frameworks. We demonstrate our approach by deploying optimized deep learning models used for object classification and style transfer on edge-class FPGAs.
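To make the two-level ISA concrete, below is a minimal, self-contained Python sketch of the idea: a coarse-grained task ISA whose instructions carry explicit dependence tokens, so that load, compute, and store stages can overlap in decoupled access/execute fashion, plus a microcode kernel of single-cycle tensor-tensor operations run by the compute stage. All names here (TaskInsn, gemm_micro_kernel, the token fields) are illustrative assumptions for exposition, not VTA's actual instruction encoding or the TVM/VTA API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# Hypothetical model of a VTA-style two-level program: a task ISA that
# orchestrates concurrent memory/compute stages, and microcode kernels
# executed by the tensor core. Encodings are illustrative only.

class TaskOp(Enum):
    LOAD = auto()     # DRAM -> on-chip buffer
    COMPUTE = auto()  # run a micro-op kernel on the tensor core
    STORE = auto()    # on-chip buffer -> DRAM

@dataclass
class TaskInsn:
    op: TaskOp
    # Dependence tokens make producer/consumer ordering explicit, which is
    # what lets hardware overlap memory and compute tasks safely.
    pop_prev: bool = False   # wait on a token from the upstream stage
    push_next: bool = False  # signal a token to the downstream stage
    micro_kernel: list = field(default_factory=list)  # micro-ops for COMPUTE

def gemm_micro_kernel(n_macs: int) -> list:
    """Microcode: each micro-op stands for one single-cycle tensor-tensor op."""
    return [("GEMM", i) for i in range(n_macs)]

# A tiled matrix multiply as a task-level stream: the LOAD for tile i+1 may
# issue while tile i computes, because ordering is carried only by tokens.
program = []
for tile in range(4):
    program.append(TaskInsn(TaskOp.LOAD, push_next=True))
    program.append(TaskInsn(TaskOp.COMPUTE, pop_prev=True, push_next=True,
                            micro_kernel=gemm_micro_kernel(n_macs=16)))
    program.append(TaskInsn(TaskOp.STORE, pop_prev=True))

for insn in program:
    deps = ("pop " if insn.pop_prev else "") + ("push" if insn.push_next else "")
    print(f"{insn.op.name:8s} {deps.strip():8s} uops={len(insn.micro_kernel)}")
```

In the real stack, the analogous roles are played by VTA's hardware instruction queues and by the runtime's JIT, which lowers operators from Apache TVM into task- and microcode-level instructions for whatever parametrization of the architecture was chosen during design space exploration.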
