Meta-Learning with Warped Gradient Descent

Authors: Razvan Pascanu, Hujun Yin, Raia Hadsell, Francesco Visin, Andrei A. Rusu

DOI:

Keywords:

Abstract: Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Typically, previous works have approached this issue either by attempting to train a neural network that directly produces updates, or by attempting to learn better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing updates forgoes a useful inductive bias and can easily lead to non-converging behaviour. On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that cannot scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of the task-learner. Warp-layers are meta-learned without backpropagating through the task training process, in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems. We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual, and reinforcement learning.
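As a rough illustration of the interleaving and the two update loops described in the abstract, the following PyTorch-style sketch puts warp-layers between task-learner layers, adapts only the task layers in the inner loop, and meta-updates the warp layers at points along the adaptation trajectory without backpropagating through it. All names here (`WarpedNet`, `inner_step`, `outer_step`), the layer sizes, and the toy meta-objective are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of warp-layers, assuming a PyTorch-style setup.
# Names, sizes, and the meta-objective are illustrative assumptions only.
import torch
import torch.nn as nn


class WarpedNet(nn.Module):
    def __init__(self, dim=32, n_classes=5):
        super().__init__()
        # Task-learner layers: adapted in the inner loop for each task.
        self.task_layers = nn.ModuleList(
            [nn.Linear(dim, dim), nn.Linear(dim, n_classes)])
        # Warp-layers: non-linear layers interleaved between task layers,
        # frozen during task adaptation and updated only by the meta-step.
        self.warp_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()), nn.Identity()])

    def forward(self, x):
        for task_layer, warp_layer in zip(self.task_layers, self.warp_layers):
            x = warp_layer(task_layer(x))
        return x


def inner_step(model, x, y, lr=0.1):
    """One step of task adaptation: plain gradient descent on the task-layer
    parameters only; gradients flow through the frozen warp-layers, which is
    what preconditions the update."""
    loss = nn.functional.cross_entropy(model(x), y)
    task_params = list(model.task_layers.parameters())
    grads = torch.autograd.grad(loss, task_params)
    with torch.no_grad():
        for p, g in zip(task_params, grads):
            p -= lr * g


def outer_step(model, meta_opt, x, y):
    """Meta-update of the warp-layers, taken at the current task parameters.
    No backpropagation through the inner-loop trajectory is required."""
    meta_opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    meta_opt.step()  # meta_opt only holds warp-layer parameters


model = WarpedNet()
meta_opt = torch.optim.SGD(model.warp_layers.parameters(), lr=0.01)
x, y = torch.randn(8, 32), torch.randint(0, 5, (8,))
for _ in range(3):                       # adapt to a task...
    inner_step(model, x, y)
    outer_step(model, meta_opt, x, y)    # ...meta-learn warps along the way
```

In this sketch the meta-gradient is evaluated at states visited during adaptation rather than by differentiating through the adaptation process itself, which is what keeps the memory cost independent of the number of inner-loop steps.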
