Penalizing side effects using stepwise relative reachability

Authors: Shane Legg, Miljan Martic, Victoria Krakovna, Laurent Orseau, Ramana Kumar

Abstract: How can we design safe reinforcement learning agents that avoid unnecessary disruptions to their environment? We show that current approaches to penalizing side effects can introduce bad incentives, e.g. to prevent any irreversible changes in the environment, including the actions of other agents. To isolate the source of such undesirable incentives, we break down side effects penalties into two components: a baseline state and a measure of deviation from this baseline state. We argue that some of these incentives arise from the choice of baseline, and others from the choice of deviation measure. We introduce a new variant of the stepwise inaction baseline and a new deviation measure based on relative reachability of states. The combination of these design choices avoids the given undesirable incentives, while simpler baselines and the unreachability measure fail. We demonstrate this empirically by comparing different combinations of baseline and deviation measure choices on a set of gridworld experiments designed to illustrate possible bad incentives.
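The abstract describes a side-effect penalty built from two components: a baseline state (a variant of the stepwise inaction baseline) and a deviation measure (relative reachability of states). The Python sketch below is a minimal illustration of how these two pieces can combine in a small deterministic environment; it is not the authors' implementation, and the `transition` function, `noop` action, and the `beta`/`gamma` parameters are assumptions made for the example.

```python
# Minimal sketch of a relative-reachability side-effect penalty with a
# stepwise inaction baseline, for a small *deterministic* environment given
# as a transition function. Illustrative only, not the paper's code.

from collections import deque

def reachability(transition, states, actions, source, gamma=0.95):
    """Discounted reachability R(source, s) = gamma ** (shortest number of
    steps from `source` to s), or 0 if s is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        s = queue.popleft()
        for a in actions:
            nxt = transition(s, a)
            if nxt not in dist:
                dist[nxt] = dist[s] + 1
                queue.append(nxt)
    return {s: (gamma ** dist[s] if s in dist else 0.0) for s in states}

def relative_reachability_penalty(transition, states, actions, state, baseline, gamma=0.95):
    """Average truncated reduction in reachability of all states relative to
    the baseline: (1/|S|) * sum_s max(0, R(baseline, s) - R(state, s))."""
    r_state = reachability(transition, states, actions, state, gamma)
    r_base = reachability(transition, states, actions, baseline, gamma)
    return sum(max(0.0, r_base[s] - r_state[s]) for s in states) / len(states)

def shaped_reward(task_reward, transition, states, actions, prev_state, action,
                  noop, beta=1.0, gamma=0.95):
    """Task reward minus beta times the deviation of the reached state from
    the stepwise inaction baseline (the state a no-op would have led to)."""
    reached = transition(prev_state, action)
    baseline = transition(prev_state, noop)   # stepwise inaction baseline
    penalty = relative_reachability_penalty(transition, states, actions,
                                            reached, baseline, gamma)
    return task_reward - beta * penalty
```

Only reductions in reachability relative to the baseline are penalized (the `max(0, ...)` truncation), and the measure continues to distinguish between states even after some irreversible change has occurred, which is what the paper argues a plain unreachability measure fails to do.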
