Authors: Fernando Fernández, Manuela M. Veloso
DOI:
Keywords: Policy analysis, Computer science, Probabilistic logic, Reuse, Set (psychology), Machine learning, Metric (unit), Data science, Process (engineering), Artificial intelligence, Markov process, Ranking
Abstract: A past policy provides a bias to guide exploration of the environment and speed up the learning of a new action policy. The success of this bias depends on whether the past policy is similar to the new one. In this report, we describe a new algorithm, PRQ-Learning, that reuses a set of past policies to bias the learning of a new one. The past policies are ranked by a similarity metric that estimates how useful it is to reuse each of them. This ranking provides a probabilistic bias for the exploration in the new learning process. Several experiments demonstrate that PRQ-Learning finds a balance between exploitation of the ongoing learned policy, exploration with random actions, and exploration toward the past policies.
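To make the abstract's description concrete, below is a minimal Python sketch of the policy-reuse loop it outlines: past policies are selected via a softmax over their estimated reuse gains (the probabilistic ranking), and within an episode the agent mixes following the chosen past policy, random exploration, and greedy exploitation of the Q-table being learned. The `env` interface (`actions`, `reset()`, `step()`), the helper names, and the specific parameter values are assumptions for illustration; this is not the authors' reference implementation.

```python
import math
import random
from collections import defaultdict

def prq_learning(env, past_policies, episodes=1000, alpha=0.05, gamma=0.95,
                 tau=0.0, delta_tau=0.05, psi=1.0, upsilon=0.95, epsilon=0.1,
                 max_steps=100):
    """Sketch of PRQ-Learning-style policy reuse (assumed interface:
    env.actions is a list, env.reset() -> state, env.step(a) ->
    (next_state, reward, done); past_policies maps state -> action)."""
    Q = defaultdict(float)                       # Q[(state, action)] for the new task
    # W[j]: running average episode return when reusing policy j (the
    # "reuse gain" that ranks the past policies); the last slot stands
    # for the greedy policy on the current Q-table.
    W = [0.0] * (len(past_policies) + 1)
    U = [0] * (len(past_policies) + 1)           # times each policy was chosen

    def greedy(state):
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        # Probabilistic bias: softmax over reuse gains picks which policy
        # guides exploration this episode.
        weights = [math.exp(tau * w) for w in W]
        j = random.choices(range(len(W)), weights=weights)[0]

        state, ret, discount, psi_e = env.reset(), 0.0, 1.0, psi
        for _ in range(max_steps):
            if j < len(past_policies) and random.random() < psi_e:
                action = past_policies[j](state)     # explore toward past policy
            elif random.random() < epsilon:
                action = random.choice(env.actions)  # random exploration
            else:
                action = greedy(state)               # exploit ongoing learned policy
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            ret += discount * reward
            discount *= gamma
            psi_e *= upsilon                         # decay reuse probability
            state = next_state
            if done:
                break

        # Update the chosen policy's reuse gain with the episode return and
        # sharpen the softmax, shifting from exploration to exploitation.
        U[j] += 1
        W[j] += (ret - W[j]) / U[j]
        tau += delta_tau
    return Q
```

The key design point the abstract emphasizes shows up in the two decay schedules: `psi_e` fades reliance on the past policy within an episode, while the growing temperature `tau` concentrates the softmax on whichever policies have actually paid off, yielding the stated balance among exploitation, random exploration, and exploration toward past policies.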