Abstract: In environments with uncertain dynamics, exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption. The essence of that assumption is that any state is eventually reachable from any other state by following a suitable policy. This assumption allows an algorithm to operate by simply favoring states that have rarely been visited before. For most physical systems this is impractical, as the system would break before any reasonable exploration has taken place, i.e., most physical systems don't satisfy the ergodicity assumption. In this paper we address the need for safe exploration methods in Markov decision processes. We first propose a general formulation of safety through ergodicity. We show that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard. We then present an efficient algorithm for safe, but potentially suboptimal, exploration. At its core is an optimization formulation in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus. Our experiments, which include a Martian terrain exploration problem, show that our method is able to explore better than classical exploration methods.
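As an illustrative sketch of the kind of constrained optimization the abstract refers to (the symbols below, the exploration bonus $r_{\mathrm{explore}}$, the safety level $\delta$, and the start state $s_0$, are assumptions introduced here for exposition, not the paper's exact notation): the objective rewards an exploration bonus, while the constraint keeps the policy inside a set of guaranteed safe policies, e.g. policies that can return to the start state with high probability:
\[
\max_{\pi \in \Pi_{\mathrm{safe}}} \ \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{T} r_{\mathrm{explore}}(s_t, a_t)\Big]
\qquad \text{with} \qquad
\Pi_{\mathrm{safe}} = \Big\{\pi : \Pr_{\pi}\big(\text{return to } s_0\big) \ge \delta \Big\}.
\]
Since, as the abstract notes, optimizing over the full set of guaranteed safe policies is NP-hard, the constraint in the proposed algorithm is understood to restrict attention to an efficiently computable subset of that set, at the cost of potential suboptimality.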