Abstract: Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic (i.e., conducted in unconnected episodes of activity that often end in goal or failure states) or continuing (i.e., indefinitely ongoing). Another point of difference is whether the agent highly discounts the value of future reward, a myopic agent, or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation [7]. This study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic.

In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the task. In the first experiment, we show that converting simple episodic tasks to be non-episodic (i.e., continuing) resolves some theoretical issues that arise when human reward is generally positive and, relatedly, enables successful training with non-myopic valuation in multiple user studies. The primary algorithm of this paper, which we call "VI-TAMER", is the first to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates teaching the agent a higher-level understanding of the task. Anticipating the complexity of real-world problems, in two subsequent experiments, one with a failure state added, we compare (1) learning when states are updated asynchronously with local bias, i.e., states quickly reachable from the agent's current state are updated more often than other states, to (2) learning with the fully synchronous sweeps across each state of the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct challenge for future work.
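To make the contrast in the final experiments concrete, below is a minimal, illustrative sketch of the two update schemes: synchronous value-iteration sweeps over every state, as in the VI-TAMER comparison, versus asynchronous backups locally biased toward the agent's current state. The grid task, discount factor, reward values, and all names here are hypothetical stand-ins, not the paper's implementation; in particular, `r_hat` substitutes a fixed, mostly positive function for the learned model of human reward that the actual algorithms plan over.

```python
GAMMA = 0.99                  # non-myopic: future reward is valued appreciably
SIZE = 5                      # hypothetical 5x5 grid task
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GOAL = (SIZE - 1, SIZE - 1)

def step(s, a):
    """Deterministic grid transition; moves off the grid leave the state unchanged."""
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else s

def r_hat(s, a):
    """Stand-in for a learned model of human reward: mostly positive
    (human trainers tend to give mostly positive reward), larger at the goal."""
    return 1.0 if step(s, a) == GOAL else 0.1

def bellman_backup(V, s):
    """One Bellman backup of state s under the modeled human reward."""
    return max(r_hat(s, a) + GAMMA * V[step(s, a)] for a in ACTIONS)

def synchronous_vi(n_sweeps=500):
    """Fully synchronous sweeps across every state (VI-TAMER-style planning)."""
    V = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    for _ in range(n_sweeps):
        V = {s: bellman_backup(V, s) for s in V}
    return V

def locally_biased_vi(agent_state, radius=1, n_updates=500):
    """Asynchronous backups restricted to states quickly reachable from the
    agent's current state; distant states are updated rarely or never."""
    V = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    near = [s for s in V
            if abs(s[0] - agent_state[0]) + abs(s[1] - agent_state[1]) <= radius]
    for _ in range(n_updates):
        for s in near:
            V[s] = bellman_backup(V, s)
    return V

if __name__ == "__main__":
    V_sync = synchronous_vi()
    V_local = locally_biased_vi(agent_state=(0, 0))
    # Under local bias, the distant goal's value is never backed up,
    # while mostly positive reward still inflates values near the agent.
    print(V_sync[GOAL], V_local[GOAL])
```

Under the locally biased scheme, the mostly positive `r_hat` inflates values near the agent faster than elsewhere, which is one way the general positivity of human reward can distort behavior even in a continuing task, consistent with the problem the final experiments reveal.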