Authors: Shivam Garg, Samuele Tosatto, Yangchen Pan, Martha White, A Rupam Mahmood
DOI:
Keywords:
Abstract: Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability …
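
As a rough illustration of the saturation problem the abstract describes, the sketch below computes the exact expected-reward gradient with respect to the logits of a softmax policy in a three-armed bandit. The bandit, its rewards, and the logit values are hypothetical choices made for this example, and the code is not the authors' method or estimator. It only shows that when the policy concentrates its probability on a sub-optimal arm, the gradient with respect to every logit becomes very small.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_policy_gradient(logits, rewards):
    """Exact gradient of expected reward J w.r.t. softmax logits in a bandit.

    For a softmax policy, dJ/dtheta_a = pi_a * (r_a - J).
    """
    probs = softmax(logits)
    value = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - value) for p, r in zip(probs, rewards)]

# Hypothetical setup: arm 1 is the only rewarding arm.
rewards = [0.0, 1.0, 0.0]
uniform_logits = [0.0, 0.0, 0.0]
# Assumption for illustration: probability mass concentrated on sub-optimal arm 0.
saturated_logits = [10.0, 0.0, 0.0]

grad_uniform = expected_policy_gradient(uniform_logits, rewards)
grad_saturated = expected_policy_gradient(saturated_logits, rewards)

print("uniform policy gradient:  ", grad_uniform)
print("saturated policy gradient:", grad_saturated)
```

Under these assumptions, the largest gradient component for the saturated policy is several orders of magnitude smaller than for the uniform policy, so gradient ascent on the logits moves away from the sub-optimal arm only very slowly.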