Authors: Jacob Eisenstein, Jonathan Berant, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami
DOI:
Keywords:
Abstract: Reward models play a key role in aligning language model applications towards human preferences. However, this setup can create a dynamic in which the policy model has an incentive to exploit errors in the reward model to achieve high reward. The success of reward-based alignment therefore depends on the ability of reward models to transfer to the new distributions created by the aligned policy model. We show that reward models are underspecified, in the sense that models that perform similarly in-distribution can yield very different rewards on policy model outputs. These differences propagate to the aligned policies, which we show to be heavily influenced by the random seed used during pretraining of the reward model. We show that even a simple alignment strategy, best-of-n reranking, creates a semi-adversarial dynamic between the policy and reward models, promoting outputs on which the reward models are more likely to disagree. Finally, we show that a simple ensembling strategy can help to address this issue.
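To make the two mechanisms named in the abstract concrete, the following is a minimal sketch of best-of-n reranking with an ensemble of reward models. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the ensemble is aggregated, so the `aggregate` parameter (mean or min) is an assumption, and `rm_a`, `rm_b`, and the candidate strings are hypothetical toy stand-ins for trained reward models and sampled policy outputs.

```python
from statistics import mean
from typing import Callable, Sequence

# A reward model is represented here as any function from text to a score.
RewardFn = Callable[[str], float]


def ensemble_reward(
    candidate: str,
    reward_fns: Sequence[RewardFn],
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Score one candidate with every reward model and aggregate the scores.

    The aggregation rule is an assumption of this sketch (mean by default).
    """
    return aggregate([fn(candidate) for fn in reward_fns])


def best_of_n(
    candidates: Sequence[str],
    reward_fns: Sequence[RewardFn],
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> str:
    """Best-of-n reranking: return the candidate with the highest reward."""
    return max(candidates, key=lambda c: ensemble_reward(c, reward_fns, aggregate))


# Hypothetical toy reward models, standing in for trained ones.
def rm_a(text: str) -> float:
    # Has an exploitable error: longer output always scores higher.
    return float(len(text))


def rm_b(text: str) -> float:
    # Prefers answers of about three words.
    return 10.0 - abs(len(text.split()) - 3)


# Hypothetical samples from a policy model (n = 3).
candidates = [
    "ok",
    "a short answer",
    "a very long answer that pads its length",
]

single_choice = best_of_n(candidates, [rm_a])
ensemble_choice = best_of_n(candidates, [rm_a, rm_b], aggregate=min)
```

In this toy setup, reranking against `rm_a` alone selects the padded candidate, because the selection step rewards exactly the outputs that exploit that model's error. Aggregating with `min` over both models selects the three-word answer instead, since a candidate must score well under every ensemble member. With the default `mean`, the padded candidate still wins here, which is a reminder that the choice of aggregation rule matters.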