Introduction to Reinforcement Learning with Human Feedback (RLHF): A Theoretically Biased Overview

Event

Weekly OptiML Lab Group Meeting

Short summary

In this talk, I will first (somewhat rigorously) introduce the framework of reinforcement learning with human feedback (RLHF). I will then go over three recent breakthroughs in the analysis and improvement of RLHF. Zhu et al. (ICML'23) study the sample complexity of RLHF with pairwise and K-wise comparisons. Rafailov et al. (NeurIPS'23) propose a new algorithm, Direct Preference Optimization (DPO), which has several advantages over “classical” RLHF. Lastly, Azar et al. (2023) propose Ψ-Preference Optimization (ΨPO), a strict generalization of RLHF and DPO that also addresses DPO's overfitting problem. For all three papers, I will focus mainly on the theoretical contributions, while trying to be as gentle and concise as possible. I will conclude with several future directions, mostly theoretical, which align well with our professor's research interests.
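As a rough preview (using notation close to that of the cited papers, where $\pi_{\mathrm{ref}}$ is the reference policy, $\rho$ the prompt distribution, $\sigma$ the logistic function, and $\beta, \tau > 0$ regularization strengths), the three objectives I will discuss are:

\[
\text{RLHF:}\quad \max_{\pi}\; \mathbb{E}_{x\sim\rho,\, y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),
\]
\[
\text{DPO:}\quad \min_{\theta}\; -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
\]
\[
\Psi\text{PO:}\quad \max_{\pi}\; \mathbb{E}_{x\sim\rho,\, y\sim\pi(\cdot\mid x),\, y'\sim\mu(\cdot\mid x)}\big[\Psi\big(p^{*}(y\succ y'\mid x)\big)\big] \;-\; \tau\,\mathrm{KL}\big(\pi\,\|\,\pi_{\mathrm{ref}}\big),
\]

where $r$ is a reward model fit to preference data (e.g., under a Bradley–Terry model), $(y_w, y_l)$ is a preferred/dispreferred completion pair, $\mu$ is a fixed behavior policy, and $p^{*}(y\succ y'\mid x)$ is the true preference probability. Taking $\Psi(q)=\log\frac{q}{1-q}$ recovers the RLHF/DPO objective, while the identity map yields the IPO objective analyzed by Azar et al.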

  • Banghua Zhu, Michael I. Jordan, and Jiantao Jiao. “Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons.” In ICML 2023.
  • Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” In NeurIPS 2023 (oral).
  • Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. “A General Theoretical Paradigm to Understand Learning from Human Preferences.” arXiv preprint arXiv:2310.12036, 2023.