Introduction to Reinforcement Learning with Human Feedback (RLHF): A Theoretically Biased Overview

Event

Weekly OptiML Lab Group Meeting

Short summary

In this talk, I will first (somewhat rigorously) introduce the framework of reinforcement learning with human feedback (RLHF). I will then go over three recent breakthroughs in the analysis and improvement of RLHF. Zhu et al. (ICML'23) study the sample complexity of RLHF with pairwise and K-wise comparisons. Rafailov et al. (NeurIPS'23) propose a new algorithm, Direct Preference Optimization (DPO), which has several advantages over “classical” RLHF. Lastly, Azar et al. (2023) propose Ψ-Preference Optimization (ΨPO), a strict generalization of RLHF and DPO that also addresses DPO's overfitting problem. For all three papers, I will focus mainly on the theoretical contributions, while trying to be as gentle and concise as possible. I will conclude with several future directions, mostly theoretical, which align well with our professor's research interests.
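As a rough preview (using notation close to that of the cited papers, where $\pi_{\mathrm{ref}}$ is the reference policy, $\rho$ the prompt distribution, $\sigma$ the logistic function, and $\beta, \tau > 0$ regularization strengths), the three objectives I will discuss are:

\[
\text{RLHF:}\quad \max_{\pi}\; \mathbb{E}_{x\sim\rho,\, y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),
\]
\[
\text{DPO:}\quad \min_{\theta}\; -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
\]
\[
\Psi\text{PO:}\quad \max_{\pi}\; \mathbb{E}_{x\sim\rho,\, y\sim\pi(\cdot\mid x),\, y'\sim\mu(\cdot\mid x)}\big[\Psi\big(p^{*}(y\succ y'\mid x)\big)\big] \;-\; \tau\,\mathrm{KL}\big(\pi\,\|\,\pi_{\mathrm{ref}}\big),
\]

where $r$ is a reward model fit to preference data (e.g., under a Bradley–Terry model), $(y_w, y_l)$ is a preferred/dispreferred completion pair, $\mu$ is a fixed behavior policy, and $p^{*}(y\succ y'\mid x)$ is the true preference probability. Taking $\Psi(q)=\log\frac{q}{1-q}$ recovers the RLHF/DPO objective, while the identity map yields the IPO objective analyzed by Azar et al.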

  • Banghua Zhu, Michael I. Jordan, and Jiantao Jiao. “Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons.” In ICML 2023.
  • Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” In NeurIPS 2023 (oral).
  • Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. “A General Theoretical Paradigm to Understand Learning from Human Preferences.” arXiv preprint arXiv:2310.12036, 2023.