Learning in Games

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

We consider the problem of *regularized* best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized …

Junghyun Lee