Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences
We consider the problem of *regularized* best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are used …