RLHF

Looking Through the Mirror: Minimax-Optimal Regularized Regrets in Online Learning and Bandits

We revisit regularized regret minimization under full-information and bandit feedback, where a learner optimizes an objective of the …

Junghyun Lee, Yujun Kim, Chulhee Yun, Se-Young Yun

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit …

Junghyun Lee, Minju Hong, Kwang-Sung Jun, Chulhee Yun, Se-Young Yun