RLHF

Looking Through the Mirror: Minimax-Optimal Regularized Regrets in Online Learning and Bandits

We revisit regularized regret minimization under full-information and bandit feedback, where a learner optimizes an objective of the form $\langle r, \pi \rangle - \eta^{-1} …

Junghyun Lee

• May 21, 2026 • 1 min read

RLHF

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

We consider the problem of *regularized* best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized …

Junghyun Lee

• Feb 22, 2026 • 1 min read
