Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

Feb 22, 2026·

Junghyun Lee

Minju Hong

Kwang-Sung Jun

Chulhee Yun

Se-Young Yun

· 0 min read

PDF

Abstract

We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized to robustify alignment, known polylogarithmic regret guarantees remain heavily specific to KL. To investigate whether such fast rates extend beyond KL, we adopt the Generalized Bilinear Preference Model (GBPM) - capturing intransitive preferences over $d$-dimensional item-wise features via a rank-$2r$ skew-symmetric matrix - to isolate the impact of generic regularization. Crucially, under GBPM, we prove that the dual gap of any greedy policy is bounded by the squared estimation error, derived using only strong convexity and skew-symmetry. Under a feature coverage assumption, we establish polylogarithmic $\tilde{\mathcal{O}}(\eta d^4 (\log T)^2 \wedge d^2 \sqrt{T})$ regret with Greedy Sampling and $\mathrm{poly}(d)$-free $\tilde{\mathcal{O}}(\sqrt{\eta r T} \wedge r^{1/3} T^{2/3})$ regret with Explore-Then-Commit, where $\eta^{-1}$ is the regularization coefficient and $T$ is the time horizon. This demonstrates that ``fast’’ regrets are not KL-specific, but rather a fundamental consequence of generic strongly convex geometry.

Type

Preprint

Publication

arXiv preprint arXiv:2602.23116

Last updated on Feb 22, 2026

RLHF Bandits Statistics Learning in Games

Authors

Junghyun Lee

PhD Candidate in AI

PhD candidate at KAIST AI, jointly advised by Se-Young Yun and Chulhee Yun. I work on interactive machine learning, theoretical aspects of LLMs, learning/optimization theory, and statistical analysis of large networks.

← TESSAR: Geometry-Aware Active Regression via Dynamic Voronoi Tessellation Apr 23, 2026

A Jointly Efficient and Optimal Algorithm for Heteroskedastic Generalized Linear Bandits with Adversarial Corruptions Feb 12, 2026 →

No results found

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences