| Timeslot | Speaker | Schedule/Topic of the Talk |
|---|---|---|
| 09:30 - 10:20 | | Free Networking, Coffee Break |
| 10:20 - 10:30 | Junghyun Lee (PhD Student, KAIST AI) | Opening Remarks |
| Session #1 (Chair: Se-Young Yun) | | |
| 10:30 - 11:10 | Dabeen Lee (Assistant Professor, KAIST ISysE) | Reinforcement Learning for Infinite-Horizon Average-Reward MDPs with Multinomial Logistic Function Approximation (Slides) |
| 11:10 - 11:20 | | Short Break |
| 11:20 - 12:00 | Yeoneung Kim (Assistant Professor, SeoulTech AAI) | Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret (Slides) |
| 12:00 - 14:00 | | Lunchtime + Coffee + Free Discussions (Lunch: Faculty Cafeteria @ KAIST Seoul Campus, reservation confirmed) |
| Session #2 (Chair: Min-hwan Oh) | | |
| 14:00 - 14:40 | Kwang-Sung Jun (Assistant Professor, Univ. of Arizona CS) | Noise-Adaptive Confidence Sets for Linear Bandits (Slides) |
| 14:40 - 14:50 | | Short Break |
| 14:50 - 15:30 | Se-Young Yun (Associate Professor, KAIST AI) | Confidence Set Analysis in Preference Feedback (Slides) |
| 15:30 - 16:00 | | Long (Coffee) Break |
| Session #3 (Chair: Dabeen Lee) | | |
| 16:00 - 16:40 | Min-hwan Oh (Assistant Professor, SNU DS) | Lasso Bandit with Compatibility Condition on Optimal Arm (Slides) |
| 16:40 - 16:50 | | Short Break |
| 16:50 - 17:30 | Kyungjae Lee (Assistant Professor, CAU AI) | Optimal Algorithms for Multi-Armed Bandits with Heavy-Tailed Rewards (Slides) |
| 17:30 - 17:45 | Junghyun Lee (PhD Student, KAIST AI) | Closing Remarks |
We study model-based reinforcement learning with non-linear function approximation, where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model, and we consider the infinite-horizon average-reward setting. Our first algorithm, UCRL2-MNL, applies to the class of communicating MDPs and achieves an $\tilde{O}(dD\sqrt{T})$ regret, where $d$ is the dimension of the feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. The second algorithm, OVIFH-MNL, is computationally more efficient and applies to the more general class of weakly communicating MDPs, for which we show a regret guarantee of $\tilde{O}(d^{2/5} \mathrm{sp}(v^*) T^{4/5})$, where $\mathrm{sp}(v^*)$ is the span of the associated optimal bias function. We also prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs with diameter at most $D$. Furthermore, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation, where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting. This is based on joint work with Jaehyun Park (KAIST ISysE).
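For readers less familiar with MNL function approximation, here is a minimal sketch of what such a transition model looks like, assuming a feature map $\varphi(s,a,s')$ and an unknown parameter $\theta$; the names, shapes, and normalization are illustrative only, and the exact parametrization used in UCRL2-MNL and OVIFH-MNL is in the paper.

```python
import numpy as np

def mnl_transition_probs(theta, features):
    """Multinomial logistic (MNL) transition model: given the feature vectors
    phi(s, a, s') for each candidate next state s', the probability of moving
    to s' is proportional to exp(theta^T phi(s, a, s'))."""
    logits = features @ theta          # shape: (num_next_states,)
    logits -= logits.max()             # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy usage: d = 3 features, 4 candidate next states.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
phi = rng.normal(size=(4, 3))          # rows are phi(s, a, s') for each s'
print(mnl_transition_probs(theta, phi))  # entries sum to 1
```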
Dabeen Lee is an Assistant Professor in the Department of Industrial and Systems Engineering (ISE) at KAIST. He obtained his Ph.D. in the Algorithms, Combinatorics, and Optimization (ACO) program at the Tepper School of Business at Carnegie Mellon University. He received the IBS Young Scientist Fellowship (YSF) and won Second Place in the INFORMS Optimization Society Best Student Paper Prize Competition in 2019. His research interests are in designing algorithms and mathematical programming frameworks for broad areas of optimization and machine learning.
We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$.
Our method leverages Langevin dynamics with a meticulously designed preconditioner as well as a simple excitation mechanism.
We show that the excitation signal drives the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process.
Moreover, we identify nontrivial concentration properties of the approximate posteriors generated by our algorithm.
These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without the unrealistic, restrictive assumptions on parameter sets that are often imposed in the literature.
This is joint work with Gihun Kim (SNU ECE) and Insoon Yang (SNU ECE), and is based on the recent preprint of the same title.
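As a rough illustration of the sampling machinery, the sketch below shows one step of generic preconditioned (unadjusted) Langevin dynamics. The actual preconditioner, potential, and excitation design are contributions of the paper and are not reproduced here; all names and the toy target are illustrative assumptions.

```python
import numpy as np

def preconditioned_langevin_step(theta, grad_potential, V, step, rng):
    """One step of preconditioned (unadjusted) Langevin dynamics:
        theta <- theta - step * V^{-1} grad U(theta) + sqrt(2*step) * V^{-1/2} xi,
    with xi ~ N(0, I). V plays the role of the preconditioner; in the talk's
    setting, the excitation signal makes its minimum eigenvalue grow over time,
    which speeds up the approximate posterior sampling."""
    d = theta.shape[0]
    drift = -step * np.linalg.solve(V, grad_potential(theta))
    # V^{-1/2} via eigendecomposition (V is assumed symmetric positive definite).
    eigval, eigvec = np.linalg.eigh(V)
    V_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    noise = np.sqrt(2.0 * step) * (V_inv_sqrt @ rng.standard_normal(d))
    return theta + drift + noise

# Toy usage: sample around a standard Gaussian, i.e., potential U(x) = ||x||^2 / 2.
rng = np.random.default_rng(0)
theta = np.zeros(2)
V = np.eye(2)
for _ in range(100):
    theta = preconditioned_langevin_step(theta, lambda x: x, V, step=0.01, rng=rng)
```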
Yeoneung Kim is an Assistant Professor in the Department of Applied Artificial Intelligence at SeoulTech. He received a B.S. in Mathematics from POSTECH and a Ph.D. in Mathematics from the University of Wisconsin-Madison in 2011 and 2019, respectively. Before joining SeoulTech, he was a Researcher at the National Institute for Mathematical Sciences (2013-2016), a Staff Engineer at Samsung Electronics (2019-2021), a BK21 Postdoctoral Researcher at SNU (2021-2022), and an Assistant Professor in the Department of Financial Mathematics at Gachon University (2022-2023). His research focuses on optimal control and reinforcement learning.
Adapting to an a priori unknown noise level is an important but challenging problem in sequential decision-making,
as efficient exploration typically requires knowledge of the noise level, which is often only loosely specified.
We report significant progress in addressing this issue in linear bandits in two respects.
First, we propose a novel confidence set that is 'semi-adaptive' to the unknown sub-Gaussian parameter $\sigma^2_*$,
in the sense that its width with respect to the dimension $d$ scales with $\sigma^2_*$ rather than with the specified sub-Gaussian parameter $\sigma^2_0$, which can be much larger than $\sigma^2_*$.
In particular, when the true noise level is zero, this leads to removing a factor of $\sqrt{d}$ from the confidence width and the regret bound when used for linear bandits.
Second, for bounded rewards, we propose a novel variance-adaptive confidence set that has a much improved numerical performance upon prior art while removing unrealistic assumptions.
We then apply this confidence set to develop, as we claim, the first practical variance-adaptive linear bandit algorithm via an optimistic approach, which is enabled by our novel regret analysis technique.
Both of our confidence sets rely critically on `regret equality' from online learning.
Our empirical evaluation on Bayesian optimization tasks shows that our algorithms perform comparably to or better than existing methods.
This is joint work with Jungtaek Kim (Univ. of Pittsburgh) and is based on the recently accepted ICML 2024 paper of the same title.
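For context, the sketch below computes the radius of the standard (non-adaptive) elliptical confidence set for ridge regression under a specified sub-Gaussian parameter $\sigma_0$, in the style of Abbasi-Yadkori et al.; the talk's semi-adaptive and variance-adaptive sets are designed to avoid this dependence on a conservative $\sigma_0$. Function and parameter names here are illustrative, not the paper's.

```python
import numpy as np

def ridge_confidence_radius(X, sigma0, lam=1.0, delta=0.05, S=1.0):
    """Radius of the standard elliptical confidence set for ridge regression
    with sub-Gaussian noise parameter sigma0:
        sigma0 * sqrt(log(det(V) / lam^d) + 2 * log(1 / delta)) + sqrt(lam) * S,
    where V = X^T X + lam * I. Plugging in a conservative sigma0 much larger
    than the true sigma_* inflates this radius; the talk's 'semi-adaptive' set
    scales with sigma_*^2 instead."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    _, logdet = np.linalg.slogdet(V)
    log_term = logdet - d * np.log(lam) + 2.0 * np.log(1.0 / delta)
    return sigma0 * np.sqrt(log_term) + np.sqrt(lam) * S

# Toy usage: 50 observations in dimension 5; a larger sigma0 gives a wider set.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
print(ridge_confidence_radius(X, sigma0=1.0))
```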
Kwang-Sung Jun is an Assistant Professor in the Department of Computer Science at the University of Arizona. His research interest is interactive machine learning, especially theoretical aspects of bandit problems, online learning, and confidence bounds.
Learning from Human Feedback has recently garnered considerable interest, especially with the advent of ChatGPT, which aligns with human preferences through Reinforcement Learning from Human Feedback (RLHF).
Under the Bradley-Terry model, preference feedback is captured by a logistic bandit, a common formulation for modeling binary user responses such as click versus no-click in advertisement recommendation systems.
Previous research often neglects the dependence on $S$, the norm of the unknown parameter vector, which becomes a significant issue when $S$ is large.
We first introduce the Regret-to-Confidence Set Conversion (R2CS), which constructs a convex, likelihood-ratio-based confidence set solely from the existence of an online learning algorithm with a regret guarantee.
The resulting confidence sets yield optimal regret for logistic bandits.
This is joint work with Junghyun Lee (KAIST AI) and Kwang-Sung Jun (Univ. of Arizona CS), and is based on the following two papers:
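For intuition, the following sketch spells out the Bradley-Terry / logistic preference model and the kind of likelihood-ratio statistic that a confidence set such as R2CS thresholds. It is a schematic illustration with assumed names and a placeholder threshold, not the paper's construction.

```python
import numpy as np

def pref_prob(theta, x_a, x_b):
    """Bradley-Terry / logistic preference model: the probability that item a
    is preferred over item b is sigmoid(theta^T (x_a - x_b)). The norm of the
    unknown theta is the quantity S whose dependence the talk tightens."""
    return 1.0 / (1.0 + np.exp(-theta @ (x_a - x_b)))

def log_likelihood(theta, data):
    """Log-likelihood of preference data [(x_a, x_b, y), ...], with y = 1 if a won."""
    return sum(
        y * np.log(pref_prob(theta, xa, xb))
        + (1 - y) * np.log(1.0 - pref_prob(theta, xa, xb))
        for xa, xb, y in data
    )

def in_lr_confidence_set(theta, theta_hat, data, beta):
    """A likelihood-ratio-based confidence set keeps every theta whose
    log-likelihood is within beta of the maximizer theta_hat (schematic only;
    R2CS derives the threshold from an online-learning regret guarantee)."""
    return log_likelihood(theta_hat, data) - log_likelihood(theta, data) <= beta
```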
Se-Young Yun is an Associate Professor at the Graduate School of AI, KAIST, in Seoul, South Korea. He received his Ph.D. in Electrical Engineering from KAIST in 2012, where he also completed his B.S. in 2006, summa cum laude. Before joining KAIST, he was a researcher at KTH, Inria, Microsoft Research Cambridge, and Los Alamos National Lab. His research focuses on machine learning, particularly on the dynamics of evolving knowledge, enhancing reasoning in small models, and improving instruction robustness. He has been recognized with multiple awards, including an Outstanding Reviewer Award at NIPS 2016 and the Best Paper Award at ACM MobiHoc 2013.
We study a stochastic sparse linear bandit problem where only a sparse subset of context features affects the expected reward function, i.e., the unknown reward parameter has a sparse structure. In the existing Lasso bandit literature, the compatibility conditions, together with additional diversity conditions on the context features, are imposed to achieve regret bounds that depend only logarithmically on the ambient dimension $d$. In this paper, we demonstrate that, even without the additional diversity assumptions, the compatibility condition on the optimal arm alone is sufficient to derive a regret bound that depends logarithmically on $d$, and our assumption is strictly weaker than those used in the Lasso bandit literature under the single-parameter setting. To our knowledge, the proposed algorithm requires the weakest assumptions among Lasso bandit algorithms under the single-parameter setting that achieve $O(\mathrm{poly}\log(dT))$ regret. This is joint work with Harin Lee and Taehyun Hwang from SNU and is based on the recent preprint of the same title.
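As a schematic illustration of the setting, the sketch below shows one greedy round of a generic Lasso bandit using scikit-learn's `Lasso`. The regularization schedule and the compatibility-condition analysis that distinguish the proposed algorithm are not captured here; the function name and the fixed `alpha` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_bandit_round(history_X, history_r, arm_features, alpha=0.1):
    """One schematic (greedy) round of a Lasso bandit: fit a Lasso estimate of
    the sparse reward parameter on the observed (context, reward) history, then
    pull the arm with the highest estimated reward."""
    lasso = Lasso(alpha=alpha, fit_intercept=False)
    lasso.fit(history_X, history_r)            # sparse estimate of theta
    theta_hat = lasso.coef_
    chosen_arm = int(np.argmax(arm_features @ theta_hat))
    return chosen_arm, theta_hat
```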
Min-hwan Oh is an Assistant Professor in the Graduate School of Data Science at Seoul National University. His research interests are in sequential decision making under uncertainty including bandit algorithms and reinforcement learning, statistical machine learning, and optimization.
In this presentation, we will explore significant advancements in multi-armed bandits with heavy-tailed rewards. The first paper, "Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy-Tailed Rewards (NeurIPS 2020)," proposes an adaptive perturbation method combined with a p-robust estimator, offering both theoretical and empirical improvements over existing methods. The second paper, "Minimax Optimal Bandits for Heavy-Tail Rewards (TNNLS 2024)," introduces a robust estimator and a perturbation-based exploration strategy that achieve minimax optimal regret bounds without prior knowledge of reward distribution bounds. Recently, I extended these approaches to distributional settings, further enhancing their applicability and performance.
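For background, the sketch below implements the classical truncated empirical mean, a standard robust estimator when rewards only have a bounded $p$-th moment. The adaptive-perturbation and minimax-optimal estimators discussed in the talk are more refined; this is only an illustrative baseline with assumed names and constants.

```python
import numpy as np

def truncated_mean(rewards, p, u, delta=0.05):
    """Truncated empirical mean for heavy-tailed rewards with a bounded p-th raw
    moment E|X|^p <= u, 1 < p <= 2: the i-th sample is kept only if
    |X_i| <= (u * i / log(1/delta))^(1/p). Discarding extreme samples trades a
    small bias for much tighter concentration of the estimation error."""
    rewards = np.asarray(rewards, dtype=float)
    idx = np.arange(1, len(rewards) + 1)
    thresholds = (u * idx / np.log(1.0 / delta)) ** (1.0 / p)
    kept = np.where(np.abs(rewards) <= thresholds, rewards, 0.0)
    return kept.mean()

# Toy usage: heavy-tailed (Pareto) rewards with finite 1.4-th moment.
rng = np.random.default_rng(0)
samples = rng.pareto(1.5, size=1000)
print(truncated_mean(samples, p=1.4, u=10.0))
```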
Kyungjae Lee is an Assistant Professor in the Department of Artificial Intelligence at Chung-Ang University. His current research interests include multi-armed bandits, Bayesian optimization, reinforcement learning, and their applications.
Junghyun Lee [jh_lee00@kaist.ac.kr]
I am a PhD student at OSI Lab and OptiML Lab at KAIST AI with broad interests in mathematical and theoretical AI.
Some of my recent interests include obtaining tight statistical guarantees for interactive machine learning (RL, bandits), exploring fundamental concepts in statistics and probability theory, algorithmic fairness, deep learning theory from both optimization and statistical perspectives, and essentially any machine learning challenges necessitating mathematical analysis.
If you would like to talk about anything (research, the workshop, etc.), feel free to contact me! I am also planning to turn this into a series of workshops; if you are interested, please let me know!
Room 1114, Building 1, KAIST Hongneung Campus, 85 Hoegi-ro, Dongdaemun-gu, Seoul
Parking is free.