arXiv:2310.11550

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

Published on Oct 17, 2023
Authors:
Haolin Liu, Chen-Yu Wei, Julian Zimmert

Abstract

We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge of the transitions or access to simulators. We introduce two algorithms that achieve improved regret performance compared to existing approaches. The first algorithm, although computationally inefficient, ensures a regret of $\widetilde{\mathcal{O}}\left(\sqrt{K}\right)$, where $K$ is the number of episodes. This is the first result with the optimal $K$ dependence in the considered setting. The second algorithm, which is based on the policy optimization framework, guarantees a regret of $\widetilde{\mathcal{O}}\left(K^{3/4}\right)$ and is computationally efficient. Both of our results significantly improve over the state of the art: a computationally inefficient algorithm by Kong et al. [2023] with $\widetilde{\mathcal{O}}\left(K^{4/5} + \mathrm{poly}\left(1/\lambda_{\min}\right)\right)$ regret, for some problem-dependent constant $\lambda_{\min}$ that can be arbitrarily close to zero, and a computationally efficient algorithm by Sherman et al. [2023b] with $\widetilde{\mathcal{O}}\left(K^{6/7}\right)$ regret.
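
For quick comparison, the four bounds stated in the abstract side by side (a summary sketch; $\widetilde{\mathcal{O}}$ is assumed here to suppress polylogarithmic factors and dependence on quantities other than $K$, such as the feature dimension and horizon):

\[
\begin{aligned}
\text{Kong et al. [2023] (inefficient):}\quad & \widetilde{\mathcal{O}}\big(K^{4/5} + \mathrm{poly}(1/\lambda_{\min})\big) \\
\text{Sherman et al. [2023b] (efficient):}\quad & \widetilde{\mathcal{O}}\big(K^{6/7}\big) \\
\text{This paper, first algorithm (inefficient):}\quad & \widetilde{\mathcal{O}}\big(\sqrt{K}\big) \\
\text{This paper, second algorithm (efficient):}\quad & \widetilde{\mathcal{O}}\big(K^{3/4}\big)
\end{aligned}
\]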
