A Provably Efficient Sample Collection Strategy for Reinforcement Learning

A common assumption in reinforcement learning (RL) is access to a generative model (i.e., a simulator of the environment), which allows one to generate samples from any desired state-action pair. Nonetheless, in many settings a generative model may not be available, and an adaptive exploration strategy is needed to efficiently collect samples from an unknown environment by direct interaction. In this paper, we study the scenario where an algorithm based on the generative-model assumption defines the (possibly time-varying) number of samples b(s,a) required at each state-action pair (s,a), and an exploration strategy has to learn how to generate the b(s,a) samples as quickly as possible. Building on recent results for regret minimization in the stochastic shortest path (SSP) setting (Cohen et al., 2020; Tarbouriech et al., 2020), we derive an algorithm that requires Õ(BD + D^{3/2} S^2 A) time steps to collect the B = ∑_{s,a} b(s,a) desired samples, in any unknown and communicating MDP with S states, A actions, and diameter D. Leveraging the generality of our strategy, we readily apply it to a variety of existing settings (e.g., model estimation, pure exploration in MDPs) for which we obtain improved sample-complexity guarantees, and to a set of new problems such as best-state identification and sparse reward discovery.
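To make the problem setup concrete, here is a minimal sketch of the sample-collection task the abstract describes: a target budget b(s,a) is fixed for every state-action pair, and an exploration strategy must gather at least that many samples by direct interaction with an unknown communicating MDP. The toy ring MDP and the naive random-walk explorer below are hypothetical stand-ins for illustration only; they are not the paper's SSP-based algorithm and enjoy none of its Õ(BD + D^{3/2} S^2 A) guarantee.

```python
import random

S, A = 4, 2  # a small communicating MDP: a ring of 4 states, 2 actions

def step(state, action, rng):
    """Toy transition kernel (assumed for illustration): action 0 moves
    left, action 1 moves right, with a small chance of staying put."""
    if rng.random() < 0.1:
        return state
    return (state - 1) % S if action == 0 else (state + 1) % S

def collect(b, seed=0, max_steps=100_000):
    """Naive explorer: follow a uniformly random policy until the sample
    counts n(s,a) meet the requested targets b(s,a) for every pair."""
    rng = random.Random(seed)
    n = {(s, a): 0 for s in range(S) for a in range(A)}
    state, t = 0, 0
    while any(n[sa] < b[sa] for sa in n) and t < max_steps:
        action = rng.randrange(A)       # random exploration policy
        n[(state, action)] += 1         # one more sample of (s, a)
        state = step(state, action, rng)
        t += 1
    return n, t

# Request b(s, a) = 5 samples at every state-action pair, so B = 40.
targets = {(s, a): 5 for s in range(S) for a in range(A)}
counts, steps_used = collect(targets)
assert all(counts[sa] >= targets[sa] for sa in targets)
```

The quantity `steps_used` is exactly the cost measure the paper bounds: the number of interaction steps needed before every target b(s,a) is met. The paper's contribution is an exploration strategy that drives this quantity down to Õ(BD + D^{3/2} S^2 A), whereas a random walk like the one above can be far slower in MDPs with large diameter.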