• Minimax PAC Bounds on the Sample Complexity of Reinforcement Learning with a Generative Model

    More article details
    • Presentation date: 1392/07/24 (Solar Hijri)
    • Date posted on TPBin: 1392/07/24 (Solar Hijri)
    We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0, 1), only O(N log(N/δ) / ((1−γ)^3 ε^2)) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, on the order of O(N log(N/δ) / ((1−γ)^3 ε^2)) samples are required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ) / ((1−γ)^3 ε^2)) on the sample complexity of estimating the optimal action-value function with ε accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ, and 1/(1−γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on 1/(1−γ).
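    To make the generative-model setting concrete, here is a minimal Python sketch of model-based Q-value iteration: draw m transition samples per state-action pair from the generative model, build an empirical MDP, and run value iteration on it. This is an illustration under stated assumptions, not the paper's exact algorithm or analysis; the function names (model_based_qvi, sample_transition) and the deterministic-reward simplification are assumptions for the example.

    import numpy as np

    def model_based_qvi(sample_transition, rewards, n_states, n_actions,
                        gamma, m, n_iters):
        """Estimate Q* from m generative-model samples per (s, a) pair.

        sample_transition(s, a) -> next state drawn from P(.|s, a)
        rewards: (n_states, n_actions) array of known rewards
        (deterministic rewards assumed here for simplicity)
        """
        # Empirical transition model P_hat(s' | s, a) from m samples per
        # pair; the total sample budget is m * n_states * n_actions.
        P_hat = np.zeros((n_states, n_actions, n_states))
        for s in range(n_states):
            for a in range(n_actions):
                for _ in range(m):
                    P_hat[s, a, sample_transition(s, a)] += 1.0 / m

        # Standard value iteration on the empirical MDP.
        Q = np.zeros((n_states, n_actions))
        for _ in range(n_iters):
            V = Q.max(axis=1)                # greedy state values
            Q = rewards + gamma * P_hat @ V  # Bellman optimality update
        return Q

    # Example usage on a random 5-state, 2-action MDP (illustrative only):
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(5), size=(5, 2))  # true transition kernel
    R = rng.uniform(size=(5, 2))                # reward table
    Q_hat = model_based_qvi(
        lambda s, a: rng.choice(5, p=P[s, a]),  # the generative model
        R, n_states=5, n_actions=2, gamma=0.9, m=200, n_iters=200)

    Per the abstract's upper bound, taking m on the order of log(N/δ) / ((1−γ)^3 ε^2) samples per pair (N = n_states × n_actions, so N·m samples in total) suffices for an ε-optimal estimate of the action-value function w.p. 1−δ. The 1/(1−γ)^3 factor dominates in practice: at γ = 0.99 it already contributes a factor of 10^6, which is why the improved dependence on 1/(1−γ) matters.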
