• An Asymptotically Optimal Policy for Finite Support Models in the Multiarmed Bandit Problem

    More details about the paper
    • Presentation date: 1392/07/24
    • Publication date on TPBin: 1392/07/24
     In the multiarmed bandit problem, the dilemma between exploration and exploitation in reinforcement learning is expressed as a model of a gambler playing a slot machine with multiple arms. A policy chooses an arm in each round so as to minimize the number of times that arms with suboptimal expected rewards are pulled. We propose the minimum empirical divergence (MED) policy and derive an upper bound on the finite-time regret that meets the asymptotic bound for the case of finite support models. In a similar setting, Burnetas and Katehakis have already proposed an asymptotically optimal policy. However, we do not assume any knowledge of the support except for its upper and lower bounds. Furthermore, the criterion for choosing an arm, the minimum empirical divergence, can be computed easily by a convex optimization technique. We confirm by simulations that the MED policy performs well in finite time in comparison with other currently popular policies.
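
The abstract does not spell out the criterion, so the following is only a rough sketch of how a minimum-empirical-divergence rule might be computed. It rests on several assumptions that are not stated above: rewards are scaled to [0, 1]; the divergence is evaluated through a one-dimensional concave dual, max over 0 <= nu <= 1/(1 - mu) of the empirical mean of log(1 - (x - mu) * nu); and an arm is drawn at random with weight exp(-T_i * D_i), where T_i is the arm's pull count. The names `d_min` and `med_choose` are hypothetical.

```python
import math
import random


def d_min(samples, mu, iters=100):
    """Sketch of the empirical divergence of one arm from target mean mu.

    Assumes every reward lies in [0, 1] and uses a concave dual objective
    in a single variable nu, maximized here by ternary search.
    """
    mean = sum(samples) / len(samples)
    if mean >= mu:
        return 0.0  # the arm already looks at least as good as the target
    if mu >= 1.0:
        return math.inf  # only a point mass at 1 could reach mean 1

    def h(nu):
        return sum(math.log(1.0 - (x - mu) * nu) for x in samples) / len(samples)

    # Stay strictly inside the feasible interval so the log stays finite.
    lo, hi = 0.0, (1.0 - 1e-9) / (1.0 - mu)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if h(m1) < h(m2):
            lo = m1
        else:
            hi = m2
    return max(0.0, h((lo + hi) / 2.0))


def med_choose(history, rng=random):
    """Pick an arm index from per-arm reward lists (each non-empty).

    Assumed rule: weight each arm by exp(-pulls * divergence from the best
    empirical mean) and sample proportionally to the weights.
    """
    means = [sum(r) / len(r) for r in history]
    mu_star = max(means)
    weights = [math.exp(-len(r) * d_min(r, mu_star)) for r in history]
    return rng.choices(range(len(history)), weights=weights)[0]
```

For rewards taking only the values 0 and 1, this dual should agree with the Bernoulli Kullback-Leibler divergence, which gives a quick sanity check. The ternary search stands in for whatever convex optimization routine the paper actually uses.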
