Epsilon and learning rate decay in epsilon greedy q learning


执笔经年 2021-02-08 10:51

I understand that epsilon marks the trade-off between exploration and exploitation. At the beginning, you want epsilon to be high so that you take big leaps and learn things. As

2 answers

既然无缘 (original poster)

2021-02-08 11:31

Vishma Dias's answer already covers learning rate decay, so I would like to elaborate on the epsilon-greedy side. I think the question is implicitly about the decayed-epsilon-greedy method for balancing exploration and exploitation.

One way to balance exploration and exploitation while training an RL policy is the epsilon-greedy method. For example, $epsilon$ = 0.3 means that with probability 0.3 the action is sampled at random from the action space, and with probability 0.7 the action is chosen greedily as argmax(Q).
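As a minimal sketch of that selection rule (the function name `epsilon_greedy_action` and the plain-list representation of the Q-values are my own assumptions for illustration, not from any particular library):

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of the current state.

    q_values: list of floats, one entry per action (hypothetical layout).
    epsilon:  exploration probability in [0, 1].
    """
    if rng.random() < epsilon:
        # Explore: sample uniformly from the action space.
        return rng.randrange(len(q_values))
    # Exploit: greedy action, i.e. argmax(Q). Ties go to the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0.3, roughly 30% of calls take the random branch.
action = epsilon_greedy_action([0.1, 0.8, 0.3], epsilon=0.3)
```

Note that the random branch can also happen to return the greedy action, so the greedy action is picked somewhat more often than 1 - epsilon of the time.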

An improvement on the epsilon-greedy method is the decayed-epsilon-greedy method. Suppose we train a policy for N epochs/episodes in total (N is problem-specific). The algorithm initially sets $epsilon$ to a relatively high value (e.g., 0.6), then gradually decreases it to end at $epsilon$ = $pend$ (e.g., $pend$ = 0.1) over $nstep$ training epochs/episodes. In other words, early in training we give the model more freedom to explore with a high probability (e.g., 0.6), and then gradually decrease $epsilon$ at a rate r over training epochs/episodes with the following formula:

$rate$

Because the schedule ends at a small exploration probability $pend$, after $nstep$ the training process focuses more on exploitation (i.e., greedy actions), while it can still explore with a small probability once the policy has approximately converged.
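One common way to get a schedule with these properties is linear interpolation from the starting value down to $pend$ over $nstep$ episodes. The sketch below assumes that linear form, which may differ from the exact formula intended above; the names `decayed_epsilon`, `eps_start`, `eps_end` and `n_step` are hypothetical, and the defaults just reuse the example values 0.6 and 0.1.

```python
def decayed_epsilon(episode, eps_start=0.6, eps_end=0.1, n_step=1000):
    """Linearly decay epsilon from eps_start to eps_end over n_step episodes.

    This is an assumed linear schedule, not necessarily the answer's formula.
    After n_step episodes, epsilon stays at eps_end.
    """
    # Fraction of the decay window still remaining, clipped at 0.
    remaining = max(0.0, 1.0 - episode / n_step)
    return eps_end + (eps_start - eps_end) * remaining


# Epsilon at the start, halfway through, and past the decay window.
schedule = [decayed_epsilon(e) for e in (0, 500, 1000, 5000)]
```

The clipping with `max` is what keeps a small, constant amount of exploration after the decay window ends, instead of letting epsilon drop to zero or below.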

You can see the advantage of the decayed-epsilon-greedy method in this post.
