I had a small but potentially stupid question about Monte Carlo Tree Search. I understand most of it but have been looking at some implementations and noticed that after the MCT
Well, the reason may be the following.
Rollouts are truncated value estimates: contributions beyond the maximum rollout length are discarded.
Assume that the maximum rollout depth is N.
Consider an environment where the average reward per step is non-zero (let's say > 0).
After an action is taken and an observation is obtained, a child node of the tree can be selected as the new root.
Now the maximum length of the branches, and of the rollouts that participated in evaluating a node's value, is N-1, as the root node has been discarded.
However, the new simulations will obviously still have length N, and they will have to be combined with the old simulations of length N-1.
Because the average reward is non-zero, the longer simulations accumulate more reward on average, so their values are shifted relative to the shorter ones.
This means that nodes evaluated with mixed-length simulations will have a bias that depends on the ratio of simulations with different lengths.
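To make the mixing argument concrete, here is a minimal sketch. It assumes undiscounted returns and i.i.d. Gaussian per-step rewards, which is my simplification and not something any particular implementation does. The function names and parameters (`truncated_return`, `mixed_value_estimate`, `expected_mixed_value`, `noise`) are made up for illustration.

```python
import random


def truncated_return(horizon, mean_reward, noise, rng):
    """Undiscounted return of one rollout that is cut off after `horizon` steps."""
    return sum(rng.gauss(mean_reward, noise) for _ in range(horizon))


def mixed_value_estimate(n_old, n_new, N, mean_reward=1.0, noise=0.5, seed=0):
    """Average of n_old recycled rollouts (length N-1) and n_new fresh ones (length N).

    Assumes n_old + n_new > 0.
    """
    rng = random.Random(seed)
    returns = [truncated_return(N - 1, mean_reward, noise, rng) for _ in range(n_old)]
    returns += [truncated_return(N, mean_reward, noise, rng) for _ in range(n_new)]
    return sum(returns) / len(returns)


def expected_mixed_value(n_old, n_new, N, mean_reward=1.0):
    """Expected value of the mixed average under the same assumptions."""
    return mean_reward * (n_old * (N - 1) + n_new * N) / (n_old + n_new)


if __name__ == "__main__":
    # The same node gets a different value depending only on the old/new ratio.
    for n_old, n_new in [(100, 0), (50, 50), (0, 100)]:
        print(n_old, n_new, expected_mixed_value(n_old, n_new, N=10))
```

Under these assumptions the expected estimate is mean_reward * (n_old*(N-1) + n_new*N) / (n_old + n_new), so two sibling nodes with the same true value but a different share of recycled simulations would be ranked differently.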
Another reason why recycling old, shorter simulations is avoided is the bias induced on the sampling. Just imagine a T-maze where on the left, at depth d, there is a maximum reward of R/2, while on the right, at depth d+1, there is a maximum reward of R. All the paths to the left that reached the R/2 reward at depth d during the first step will be favoured during the second step when the tree is recycled, while paths to the right will be less common, so there is a higher chance of not reaching the reward R. Starting from an empty tree gives the same probability to both sides of the maze.
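The two options compared here, keeping the chosen child's subtree or starting from an empty tree, could be sketched roughly as follows. The `Node` class and the `advance_root` function with its `reuse_tree` flag are hypothetical names for illustration, not taken from any specific implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)  # action -> Node


def advance_root(root, action, reuse_tree):
    """Return the root for the next search after `action` has been played.

    With reuse_tree=True the chosen child keeps its old statistics, so the
    next search starts out biased towards whatever was sampled before.
    With reuse_tree=False (or if the child was never expanded) the search
    starts from an empty node with no prior preference.
    """
    child = root.children.get(action)
    if reuse_tree and child is not None:
        return child
    return Node()


# Toy tree: the "left" child was visited 7 times during the previous search.
root = Node(visits=7, children={"left": Node(visits=7, value_sum=3.5)})
recycled = advance_root(root, "left", reuse_tree=True)
fresh = advance_root(root, "left", reuse_tree=False)
```

In the recycled case the old visit counts keep steering selection towards the previously explored left branch, whereas the fresh node treats both sides equally.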
Alpha Go Zero (see Peter de Rivaz's answer) actually does not use rollouts but a value approximation (generated by a deep network). These values are not truncated estimates, so Alpha Go Zero is not affected by this branch-length bias.
Alpha Go, the predecessor of Alpha Go Zero, combined rollouts with the value approximation and also reused the tree, but the new version does not use rollouts, maybe for this reason. Also, neither Alpha Go Zero nor Alpha Go picks the action by its value; both use the number of times it was selected during search. This count may be less affected by the length bias, at least in the case where the average reward is negative.
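As a rough illustration of the difference between the two selection rules, here is a small sketch. The statistics layout (a dict with `visits` and `value_sum` per action) and the numbers are invented for the example, and every action is assumed to have at least one visit.

```python
def most_visited_action(stats):
    """Pick the action that was selected most often during search."""
    return max(stats, key=lambda a: stats[a]["visits"])


def highest_value_action(stats):
    """Pick the action with the highest mean value (assumes visits > 0)."""
    return max(stats, key=lambda a: stats[a]["value_sum"] / stats[a]["visits"])


# Toy root statistics: "b" has a higher mean value but only a handful of visits.
stats = {
    "a": {"visits": 90, "value_sum": 45.0},  # mean value 0.5
    "b": {"visits": 3, "value_sum": 2.4},    # mean value 0.8
}

print(most_visited_action(stats))   # selection by visit count
print(highest_value_action(stats))  # selection by mean value
```

The visit count depends on how often the search came back to an action rather than directly on the magnitude of the returns, which is why it may react less to the length bias.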
Hope this is clear.