Question
I have some trouble with my implementation of a deep neural network for the game Pong because my network always diverges, regardless of which parameters I change. I took a Pong game and implemented a Theano/Lasagne-based deep Q-learning algorithm based on the famous Nature paper by Google's DeepMind.
What I want:
Instead of feeding the network pixel data, I want to input the x- and y-position of the ball and the y-position of the paddle for 4 consecutive frames, which gives a total of 12 inputs.
I only want to reward the hit, the loss, and the win of a round.
With this configuration, the network did not converge and my agent was not able to play the game. Instead, the paddle drove directly to the top or bottom, or repeated the same pattern. So I thought I would try to make it a bit easier for the agent and add some information.
What I did:
States:
- x-position of the Ball (-1 to 1)
- y-position of the Ball (-1 to 1)
- normalized x-velocity of the Ball
- normalized y-velocity of the Ball
- y-position of the paddle (-1 to 1)
With 4 consecutive frames I get a total input of 20.
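To make the input format concrete, here is a minimal sketch of how such a stacked state vector could be assembled; the feature ordering and the deque-based frame history are my own assumptions, not taken from the linked implementation:

```python
from collections import deque

import numpy as np

N_FRAMES = 4  # number of consecutive frames whose features are stacked

# rolling history of per-frame feature vectors (5 features per frame)
frame_history = deque(maxlen=N_FRAMES)

def frame_features(ball_x, ball_y, ball_vx, ball_vy, paddle_y):
    """Feature vector for a single frame; all values roughly in [-1, 1]."""
    return np.array([ball_x, ball_y, ball_vx, ball_vy, paddle_y],
                    dtype=np.float32)

def stacked_state(ball_x, ball_y, ball_vx, ball_vy, paddle_y):
    """Append the newest frame and return the 4 * 5 = 20-dimensional input."""
    frame_history.append(frame_features(ball_x, ball_y, ball_vx, ball_vy, paddle_y))
    # pad with copies of the oldest frame until 4 frames have been seen
    while len(frame_history) < N_FRAMES:
        frame_history.appendleft(frame_history[0])
    return np.concatenate(list(frame_history))  # shape: (20,)
```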
Rewards:
- +10 if Paddle hits the Ball
- +100 if Agent wins the round
- -100 if Agent loses the round
- -5 to 0, depending on the distance between the predicted end position (y-position) of the ball and the current y-position of the paddle
- +20 if the predicted end position of the ball lies in the current range of the paddle (the hit is foreseeable)
- -5 if the ball lies behind the paddle (no hit possible anymore)
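Taken together, a reward function along these lines might look roughly as follows. This is only a sketch of the scheme described above; the argument names and the linear distance scaling are assumptions, not code from the actual implementation:

```python
def shaped_reward(hit_ball, won_round, lost_round,
                  predicted_ball_y, paddle_y, paddle_half_height,
                  ball_behind_paddle, max_distance=2.0):
    """Reward shaping as described above; inputs are assumed to be
    normalized game quantities (positions in [-1, 1])."""
    reward = 0.0
    if hit_ball:
        reward += 10.0
    if won_round:
        reward += 100.0
    if lost_round:
        reward -= 100.0

    # -5 .. 0 depending on how far the paddle is from the predicted impact point
    distance = abs(predicted_ball_y - paddle_y)
    reward -= 5.0 * min(distance / max_distance, 1.0)

    # +20 if the predicted impact point is already covered by the paddle
    if distance <= paddle_half_height:
        reward += 20.0

    # -5 if the ball has already passed the paddle and can no longer be hit
    if ball_behind_paddle:
        reward -= 5.0

    return reward
```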
With this configuration, the network still diverges. I tried playing around with the learning rate (0.1 to 0.00001), the number of nodes in the hidden layers (5 to 500), the number of hidden layers (1 to 4), the batch accumulator (sum or mean), and the update rule (RMSProp or DeepMind's variant of RMSProp).
None of these led to a satisfactory solution. The graph of the average loss mostly looks something like this.
You can download my current version of the implementation here
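For reference, this is roughly what such a Theano/Lasagne Q-network with an RMSProp update looks like; the layer sizes, number of actions, and hyperparameters are placeholders rather than values from the downloadable code:

```python
import theano
import theano.tensor as T
import lasagne

N_INPUTS, N_ACTIONS = 20, 3  # assumed: 20 state features, 3 paddle actions

states = T.matrix('states')     # (batch, 20) stacked state vectors
actions = T.ivector('actions')  # chosen action per sample
targets = T.vector('targets')   # Q-learning targets r + gamma * max_a' Q(s', a')

l_in = lasagne.layers.InputLayer((None, N_INPUTS), input_var=states)
l_hid = lasagne.layers.DenseLayer(l_in, num_units=50,
                                  nonlinearity=lasagne.nonlinearities.rectify)
l_out = lasagne.layers.DenseLayer(l_hid, num_units=N_ACTIONS,
                                  nonlinearity=None)  # one Q-value per action

q_vals = lasagne.layers.get_output(l_out)
q_chosen = q_vals[T.arange(actions.shape[0]), actions]
loss = T.mean(0.5 * (targets - q_chosen) ** 2)

params = lasagne.layers.get_all_params(l_out, trainable=True)
updates = lasagne.updates.rmsprop(loss, params, learning_rate=0.0002,
                                  rho=0.95, epsilon=1e-6)

train_fn = theano.function([states, actions, targets], loss, updates=updates)
q_fn = theano.function([states], q_vals)
```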
I would be very grateful for any hint :)
Koanashi
Answer 1:
Repeating my suggestion from the comments as an answer now, to make it easier to see for anyone else ending up on this page later (it was posted as a comment first since I was not 100% sure it would be the solution):
Reducing the magnitude of the rewards to lie in (or at least close to) the [0.0, 1.0] or [-1.0, 1.0] intervals helps the network to converge more quickly.
Changing the reward values in such a way (simply dividing them all by a number to make them lie in a smaller interval) does not change what a network is able to learn in theory. The network could also simply learn the same concepts with larger rewards by finding larger weights throughout the network.
However, learning such large weights typically takes much more time. The main reason is that weights are often initialized to random values close to 0, so it takes a lot of training to change those values into large values. Because the weights are (typically) initialized to small values and are therefore very far away from the optimal weight values, there is also an increased risk of a local (not global) minimum along the way to the optimal weight values, which the network can get stuck in.
With lower reward values, the optimal weight values are likely to be low in magnitude as well. This means that weights initialized to small random values are already more likely to be close to their optimal values. This leads to a shorter training time (less "distance" to travel to put it informally), and a decreased risk of there being local minima along the way to get stuck in.
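As a concrete illustration of the suggestion, one simple way to apply it to the rewards from the question is to divide every reward by the largest magnitude used (100), which preserves their relative proportions while keeping everything in [-1, 1]; the sketch below is mine, not code from the question:

```python
REWARD_SCALE = 100.0  # largest reward magnitude used in the question

def scale_reward(raw_reward):
    """Scale rewards into [-1, 1] while preserving their relative proportions."""
    return raw_reward / REWARD_SCALE

# e.g. +100 for a win becomes +1.0, +10 for a hit becomes +0.1,
# and the -5..0 distance penalty becomes -0.05..0.0
```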
Source: https://stackoverflow.com/questions/39371211/finding-the-right-parameters-for-neural-network-for-pong-game