Question
I programmed a very simple game which works the following way:
Given a 4x4 field of squares, a player can move (up, right, down or left).
Moving onto a square the agent has never visited before gives a reward of 1.
Stepping on a "dead field" is rewarded with -5 and the game is then reset.
Moving onto a field that was already visited is rewarded with -1.
Moving onto the "win field" (there is exactly one) gives a reward of 5 and the game is reset as well.
Now I want an AI to learn to play that game via Q-Learning.
How I organized the inputs / feature engineering:
An input for the net is an array of shape 1x4, where arr[0] represents the field above (when moving up), arr[1] the field to the right, arr[2] the one below, and arr[3] the one to the left.
Possible values the array can hold: 0, 1, 2, 3
0 = "dead field", so the worst case
1 = this would be outside of the 4x4 field (so you can't step there) or the field was already visited
2 = unvisited field (so that is something good)
3 = "win field", so the best-case
As you see, I ordered them by their reward.
Since the game takes its input the same way (0 = move up, 1 = move right, 2 = move down, 3 = move left), the only thing the AI would have to learn is basically: choose the array index that holds the greatest value.
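To make that concrete, here is a minimal sketch (my own illustration, not part of the game code below) of the hand-written rule the network would only need to reproduce on this encoding:
import numpy as np

# Illustrative only: on this encoding, the ideal policy is simply to move in
# the direction with the highest field value (3 = win > 2 = unvisited > 1 > 0).
def greedy_zug(zustand):
    # zustand: array [up, right, down, left] with values in {0, 1, 2, 3}
    return int(np.argmax(zustand))

print(greedy_zug(np.array([1, 3, 2, 0])))  # -> 1, i.e. move right (win field)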
But unfortunately it doesn't work; the net just doesn't learn, not even after 30,000 episodes.
Here's my code (including the game at the beginning):
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
from time import sleep
episoden = 0
felder = []
schon_besucht = []
playerx = 0
playery = 0
grafik = False
def gib_zustand():
    # special feature engineering:
    # the input consists of just one direction, which is one-hot encoded; so 4 input neurons
    # (ember/dead field, wall/visited, unvisited, win)
    #
    # it is the direction that is to be evaluated (so 1 output neuron for one direction)
    # return value here: array, shape: 4x4 (see above)
    global playerx
    global playery
    # up
    if playery == 0:
        oben = 1
    else:
        oben = felder[playery-1][playerx]
    # right
    if playerx == 4:
        rechts = 1
    else:
        rechts = felder[playery][playerx+1]
    # down
    if playery == 4:
        unten = 1
    else:
        unten = felder[playery+1][playerx]
    # left
    if playerx == 0:
        links = 1
    else:
        links = felder[playery][playerx-1]
    return np.array([oben, rechts, unten, links])
def grafisch():
    if grafik:
        # encoding:
        # ember = G, visited = b, unvisited = ' ', win = S, player = X
        global felder
        global playerx
        global playery
        print('')
        for y in range(0,5):
            print('|', end='')
            for x in range(0,5):
                if felder[y][x] == 0:
                    temp = 'G'
                if felder[y][x] == 1:
                    temp = 'b'
                if felder[y][x] == 2:
                    temp = ' '
                if felder[y][x] == 3:
                    temp = 'S'
                if y == playery and x == playerx:
                    temp = 'X'
                print(temp, end='')
                print('|', end='')
            print('')
def reset():
    print('--- RESET ---')
    global playery
    global playerx
    global felder
    global schon_besucht
    playerx = 1
    playery = 3
    # layout
    # ember = 0, wall/visited = 1, unvisited = 2, win = 3
    felder = [[2 for x in range(0,5)] for y in range(0,5)]
    # place two ember fields
    gl1 = random.randint(1,3)
    gl1_1 = random.randint(2,3) if gl1==3 else (random.randint(1,2) if gl1==1 else random.randint(1,3))
    felder[gl1][gl1_1] = 0 # ember
    # second one
    gl1 = random.randint(1,3)
    gl1_1 = random.randint(2,3) if gl1==3 else (random.randint(1,2) if gl1==1 else random.randint(1,3))
    felder[gl1][gl1_1] = 0 # ember
    # win field
    felder[1][3] = 3
    # reset the visited list
    schon_besucht = []
    grafisch()
    return gib_zustand()
def step(zug):
    # 0 = up, 1 = right, 2 = down, 3 = left
    global playerx
    global playery
    global felder
    global schon_besucht
    if zug == 0:
        if playery != 0:
            playery -= 1
    if zug == 1:
        if playerx != 4:
            playerx += 1
    if zug == 2:
        if playery != 4:
            playery += 1
    if zug == 3:
        if playerx != 0:
            playerx -= 1
    # get the reward
    wert = felder[playery][playerx]
    if wert==0:
        belohnung = -5
    if wert==1:
        belohnung = -1
    if wert==2:
        belohnung = 1
    if wert==3:
        belohnung = 5
    # mark as visited if not lost
    if belohnung != -5:
        schon_besucht.append((playery,playerx))
        felder[playery][playerx] = 1
    grafisch()
    return gib_zustand(), belohnung, belohnung==5, 0 # 0 so that it fits
episoden = 0
tf.reset_default_graph()
#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=[1,4],dtype=tf.float32)
#W1 = tf.Variable(tf.random_uniform([16,8],0,0.01))
W2 = tf.Variable(tf.random_uniform([4,4],0,0.01))
#schicht2 = tf.matmul(inputs1,W1)
Qout = tf.matmul(inputs1,W2)
predict = tf.argmax(Qout,1)
#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)
init = tf.initialize_all_variables()
# Set learning parameters
y = .99
e = 0.1
num_episodes = 10_000
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        #Reset environment and get first new observation
        s = reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network
        while j < 99:
            j+=1
            #Choose an action by greedily (with e chance of random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:s.reshape(1,4)}) # computes the prediction for the input (the input seems to be one-hot encoded here)
            if np.random.rand(1) < e:
                a[0] = random.randint(0,3)
            #Get new state and reward from environment
            s1,r,d,_ = step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:s1.reshape(1,4)})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values
            _,W1 = sess.run([updateModel,W2],feed_dict={inputs1:s.reshape(1,4),nextQ:targetQ})
            rAll += r
            s = s1
            if r == -5 or r == 5:
                if r == 5:
                    episoden+=1
                reset()
                #Reduce chance of random action as we train the model.
                e = 1./((i/50) + 10)
                break
        jList.append(j)
        #print(rAll)
        rList.append(rAll)
print("Percent of successful episodes: " + str((episoden/num_episodes)*100) + "%")
plt.plot(rList)
plt.plot(jList)
I read in a similar question that a reason for too-high Q-values can be that it is in fact possible for the agent to collect an unbounded total reward in a game. That would be the case here if the agent could step on already visited fields and still get a reward of 1; then, of course, the possible total reward would be infinite. But that isn't the case here: the player gets a negative reward (-1) when he does that. A little calculation: the win field is rewarded with 5, an unvisited field with 1, and there is at least one dead field. All in all there are 16 fields, so the maximum possible total reward is 14*1 + 1*5 = 19.
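As a quick sanity check of that bound (my own illustration, using the numbers stated above: 16 fields, one dead field, one win field):
# Upper bound on the total reward per episode, under the assumptions above
fields = 16
dead_fields = 1
win_fields = 1
max_total_reward = (fields - dead_fields - win_fields) * 1 + win_fields * 5
print(max_total_reward)  # -> 19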
Answer 1:
I finally found a solution; it took me almost a week.
The key was to one-hot encode my inputs. That gives me 16 input neurons instead of 4, but it works now. After 1,000 episodes I mostly get around 91% successful episodes.
I'm still wondering why it didn't work when the input wasn't one-hot encoded. I know that an ANN automatically exploits the greater/smaller relations between the different inputs a neuron receives, which can be a disadvantage. But since I ordered my inputs so that a greater input should also mean a greater output, there is no disadvantage here if the ANN uses those relations; on the contrary, it should be an advantage.
That is why I thought it would be good not to one-hot encode the inputs: that way I reduce the dimensionality immensely (4 instead of 16).
Apparently that idea didn't work, though.
However, as I said, with 16 inputs it works now.
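For concreteness, here is a minimal sketch (my own reconstruction, not the answerer's actual code) of what such an encoding could look like: each of the four direction values in {0, 1, 2, 3} becomes a 4-element one-hot vector, which concatenates to the 1x16 input the answer describes; the network would then take a [1,16] input and use a [16,4] weight matrix.
import numpy as np

# Hypothetical helper: expand the 4-value state (up, right, down, left) into
# a 1x16 one-hot vector, as the answer describes. Names are illustrative only.
def one_hot_zustand(zustand):
    # zustand: integer array [up, right, down, left], values in {0, 1, 2, 3}
    encoded = np.zeros((4, 4), dtype=np.float32)
    encoded[np.arange(4), zustand] = 1.0
    return encoded.reshape(1, 16)

# Example: [1, 3, 2, 0] -> the four one-hot blocks [0100 0001 0010 1000]
print(one_hot_zustand(np.array([1, 3, 2, 0])))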
Source: https://stackoverflow.com/questions/61162864/reinforcement-learning-doesnt-work-for-this-very-easy-game-why-q-learning