Q-learning network in a custom environment chooses the same action every time, despite a heavy negative reward
Question
So I plugged QLearningDiscreteDense into a Dots and Boxes game I made, using a custom MDP environment I wrote for it. The problem is that the agent chooses action 0 every time. The first time this works, but after that action 0 is no longer available, so every later choice is an illegal move. I give illegal moves a reward of Integer.MIN_VALUE, but it has no effect on the agent's choices. Here's the MDP class:
public class testEnv implements MDP<testState, Integer, DiscreteSpace> {
    final private int maxStep;
    DiscreteSpace actionSpace =