I am implementing a SARSA reinforcement learning function that chooses the next action by following the current policy and then updates its Q-values.
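For context, here is a minimal sketch of a generic tabular SARSA step, not the code in question. It assumes an epsilon-greedy policy and a dictionary-backed Q-table; the names `epsilon_greedy`, `sarsa_update`, `alpha`, `gamma`, and `epsilon` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    # Greedy choice: the action with the highest current Q-value in this state.
    return max(range(n_actions), key=lambda a: Q[(state, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """On-policy TD update: the target uses the action actually chosen next."""
    # For a terminal transition there is no next action to bootstrap from.
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical usage: Q maps (state, action) pairs to values, defaulting to 0.
Q = defaultdict(float)
rng = random.Random(0)
a_next = epsilon_greedy(Q, state=1, n_actions=2, epsilon=0.1, rng=rng)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=a_next)
```

The point that separates SARSA from Q-learning is that `a_next` comes from the same policy that is being followed, rather than from a maximum over actions.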
My implementation throws the following error: