I implemented a simple actor-critic model in Tensorflow==2.3.1 to learn Cartpole environment. But it is not learning at all. The average scores of every 50 episodes is below