I am trying to train a reinforcement learning model for collision avoidance using stable baselines. I am using a custom policy network. I have trained several models and they ha