I'm trying to implement the TensorFlow version of this gist about reinforcement learning. Based on the comments, it uses binary cross-entropy from logits. I tried to use