I was training a Transformer model to translate English sentences to German. After less than one epoch of training, the loss had dropped to 0.009. This was