I am training a Transformer-based neural machine translation (NMT) model.
The parallel corpus contains 4.5 million sentence pairs. What I am observin