I have a hierarchical model where self-attention (from the Transformer) is used to encode each word in a sentence, and then another self-attention block encodes each sentence — roughly like the sketch below.
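
To make the setup concrete, here is a minimal PyTorch sketch of what I mean. The class name, the mean-pooling step that turns word representations into one vector per sentence, and the omission of positional encodings are all placeholder assumptions, not my actual code:

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        word_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Word-level self-attention: contextualizes the tokens within each sentence.
        self.word_encoder = nn.TransformerEncoder(word_layer, num_layers)
        # Sentence-level self-attention: contextualizes sentence vectors within the document.
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers)

    def forward(self, tokens):
        # tokens: (batch, num_sentences, num_words) of token ids
        # (positional encodings and padding masks omitted for brevity)
        b, s, w = tokens.shape
        x = self.embed(tokens.view(b * s, w))       # (b*s, w, d_model)
        x = self.word_encoder(x)                    # word-level self-attention
        sent_vecs = x.mean(dim=1).view(b, s, -1)    # pool words -> one vector per sentence (assumed mean pooling)
        return self.sent_encoder(sent_vecs)         # sentence-level self-attention

# Example: a batch of 2 documents, 3 sentences each, 5 words per sentence.
doc = torch.randint(0, 1000, (2, 3, 5))
out = HierarchicalEncoder(vocab_size=1000)(doc)
print(out.shape)  # torch.Size([2, 3, 256])
```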