I'm using a BERT tokenizer over a large dataset of sentences (2.3M lines, 6.53bn words):
# creating a BERT tokenizer ('bert-base-uncased' assumed; the original line was truncated)
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
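For a corpus this size, it's usually worth streaming the file and tokenizing in batches rather than loading all 2.3M lines into memory at once. Below is a minimal sketch of that approach, assuming a hypothetical input file sentences.txt and an arbitrary batch_size of 1000; it also swaps in BertTokenizerFast, the Rust-backed tokenizer from the same transformers library, which is considerably faster on large inputs:

from itertools import islice
from transformers import BertTokenizerFast

# the "fast" (Rust-backed) tokenizer is much quicker over large corpora
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize_corpus(path, batch_size=1000):
    # stream the file and tokenize one batch of lines at a time
    with open(path, encoding='utf-8') as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                break
            # truncation keeps each sentence within BERT's 512-token limit
            encodings = tokenizer([line.rstrip('\n') for line in batch],
                                  truncation=True, max_length=512)
            yield from encodings['input_ids']

# usage: iterate lazily so memory stays flat regardless of corpus size
for input_ids in tokenize_corpus('sentences.txt'):
    pass  # write the ids out, feed them to a model, etc.

The generator yields one list of token ids per input line; batch_size is just a starting point worth tuning against your available memory and throughput.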