How to stop BERT from breaking apart specific words into word-pieces
Question: I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many domain-specific words, and I don't want the BERT tokenizer to break them into word-pieces. Is there any solution for this? For example:

tokenizer = BertTokenizer('bert-base-uncased-vocab.txt')
tokens = tokenizer.tokenize("metastasis")

This produces tokens like:

['meta', '##sta', '##sis']

However, I want to keep the whole word as a single token, like this:

['metastasis']

Answer 1: You are free to add new tokens to the