Question
I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many domain-specific words, and I don't want the BERT tokenizer to break them into word pieces. Is there any solution for this? For example:
from transformers import BertTokenizer
tokenizer = BertTokenizer('bert-base-uncased-vocab.txt')
tokens = tokenizer.tokenize("metastasis")
This produces tokens like:
['meta', '##sta', '##sis']
However, I want to keep such whole words as single tokens, like this:
['metastasis']
Answer 1:
You are free to add new tokens to the existing pretrained tokenizer, but then you need to train your model with the extended tokenizer so that it learns the extra tokens (a sketch of that follow-up step is given after the output below).
Example:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
v = tokenizer.get_vocab()
print(len(v))  # vocabulary size before adding tokens

tokenizer.add_tokens(['whatever', 'underdog'])  # 'whatever' already exists in the vocab
v = tokenizer.get_vocab()
print(len(v))  # vocabulary size after adding tokens
If a token already exists in the vocabulary, like 'whatever', it will not be added again, which is why only one new token shows up in the output below.
Output:
30522
30523
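As a sketch of the follow-up step mentioned above (not part of the original answer): after add_tokens the model's embedding matrix no longer matches the tokenizer size, so with the Transformers library you would typically call resize_token_embeddings and then fine-tune so the new embedding rows learn something useful. Something along these lines, assuming bert-base-uncased:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

num_added = tokenizer.add_tokens(['metastasis'])   # returns how many tokens were actually new
model.resize_token_embeddings(len(tokenizer))      # grow the embedding matrix to the new vocab size

print(tokenizer.tokenize("metastasis"))            # ['metastasis'], kept as a single token
# The new embedding rows are randomly initialized, so the model still needs
# fine-tuning on your data before these tokens carry useful meaning.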
Answer 2:
Based on the discussion here, one way to use your own additional vocabulary of specific words is to modify the first ~1000 lines of the vocab.txt file (the [unused] token lines), replacing them with those words. For example, I replaced '[unused1]' with 'metastasis' in vocab.txt, and after tokenizing with the modified vocab.txt I got this output:
tokens = tokenizer.tokenize("metastasis")
Output: ['metastasis']
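For completeness, a minimal sketch of that vocab.txt edit (the file names and the new_words list below are placeholders, not from the original answer):

new_words = ['metastasis']   # domain-specific words to inject

with open('bert-base-uncased-vocab.txt', encoding='utf-8') as f:
    vocab = f.read().splitlines()

# Overwrite [unusedN] placeholder lines with the new words; the file length
# (and therefore every other token id) stays unchanged.
unused_idx = [i for i, tok in enumerate(vocab) if tok.startswith('[unused')]
for i, word in zip(unused_idx, new_words):
    vocab[i] = word

with open('bert-base-uncased-vocab-modified.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab) + '\n')

from transformers import BertTokenizer
tokenizer = BertTokenizer('bert-base-uncased-vocab-modified.txt')
print(tokenizer.tokenize("metastasis"))   # expected: ['metastasis']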
Source: https://stackoverflow.com/questions/62082938/how-to-stop-bert-from-breaking-apart-specific-words-into-word-piece