Question
I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing").
Right now, I have my text preprocessed using a standard tokenizer that splits on spaces / some punctuation, and then I have a lemmatizer ("playing" ->"play"). I'm wondering what the benefit of WordPiece tokenization is over a standard tokenization + lemmatization. I know WordPiece helps with out of vocabulary words, but is there anything else? That is, even if I don't end up using BERT, should I consider replacing my tokenizer + lemmatizer with wordpiece tokenization? In what situations would that be useful?
Answer 1:
WordPiece tokenization helps in multiple ways and should generally work better than a lemmatizer, for a couple of reasons:
- If words like 'playful', 'playing', and 'played' are all lemmatized to 'play', information is lost: for example, that 'playing' is present tense and 'played' is past tense. WordPiece tokenization keeps that information, because the suffix pieces ('##ing', '##ed') stay in the token sequence.
- WordPiece tokens cover every word, including words that do not occur in the vocabulary. An out-of-vocabulary word is split into pieces, so you still get embeddings for those pieces instead of dropping the word or replacing it with an 'unknown' token.
Using WordPiece tokenization instead of a tokenizer + lemmatizer is largely a design choice, and WordPiece tokenization should perform well. But you should take into account that WordPiece tokenization increases the number of tokens, which is not the case with lemmatization.
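The token-count tradeoff can be seen with a toy example (the lemma map and vocabulary below are assumptions for illustration, not real WordNet or BERT resources): every inflected word that lemmatization maps to one token becomes two or more word pieces.

```python
# Hypothetical resources for illustration only.
VOCAB = {"she", "play", "##ed", "and", "##ing"}
LEMMAS = {"played": "play", "playing": "play"}

def lemmatize(words):
    # One output token per input word.
    return [LEMMAS.get(w, w) for w in words]

def wordpiece(words, vocab=VOCAB):
    # Greedy longest-match-first over each word, as in BERT's WordPiece.
    out = []
    for w in words:
        start = 0
        while start < len(w):
            for end in range(len(w), start, -1):
                sub = ("##" if start else "") + w[start:end]
                if sub in vocab:
                    out.append(sub)
                    start = end
                    break
            else:
                out.append("[UNK]")  # no piece matched
                break
    return out

sent = "she played and playing".split()
print(len(lemmatize(sent)))  # 4 tokens after lemmatization
print(len(wordpiece(sent)))  # 6 word pieces: each inflection adds a token
```

The longer sequences matter in practice because model cost and fixed maximum sequence lengths are measured in tokens, not words.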
Source: https://stackoverflow.com/questions/57057992/wordpiece-tokenization-versus-conventional-lemmatization