Wordpiece tokenization versus conventional lemmatization?

Submitted by 蹲街弑〆低调 on 2021-01-02 06:28:10

Question


I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing").

Right now, I have my text preprocessed using a standard tokenizer that splits on spaces / some punctuation, followed by a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece tokenization is over standard tokenization + lemmatization. I know WordPiece helps with out-of-vocabulary words, but is there anything else? That is, even if I don't end up using BERT, should I consider replacing my tokenizer + lemmatizer with WordPiece tokenization? In what situations would that be useful?
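For concreteness, here is a minimal sketch of the two pipelines side by side. This is purely illustrative: it assumes the Hugging Face transformers library and NLTK are installed, and the checkpoint name and example sentence are arbitrary choices.

```python
# A minimal, illustrative sketch of the two pipelines side by side.
# Assumes the Hugging Face `transformers` library and NLTK are installed;
# the checkpoint name and example sentence are arbitrary choices.
from transformers import BertTokenizer
from nltk.stem import WordNetLemmatizer

sentence = "the children were playing outside"

# Pipeline 1: WordPiece tokenization, as used by BERT.
wp_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(wp_tokenizer.tokenize(sentence))
# Rare or unseen words get split into sub-word pieces marked with '##'.

# Pipeline 2: whitespace tokenization + lemmatization.
# (Requires the WordNet corpus: nltk.download("wordnet"))
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in sentence.split()])
# Each word is mapped to its base form where WordNet knows one
# ("playing" -> "play", "were" -> "be"); other words pass through.
```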


Answer 1:


WordPiece tokenization helps in multiple ways and is arguably better than a lemmatizer, for several reasons:

  1. If the words 'playful', 'playing', and 'played' are all lemmatized to 'play', information is lost, e.g. that 'playing' is present tense and 'played' is past tense. WordPiece tokenization keeps this distinction ('play' + '##ing' vs. 'play' + '##ed').
  2. WordPiece tokens cover every word, including words that do not occur in the vocabulary: an unseen word is split into known word pieces, each of which has an embedding, instead of the word being dropped or replaced with an 'unknown' token (see the sketch after this list).
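As a hedged illustration of point 2, assuming Hugging Face transformers and a deliberately made-up input word:

```python
# Hedged illustration of point 2, assuming Hugging Face `transformers`;
# the input is a deliberately made-up word.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A lemmatizer has no entry for this word, and a fixed-vocabulary model
# would map it to a single '[UNK]' token. WordPiece instead backs off to
# known sub-word pieces (something like ['flu', '##xo', '##tron']), each
# of which already has a trained embedding.
print(tokenizer.tokenize("fluxotron"))
```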

Using WordPiece tokenization instead of a tokenizer + lemmatizer is largely a design choice, and WordPiece tokenization should perform well. One thing to take into account, though, is that WordPiece tokenization increases the number of tokens, which lemmatization does not.
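A rough sketch of that token-count effect (same illustrative libraries as above; exact counts depend on the vocabulary and the text):

```python
# Rough sketch of the token-count effect (illustrative only; exact
# counts depend on the vocabulary and the text).
from transformers import BertTokenizer
from nltk.stem import WordNetLemmatizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
lemmatizer = WordNetLemmatizer()

sentence = "the researchers were tokenizing unquestionably long documents"

wordpiece_tokens = tokenizer.tokenize(sentence)
lemmas = [lemmatizer.lemmatize(w, pos="v") for w in sentence.split()]

# WordPiece can only keep or grow the count (every split adds tokens),
# while lemmatization yields exactly one token per input word.
print(len(wordpiece_tokens), len(lemmas))
```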



Source: https://stackoverflow.com/questions/57057992/wordpiece-tokenization-versus-conventional-lemmatization
