How to handle <ukn> tokens in text generation

Question

In my text generation dataset, I have converted all infrequent words into the <ukn> token (unknown word), as suggested by most text-generation literature.

However, when training an RNN to take part of a sentence as input and predict the rest of the sentence, I am not sure how to stop the network from generating <ukn> tokens. When the network encounters an unknown (infrequent) word in the training set, what should its output be?

Example:
Sentence: I went to the mall and bought a <ukn> and some groceries
Network input: I went to the mall and bought a
Current network output: <ukn> and some groceries
Desired network output: ??? and some groceries

What should it output instead of <ukn>?

I don't want to build a generator that outputs words it does not know.

Answer 1:

An RNN gives you a probability for each token that could appear next in your text. In your code you choose the token with the highest probability, which in this case is <ukn>.

Here you can skip the <ukn> token and simply take the next most likely token that the RNN suggests, based on the probability values it produces.
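
A minimal sketch of that idea in plain Python, assuming the network's output for one step is already available as a list of probabilities aligned with the vocabulary. The names `next_known_token`, `renormalize_without`, `vocab`, and `probs` are hypothetical and the numbers are made up for illustration; they do not come from the question's code.

```python
import math

def next_known_token(probs, vocab, unk_token="<ukn>"):
    """Return the most probable token that is not the unknown-word token."""
    best_token, best_prob = None, -math.inf
    for token, p in zip(vocab, probs):
        if token == unk_token:
            continue  # never pick the unknown-word token
        if p > best_prob:
            best_token, best_prob = token, p
    return best_token

def renormalize_without(probs, vocab, unk_token="<ukn>"):
    """Zero out the unknown-word token and rescale the rest to sum to 1.

    Useful if you sample from the distribution instead of taking the argmax.
    """
    masked = [0.0 if t == unk_token else p for t, p in zip(vocab, probs)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("no probability mass left after masking")
    return [p / total for p in masked]

# Hypothetical one-step output of the network (illustrative values only).
vocab = ["<ukn>", "shirt", "book", "and"]
probs = [0.5, 0.3, 0.15, 0.05]

print(next_known_token(probs, vocab))  # shirt
```

With a real framework the same effect is usually obtained by setting the logit at the unknown token's index to negative infinity before the softmax, so the token can never be selected or sampled.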

Answer 2:

I've seen <UNK> occasionally, but never <UKN>.

Even more common in word-embedding training is dropping rare words entirely, to keep vocabularies compact and to keep words without sufficient examples from acting as 'noise' in the training of other words. (Folding them all into a single magic unknown-token, which then becomes more frequent than real tokens, would just tend to throw a big unnatural pseudo-word with no clear meaning into every other word's contexts.)
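
To make the difference between the two preprocessing choices concrete, here is a small sketch that builds a vocabulary by frequency and then either replaces rare words with an unknown token or drops them. The function names, the `min_count` threshold, and the two-sentence corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for s in sentences for word in s.split())
    return {w for w, c in counts.items() if c >= min_count}

def replace_rare(sentence, vocab, unk_token="<UNK>"):
    """Fold every out-of-vocabulary word into a single unknown token."""
    return [w if w in vocab else unk_token for w in sentence.split()]

def drop_rare(sentence, vocab):
    """Remove out-of-vocabulary words entirely."""
    return [w for w in sentence.split() if w in vocab]

# Toy corpus (an assumption): "lamp" and "kettle" each occur only once.
corpus = [
    "I bought a lamp and some groceries",
    "I bought a kettle and some groceries",
]
vocab = build_vocab(corpus, min_count=2)

print(replace_rare(corpus[0], vocab))
# ['I', 'bought', 'a', '<UNK>', 'and', 'some', 'groceries']
print(drop_rare(corpus[0], vocab))
# ['I', 'bought', 'a', 'and', 'some', 'groceries']
```

If rare words are dropped, the model never sees an unknown token as a training target, so it cannot learn to emit one. The cost is that the surrounding words become artificially adjacent, which may matter more for generation than for embedding training.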

So, I'm not sure it's accurate to describe this as "suggested by most text-generation literature". And to the extent it might be, wouldn't any source suggesting this also say what to do when a prediction is the UNK token?

If your specific application needs a real known word instead, even when the NN has low confidence that the right word is any known word, it would seem you'd just read the next-best non-<UKN> prediction from the NN, as suggested by @petezurich's answer.

Source: https://stackoverflow.com/questions/51913706/how-to-handle-ukn-tokens-in-text-generation

Tags

machine-learning