Question
It is relatively easy to get a token's probability according to a language model, as the snippet below shows. You take the output of the model, restrict yourself to the output at the masked position, and then look up the probability of your requested token in that output vector. However, this only works for single-token words, i.e. words that are themselves in the tokenizer's vocabulary. When a word is not in the vocabulary, the tokenizer splits it into pieces that it does know (see the bottom of the example). But since the input sentence contains only one masked position, and the requested word consists of more tokens than that, how can we get its probability? Ultimately I am looking for a solution that works regardless of the number of subword units a word has.
In the code below I have added comments explaining what is going on, and I have included the output of each print statement as a comment. You'll see that predicting tokens such as 'love' and 'hate' is straightforward because they are in the tokenizer's vocabulary. 'reprimand' is not, though, so it cannot be predicted in a single masked position: it consists of three subword units. So how can we predict 'reprimand' in the masked position?
from transformers import BertTokenizer, BertForMaskedLM
import torch
# init model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# init softmax to get probabilities later on
sm = torch.nn.Softmax(dim=0)
torch.set_grad_enabled(False)
# set sentence with MASK token, convert to token_ids
sentence = f"I {tokenizer.mask_token} you"
token_ids = tokenizer.encode(sentence, return_tensors='pt')
print(token_ids)
# tensor([[ 101, 1045, 103, 2017, 102]])
# get the position of the masked token
masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero().item()
# forward pass
output = model(token_ids)
# output[0] holds the prediction logits (not hidden states),
# shape (sequence_length, vocab_size) after squeezing the batch dimension
logits = output[0].squeeze(0)
# only keep the logits at the masked position
# this vector has one entry per vocabulary item
mask_logits = logits[masked_position]
# convert to probabilities (softmax over the vocabulary)
probs = sm(mask_logits)
# get probability of token 'hate'
hate_id = tokenizer.convert_tokens_to_ids('hate')
print('hate probability', probs[hate_id].item())
# hate probability 0.008057191967964172
# get probability of token 'love'
love_id = tokenizer.convert_tokens_to_ids('love')
print('love probability', probs[love_id].item())
# love probability 0.6704086065292358
# get probability of token 'reprimand' (?)
reprimand_id = tokenizer.convert_tokens_to_ids('reprimand')
# reprimand is not in the vocabulary, so it needs to be split into subword units
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# [UNK]
reprimand_id = tokenizer.encode('reprimand', add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# ['rep', '##rim', '##and']
# but how do we now get the probability of a multi-token word in a single-token position?
Answer 1:
Since the word that gets split is not present in the vocabulary, BERT simply has no probability for it, so there is no use in masking it before tokenization.
And you can't get its probability by exploiting the chain rule; see the response by J. Devlin. To illustrate, let's take a more generic example and try to estimate the probability of some bigram at position i. While you can estimate the probability of each word given the rest of the sentence and their positions,
P(w_i | w_0, w_1, ..., w_{i-1}, w_{i+1}, ..., w_N),
P(w_{i+1} | w_0, w_1, ..., w_i, w_{i+2}, ..., w_N),
there is no way to get the probability of the bigram
P(w_i, w_{i+1} | w_0, w_1, ..., w_{i-1}, w_{i+2}, ..., w_N)
because BERT does not store such information.
Having said all that, you can get a very rough estimate of the probability of your OOV word by multiplying the probabilities of seeing its parts. So you will get
P("reprimand" | ...) ~= P("rep" | ...) * P("##rim" | ...) * P("##and" | ...)
Since your subwords are not regular words but a special kind of token, this is not entirely wrong, because the dependency between them is implicit.
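The rough estimate above could be sketched as follows. This is only an illustration under assumptions, not a definitive implementation: the helper name `rough_word_log_prob` is made up for this example, it works on a plain logits tensor rather than calling the model, and it sums log-probabilities (equivalent to multiplying probabilities, but less prone to underflow). It treats the pieces as independent given the context, so the result is a heuristic score, not a true joint probability.

```python
import torch


def rough_word_log_prob(logits, mask_positions, piece_ids):
    """Hypothetical helper: sum of per-position log-probabilities of a word's pieces.

    logits:         tensor of shape (sequence_length, vocab_size) for one sentence
    mask_positions: indices of the [MASK] tokens, left to right
    piece_ids:      vocabulary ids of the word's subword pieces, in the same order
    """
    if len(mask_positions) != len(piece_ids):
        raise ValueError("need exactly one mask position per subword piece")
    # normalise over the vocabulary axis, separately for every position
    log_probs = torch.log_softmax(logits, dim=-1)
    # pick log P(piece_k | context) at the k-th masked position
    picked = log_probs[torch.tensor(mask_positions), torch.tensor(piece_ids)]
    return picked.sum().item()


# toy stand-in for model output (assumption): 5 positions, vocabulary of 4 entries,
# all logits equal, so every entry has probability 1/4 at every position
toy_logits = torch.zeros(5, 4)
score = rough_word_log_prob(toy_logits, [2, 3], [1, 3])
print(score)  # 2 * log(1/4), roughly -2.77
```

With the real model, one would presumably build the sentence with one mask per piece (three for 'reprimand'), pass `model(token_ids)[0].squeeze(0)` as `logits`, and use the ids from `tokenizer.encode('reprimand', add_special_tokens=False)` as `piece_ids`.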
Answer 2:
Instead of
sentence = f"I {tokenizer.mask_token} you"
predict on
"I [MASK] [MASK] you"
and
"I [MASK] [MASK] [MASK] you"
and filter the results, dropping chains of whole-word tokens so that you keep only chains of suitable subwords. Of course, you will get better results if you provide more than two surrounding context words.
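One way to read the filtering step, as a minimal sketch: keep a candidate chain only if its first token is a word-initial piece and every later token is a `##` continuation piece. The function names and the candidate chains below are invented for illustration; in practice the chains would come from the top predictions at each masked position.

```python
def is_single_word_chain(tokens):
    """True if the WordPiece tokens form one word: a start piece, then only '##' pieces."""
    if not tokens:
        return False
    first, rest = tokens[0], tokens[1:]
    return not first.startswith("##") and all(t.startswith("##") for t in rest)


def join_pieces(tokens):
    """Glue a valid chain back into a word by stripping the '##' prefixes."""
    return tokens[0] + "".join(t[2:] for t in tokens[1:])


# hypothetical candidate chains, one token per masked position
candidates = [
    ["rep", "##rim", "##and"],  # subword chain forming a single word: keep
    ["really", "love"],         # two whole words: drop
    ["##ly", "##ing"],          # starts with a continuation piece: drop
]
kept = [join_pieces(c) for c in candidates if is_single_word_chain(c)]
print(kept)  # ['reprimand']
```

Note that this check alone would still let through chains starting with punctuation or special tokens such as [UNK], so a real filter would likely need a few extra rules.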
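But before you embark on that, reconsider your softmax. On a 2-D tensor, dim=0 normalizes down each column, i.e. across the rows (token positions), rather than across the vocabulary entries of the single token you want probabilities for. Your snippet applies it to a 1-D vector, where dim=0 is fine, but once you keep the logits of several masked positions as a (positions, vocabulary) tensor you need dim=1 (or dim=-1):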
In [1]: import torch
In [2]: m = torch.nn.Softmax(dim=1)
...: input = torch.randn(2, 3)
...: input
Out[2]:
tensor([[ 1.5542, 0.3776, -0.8047],
[-0.3856, 1.1327, -0.1252]])
In [3]: m(input)
Out[3]:
tensor([[0.7128, 0.2198, 0.0674],
[0.1457, 0.6652, 0.1891]])
In [4]: soft = torch.nn.Softmax(dim=0)
...: soft(input)
Out[4]:
tensor([[0.8743, 0.3197, 0.3364],
[0.1257, 0.6803, 0.6636]])
Source: https://stackoverflow.com/questions/59435020/get-probability-of-multi-token-word-in-mask-position