Get probability of multi-token word in MASK position

橙三吉。 submitted on 2020-12-05 11:57:31

Question


It is relatively easy to get a token's probability according to a language model, as the snippet below shows. You take the output of the model, restrict yourself to the output at the masked position, and then look up the probability of your requested token in that output vector. However, this only works for single-token words, i.e. words that are themselves in the tokenizer's vocabulary. When a word is not in the vocabulary, the tokenizer splits it into pieces that it does know (see the bottom of the example). But since the input sentence contains only one masked position, and the requested word consists of more tokens than that, how can we get its probability? Ultimately I am looking for a solution that works regardless of the number of subword units a word has.

In the code below I have added many comments explaining what is going on, and I have included the output of each print statement as a comment. You'll see that predicting tokens such as 'love' and 'hate' is straightforward because they are in the tokenizer's vocabulary. 'reprimand' is not, though, so it cannot be predicted in a single masked position: it consists of three subword units. So how can we predict 'reprimand' in the masked position?

from transformers import BertTokenizer, BertForMaskedLM
import torch

# init model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# init softmax to get probabilities later on
sm = torch.nn.Softmax(dim=0)
torch.set_grad_enabled(False)

# set sentence with MASK token, convert to token_ids
sentence = f"I {tokenizer.mask_token} you"
token_ids = tokenizer.encode(sentence, return_tensors='pt')
print(token_ids)
# tensor([[ 101, 1045,  103, 2017,  102]])
# get the position of the masked token
masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero().item()

# forward pass
output = model(token_ids)
# output[0] holds the prediction logits, one row of vocabulary scores per input token
logits = output[0].squeeze(0)
# only keep the logits for the masked position
# (a 1-D vector with one score per vocabulary item)
mask_logits = logits[masked_position]
# convert to probabilities (softmax over the vocabulary)
probs = sm(mask_logits)

# get probability of token 'hate'
hate_id = tokenizer.convert_tokens_to_ids('hate')
print('hate probability', probs[hate_id].item())
# hate probability 0.008057191967964172

# get probability of token 'love'
love_id = tokenizer.convert_tokens_to_ids('love')
print('love probability', probs[love_id].item())
# love probability 0.6704086065292358

# get probability of token 'reprimand' (?)
reprimand_id = tokenizer.convert_tokens_to_ids('reprimand')
# reprimand is not in the vocabulary, so it needs to be split into subword units
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# [UNK]

reprimand_id = tokenizer.encode('reprimand', add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(reprimand_id))
# ['rep', '##rim', '##and']
# but how do we now get the probability of a multi-token word in a single-token position?

Answer 1:


Since the word (before it is split) is not in the vocabulary, BERT simply does not know its probability, so there is no point in masking it before tokenization.

And you can't get its probability by exploiting the chain rule either; see the response by J. Devlin. To illustrate, let's take a more generic example. Try to estimate the probability of some bigram at position i. While you can estimate the probability of each word given the rest of the sentence and its position,

P(w_i | w_0, w_1, ..., w_{i-1}, w_{i+1}, ..., w_N),

P(w_{i+1} | w_0, w_1, ..., w_i, w_{i+2}, ..., w_N),

there is no way to get the probability of the bigram

P(w_i, w_{i+1} | w_0, w_1, ..., w_{i-1}, w_{i+2}, ..., w_N)

because BERT does not store such information.

Having said all that, you can get a very rough estimate of the probability of your OOV word by multiplying the probabilities of seeing its parts. So you will get

P("reprimand"|...) ~= P("rep"|...)*P("##rim"|...)*P("##and"|...)

Since your subwords are not regular words but a special kind of word, this is not entirely wrong, because the dependency between them is implicit.
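The product above is easy to compute once the per-subword probabilities are in hand, however they were read from the model. Below is a minimal sketch, assuming those probabilities are already available as plain floats. The helper name approx_word_probability and the example values are hypothetical and are not model output. Summing logs instead of multiplying directly keeps long subword chains from underflowing.

```python
import math


def approx_word_probability(subword_probs):
    """Rough word probability as the product of its subword probabilities.

    `subword_probs` is a hypothetical list such as
    [P("rep"|...), P("##rim"|...), P("##and"|...)].
    """
    if not subword_probs:
        raise ValueError("need at least one subword probability")
    if any(p < 0.0 or p > 1.0 for p in subword_probs):
        raise ValueError("probabilities must lie in [0, 1]")
    # A single zero makes the whole product zero (and log(0) is undefined).
    if any(p == 0.0 for p in subword_probs):
        return 0.0
    # Sum of logs is the log of the product.
    log_total = sum(math.log(p) for p in subword_probs)
    return math.exp(log_total)


# Made-up illustrative values, not real BERT output.
print(approx_word_probability([0.5, 0.25, 0.5]))
```

Keep in mind that this is only the rough estimate described above: it treats the subword probabilities as if they could simply be multiplied, which BERT does not guarantee.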




Answer 2:


Instead of sentence = f"I {tokenizer.mask_token} you", predict on "I [MASK] [MASK] you" and "I [MASK] [MASK] [MASK] you", then filter the results: drop chains made up of whole-word tokens, so that only chains of suitable subwords remain. You will of course get better results if you provide more than two words of surrounding context.
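The input-building and filtering steps could look roughly like the sketch below. It assumes the WordPiece convention visible in the question's output, where continuation pieces start with "##". The helper names (build_masked_sentence, is_single_word_chain, join_wordpieces) are hypothetical and not part of transformers; the model call that produces the candidate chains is left out.

```python
def build_masked_sentence(left, right, mask_token, n_masks):
    """Place `n_masks` mask tokens between the left and right context."""
    parts = [left] + [mask_token] * n_masks + [right]
    return " ".join(p for p in parts if p)


def is_single_word_chain(tokens):
    """True if the predicted tokens form one word: a word-initial piece
    followed only by '##' continuation pieces."""
    if not tokens:
        return False
    head, tail = tokens[0], tokens[1:]
    return not head.startswith("##") and all(t.startswith("##") for t in tail)


def join_wordpieces(tokens):
    """Glue a valid chain back together, dropping the '##' prefixes."""
    return tokens[0] + "".join(t[2:] for t in tokens[1:])


print(build_masked_sentence("I", "you", "[MASK]", 3))

# Hypothetical candidate chains, one predicted token per masked position.
candidates = [["rep", "##rim", "##and"], ["love", "and", "miss"]]
words = [join_wordpieces(c) for c in candidates if is_single_word_chain(c)]
print(words)
```

Chains such as ["love", "and", "miss"] are dropped because each token is a whole word on its own, so together they fill the masked positions with several words rather than one.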

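But before you embark on that, reconsider your softmax. With dim=0, softmax normalizes along the first dimension. That is only what you want when the input is the 1-D vector of vocabulary scores for a single position. On a 2-D tensor with one row per token, it normalizes each column across the rows instead of giving each row its own probability distribution: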

In [1]: import torch                                                                                                                      
In [2]: m = torch.nn.Softmax(dim=1) 
   ...: input = torch.randn(2, 3) 
   ...: input                                                                                                                        
Out[2]: 
tensor([[ 1.5542,  0.3776, -0.8047],
        [-0.3856,  1.1327, -0.1252]])

In [3]: m(input)                                                                                                                          
Out[3]: 
tensor([[0.7128, 0.2198, 0.0674],
        [0.1457, 0.6652, 0.1891]])

In [4]: soft = torch.nn.Softmax(dim=0) 
   ...: soft(input)                                                                                                                       
Out[4]: 
tensor([[0.8743, 0.3197, 0.3364],
        [0.1257, 0.6803, 0.6636]])


Source: https://stackoverflow.com/questions/59435020/get-probability-of-multi-token-word-in-mask-position
