LSTM with Attention

Submitted by 坚强是说给别人听的谎言 on 2019-12-04 15:43:31

Question


I am trying to add an attention mechanism to the stacked LSTM implementation at https://github.com/salesforce/awd-lstm-lm

All the examples I have found online use an encoder-decoder architecture, which I do not want to use (is it required for the attention mechanism?).

Basically, I have used https://webcache.googleusercontent.com/search?q=cache:81Q7u36DRPIJ:https://github.com/zhedongzheng/finch/blob/master/nlp-models/pytorch/rnn_attn_text_clf.py+&cd=2&hl=en&ct=clnk&gl=uk as a reference.

def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, dropouth=0.5, dropouti=0.5, dropoute=0.1, wdrop=0, tie_weights=False):
    super(RNNModel, self).__init__()
    self.encoder = nn.Embedding(ntoken, ninp)
    self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid, nhid if l != nlayers - 1 else (ninp if tie_weights else nhid), 1, dropout=0) for l in range(nlayers)]
    for rnn in self.rnns:
        rnn.linear = WeightDrop(rnn.linear, ['weight'], dropout=wdrop)
    self.rnns = torch.nn.ModuleList(self.rnns)
    self.attn_fc = torch.nn.Linear(ninp, 1)
    self.decoder = nn.Linear(nhid, ntoken)

    self.init_weights()

def attention(self, rnn_out, state):
    state = torch.transpose(state, 1,2)
    weights = torch.bmm(rnn_out, state)
    weights = torch.nn.functional.softmax(weights)
    rnn_out_t = torch.transpose(rnn_out, 1, 2)
    bmmed = torch.bmm(rnn_out_t, weights)
    bmmed = bmmed.squeeze(2)
    return bmmed

def forward(self, input, hidden, return_h=False, decoder=False, encoder_outputs=None):
    emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
    emb = self.lockdrop(emb, self.dropouti)

    new_hidden = []
    raw_outputs = []
    outputs = []
    for l, rnn in enumerate(self.rnns):
        temp = []
        for item in emb:
            item = item.unsqueeze(0)
            raw_output, new_h = rnn(item, hidden[l])

            raw_output = self.attention(raw_output, new_h[0])

            temp.append(raw_output)
        raw_output = torch.stack(temp)
        raw_output = raw_output.squeeze(1)

        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1:
            raw_output = self.lockdrop(raw_output, self.dropouth)
            outputs.append(raw_output)
    hidden = new_hidden

    output = self.lockdrop(raw_output, self.dropout)
    outputs.append(output)

    outputs = torch.stack(outputs).squeeze(0)
    outputs = torch.transpose(outputs, 2,1)
    output = output.transpose(2,1)
    output = output.contiguous()
    decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
    result = decoded.view(output.size(0), output.size(1), decoded.size(1))
    if return_h:
        return result, hidden, raw_outputs, outputs
    return result, hidden

This model trains, but the loss is quite high compared to the model without attention.


Answer 1:


I understand your question, but it is a bit tough to follow your code and find the reason why the loss is not decreasing. Also, it is not clear why you want to compare the last hidden state of the RNN with all the hidden states at every time step.

Please note, a particular trick or mechanism is only useful if you apply it in the correct way, and I am not sure that the way you are trying to use attention is correct. So don't expect good results just because you added attention to your model. You should ask yourself why an attention mechanism would benefit your desired task.


You didn't clearly mention which task you are targeting. Since you pointed to a repository that contains language-modeling code, I am guessing the task is: given a sequence of tokens, predict the next token.

One possible problem I can see in your code: in the for item in emb: loop, you always use the embeddings as input to every LSTM layer, so having a stacked LSTM doesn't make sense to me.


Now, let me first answer your question and then show, step by step, how you can build your desired NN architecture.

Do I need to use encoder-decoder architecture to use attention mechanism?

The encoder-decoder architecture is better known as sequence-to-sequence learning, and it is widely used in many generation tasks, for example, machine translation. The answer to your question is no, you are not required to use any specific neural network architecture to use an attention mechanism.


The structure you presented in the figure is a little ambiguous but should be easy to implement. Since your implementation is not clear to me, I will try to guide you toward a better way of implementing it. For the following discussion, I am assuming we are dealing with text inputs.
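
For completeness, the snippets below assume the usual PyTorch imports (this answer keeps the old Variable API, which newer PyTorch versions have since deprecated):

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable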

Let's say we have an input of shape 16 x 10, where 16 is the batch_size and 10 is the seq_len. We can assume we have 16 sentences in a mini-batch, each of length 10.

batch_size, vocab_size = 16, 100
mat = np.random.randint(vocab_size, size=(batch_size, 10))
input_var = Variable(torch.from_numpy(mat))

Here, 100 can be considered the vocabulary size. It is important to note that, throughout this example, I assume batch_size is the first dimension of all the relevant tensors/variables.

Now, let's embed the input variable.

embedding = nn.Embedding(100, 50)
embed = embedding(input_var)

After embedding, we get a variable of shape 16 x 10 x 50, where 50 is the embedding size.
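
As a quick sanity check of that shape:

print(embed.size())  # torch.Size([16, 10, 50])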

Now, let's define a 2-layer unidirectional LSTM with 100 hidden units at each layer.

rnns = nn.ModuleList()
nlayers, input_size, hidden_size = 2, 50, 100
for i in range(nlayers):
    input_size = input_size if i == 0 else hidden_size
    rnns.append(nn.LSTM(input_size, hidden_size, 1, batch_first=True))

Then, we can feed our input to this 2-layer LSTM to get the output.

sent_variable = embed
outputs, hid = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)
    outputs.append(output)
    hid.append(hidden[0].squeeze(0))
    sent_variable = output

rnn_out = torch.cat(outputs, 2)
hid = torch.cat(hid, 1)

Now, you can simply use hid to predict the next word, and I would suggest you do that. Here, the shape of hid is batch_size x (num_layers*hidden_size).
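
A minimal sketch of that baseline, reusing the variables defined above (the decoder name here is illustrative, not part of the original code):

baseline_decoder = nn.Linear(nlayers * hidden_size, vocab_size)  # 200 -> 100 with the sizes above
baseline_logits = baseline_decoder(hid)  # batch_size x vocab_size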

But since you want to use attention to compute a soft alignment score between the last hidden state and each hidden state produced by the LSTM layers, let's do that.

sent_variable = embed
hid, con = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)
    sent_variable = output

    hidden = hidden[0].squeeze(0) # batch_size x hidden_size
    hid.append(hidden)
    weights = torch.bmm(output[:, 0:-1, :], hidden.unsqueeze(2)).squeeze(2)  # batch_size x (seq_len - 1)
    soft_weights = F.softmax(weights, 1)  # batch_size x (seq_len - 1)
    context = torch.bmm(output[:, 0:-1, :].transpose(1, 2), soft_weights.unsqueeze(2)).squeeze(2)  # batch_size x hidden_size
    con.append(context)

hid, con = torch.cat(hid, 1), torch.cat(con, 1)
combined = torch.cat((hid, con), 1)

Here, we compute a soft alignment score between the last state and the states at every time step. Then we compute a context vector, which is just a linear combination of all the hidden states, and we concatenate everything to form a single representation.
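
To make the shapes concrete with the sizes used in this example:

print(hid.size())       # torch.Size([16, 200]) -> batch_size x (nlayers * hidden_size)
print(con.size())       # torch.Size([16, 200])
print(combined.size())  # torch.Size([16, 400]) -> batch_size x (nlayers * hidden_size * 2)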

Please note, I have removed the last time step from output (output[:, 0:-1, :]) because otherwise you would be comparing the last hidden state with itself.

The final combined representation stores the last hidden states and context vectors produced at each layer. You can directly use this representation to predict the next word.

Predicting the next word is straightforward, and a simple linear layer, as you are using, is just fine.


Edit: We can do the following to predict the next word.

decoder = nn.Linear(nlayers * hidden_size * 2, vocab_size)
dec_out = decoder(combined)

Here, the shape of dec_out is batch_size x vocab_size. Now we can compute the negative log-likelihood loss, which will be used for backpropagation later.

Before computing the negative log-likelihood loss, we need to apply log_softmax to the output of the decoder.

dec_out = F.log_softmax(dec_out, 1)
target = np.random.randint(vocab_size, size=(batch_size))
target = Variable(torch.from_numpy(target))

We also define the target, which is required to compute the loss (see NLLLoss for details). Now we can compute the loss as follows.

criterion = nn.NLLLoss()
loss = criterion(dec_out, target)
print(loss)

The printed loss value is:

Variable containing:
 4.6278
[torch.FloatTensor of size 1]
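
As a sanity check, this value is close to ln(100) ≈ 4.6, which is what you would expect when the untrained decoder's predictions are still roughly uniform over a 100-word vocabulary.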

Hope the entire explanation helps you!!




Answer 2:


The whole point of attention is that word order differs across languages, so when decoding the 5th word in the target language you might need to pay attention to the 3rd word (or the encoding of the 3rd word) in the source language, because these are the words that correspond to each other. That is why you mostly see attention used with an encoder-decoder structure.

If I understand correctly, you are doing next-word prediction? In that case it might still make sense to use attention, because the next word might depend strongly on a word 4 steps in the past.

So basically what you need is:

rnn: takes an input of shape MB x ninp and a hidden state of shape MB x nhid, and outputs h of shape MB x nhid.

h, next_hidden = rnn(input, hidden)
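
One way to realize such a one-step rnn is with nn.LSTMCell; the wrapper below is only a sketch under that assumption (names and sizes are illustrative):

import torch
import torch.nn as nn

MB, ninp, nhid = 16, 50, 100
cell = nn.LSTMCell(ninp, nhid)

def rnn(inp, hidden):
    # inp: MB x ninp, hidden: (h, c), each of shape MB x nhid
    h, c = cell(inp, hidden)
    return h, (h, c)

hidden = (torch.zeros(MB, nhid), torch.zeros(MB, nhid))
h, hidden = rnn(torch.randn(MB, ninp), hidden)  # h: MB x nhid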

attention: takes the sequence of h's and the last h_last, and decides how important each of them is by assigning each a weight w.

w = attention(hs, h_last)

where w is of shape seq_len x MB x 1, hs is of shape seq_len x MB x nhid, and h_last is of shape MB x nhid.
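
A minimal sketch of such an attention function, assuming a simple dot-product score (this particular scoring choice is an assumption, not something the answer prescribes):

import torch
import torch.nn.functional as F

def attention(hs, h_last):
    # hs: seq_len x MB x nhid, h_last: MB x nhid
    scores = (hs * h_last.unsqueeze(0)).sum(dim=2)  # seq_len x MB, dot product at each time step
    w = F.softmax(scores, dim=0).unsqueeze(2)       # seq_len x MB x 1, normalized over time
    return w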

Now you weight the hs by w:

h_att = torch.sum(w*hs, dim=0) #shape MB x n_hid

Now the point is you need to do that for every time step:

h_att_list = []
h_list = []
hidden = hidden_init
for word in embedded_words:
    h, hidden = rnn(word, hidden)
    h_list.append(h)
    h_att = attention(torch.stack(h_list), h)
    h_att_list.append(h_att)

And then you can apply the decoder (which might need to be an MLP rather than just a linear transformation) on h_att_list.
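
For example, a small MLP decoder could look like the sketch below (sizes are illustrative; the random tensor stands in for torch.stack(h_att_list)):

import torch
import torch.nn as nn

nhid, ntoken, MB, seq_len = 100, 100, 16, 10
decoder = nn.Sequential(
    nn.Linear(nhid, nhid),
    nn.Tanh(),
    nn.Linear(nhid, ntoken),
)
h_att_seq = torch.randn(seq_len, MB, nhid)  # stands in for torch.stack(h_att_list)
logits = decoder(h_att_seq)                 # seq_len x MB x ntoken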



Source: https://stackoverflow.com/questions/49086221/lstm-with-attention
