Question
I do understand conceptually what an LSTM or GRU should do (thanks to the question "What's the difference between "hidden" and "output" in PyTorch LSTM?"), BUT when I inspect the output of the GRU, h_n and output are NOT the same, while they should be...
(Pdb) rnn_output
tensor([[[ 0.2663, 0.3429, -0.0415, ..., 0.1275, 0.0719, 0.1011],
[-0.1272, 0.3096, -0.0403, ..., 0.0589, -0.0556, -0.3039],
[ 0.1064, 0.2810, -0.1858, ..., 0.3308, 0.1150, -0.3348],
...,
[-0.0929, 0.2826, -0.0554, ..., 0.0176, -0.1552, -0.0427],
[-0.0849, 0.3395, -0.0477, ..., 0.0172, -0.1429, 0.0153],
[-0.0212, 0.1257, -0.2670, ..., -0.0432, 0.2122, -0.1797]]],
grad_fn=<StackBackward>)
(Pdb) hidden
tensor([[[ 0.1700, 0.2388, -0.4159, ..., -0.1949, 0.0692, -0.0630],
[ 0.1304, 0.0426, -0.2874, ..., 0.0882, 0.1394, -0.1899],
[-0.0071, 0.1512, -0.1558, ..., -0.1578, 0.1990, -0.2468],
...,
[ 0.0856, 0.0962, -0.0985, ..., 0.0081, 0.0906, -0.1234],
[ 0.1773, 0.2808, -0.0300, ..., -0.0415, -0.0650, -0.0010],
[ 0.2207, 0.3573, -0.2493, ..., -0.2371, 0.1349, -0.2982]],
[[ 0.2663, 0.3429, -0.0415, ..., 0.1275, 0.0719, 0.1011],
[-0.1272, 0.3096, -0.0403, ..., 0.0589, -0.0556, -0.3039],
[ 0.1064, 0.2810, -0.1858, ..., 0.3308, 0.1150, -0.3348],
...,
[-0.0929, 0.2826, -0.0554, ..., 0.0176, -0.1552, -0.0427],
[-0.0849, 0.3395, -0.0477, ..., 0.0172, -0.1429, 0.0153],
[-0.0212, 0.1257, -0.2670, ..., -0.0432, 0.2122, -0.1797]]],
grad_fn=<StackBackward>)
They look like some transpose of each other... why?
Answer 1:
They are not really the same. Consider that we have the following Unidirectional GRU model:
import torch.nn as nn
import torch
gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True)
Please observe the input shape carefully: because batch_first = True, it is (batch, seq_len, input_size).
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)
Definitely,
torch.equal(out, hn)
False
One of the most effective ways that helped me understand output vs. hidden states was to view hn as hn.view(num_layers, num_directions, batch, hidden_size), where num_directions = 2 for bidirectional recurrent networks (and 1 otherwise, i.e., our case). Thus,
hn_conceptual_view = hn.view(3, 1, 1024, 50)
As the doc states:
h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len (i.e., for the last timestep)
In our case, this contains the hidden vector of every layer for the timestep t = 112, whereas the:
output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features h_t from the last layer of the GRU, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively.
So, consequently, one can do:
torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
True
Explanation: I compare the last timestep of every sequence in the batch, out[:, -1], to the last layer's hidden vectors, hn_conceptual_view[-1, 0, :, :].
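To make the difference between the two tensors concrete, here is a minimal sketch with small, arbitrarily chosen sizes (the names B, S, E, H, X are only for illustration). It uses torch.allclose instead of torch.equal so that the check does not depend on bit-exact results:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, arbitrary sizes chosen only for illustration.
B, S, E, H, X = 4, 7, 8, 5, 3  # batch, seq_len, input size, hidden size, layers

gru = nn.GRU(input_size=E, hidden_size=H, num_layers=X, batch_first=True)
inp = torch.randn(B, S, E)
out, hn = gru(inp)

# out: last layer only, every timestep -> (B, S, H) because batch_first=True
# hn:  every layer, last timestep only -> (X, B, H); batch_first does not affect hn
print(out.shape)  # torch.Size([4, 7, 5])
print(hn.shape)   # torch.Size([3, 4, 5])

# The only overlap between the two is "last layer, last timestep".
last_step_matches = torch.allclose(out[:, -1], hn[-1])
print(last_step_matches)
```

So out and hn generally do not even have the same shape; they only share one slice.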
For Bidirectional GRU (requires reading the unidirectional first):
gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True, bidirectional = True)
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)
View is changed to (since we have two directions):
hn_conceptual_view = hn.view(3, 2, 1024, 50)
If you try the exact same code as before:
torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
False
Explanation: This is because we are comparing tensors of different shapes:
out[:, 0].shape
torch.Size([1024, 100])
hn_conceptual_view[-1, 0, :, :].shape
torch.Size([1024, 50])
Remember that for bidirectional networks, the hidden states of both directions are concatenated at each time step: the first hidden_size features (i.e., out[:, 0, :50]) are the hidden states of the forward network, and the remaining hidden_size features (i.e., out[:, 0, 50:]) are those of the backward network. The correct comparison for the forward network is then:
torch.equal(out[:, -1, :50], hn_conceptual_view[-1, 0, :, :])
True
If you want the hidden states of the backward network, remember that a backward network processes the sequence from time step n down to 1. So you compare the first timestep of the sequence, but the last hidden_size features, and change the hn_conceptual_view direction to 1:
torch.equal(out[:, 0, 50:], hn_conceptual_view[-1, 1, :, :])
True
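Both bidirectional comparisons can be checked together in a small sketch. The sizes are again arbitrary and chosen only for illustration, and torch.allclose is used as a tolerant stand-in for torch.equal:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, arbitrary sizes chosen only for illustration.
B, S, E, H, X = 4, 7, 8, 5, 3  # batch, seq_len, input size, hidden size, layers

gru = nn.GRU(input_size=E, hidden_size=H, num_layers=X,
             batch_first=True, bidirectional=True)
out, hn = gru(torch.randn(B, S, E))  # out: (B, S, 2 * H), hn: (2 * X, B, H)

hn_view = hn.view(X, 2, B, H)  # (layers, directions, batch, hidden)

# The forward direction finishes at the last timestep; its state sits in the first H features.
forward_ok = torch.allclose(out[:, -1, :H], hn_view[-1, 0])

# The backward direction finishes at the first timestep; its state sits in the last H features.
backward_ok = torch.allclose(out[:, 0, H:], hn_view[-1, 1])

print(forward_ok, backward_ok)
```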
In a nutshell, generally speaking:
Unidirectional:
rnn_module = nn.RECURRENT_MODULE(input_size = E, hidden_size = H, num_layers = X, batch_first = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 1, B, H)
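Where RECURRENT_MODULE is either GRU or LSTM (at the time of writing this post), B is the batch size, S the sequence length, and E the embedding size. For an LSTM, the second return value is a tuple (h_n, c_n), so hn here stands for its first element.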
torch.equal(output[:, S - 1, :], hn_conceptual_view[-1, 0, :, :])
True
Again we used the last timestep, index S - 1 (equivalently -1), since the rnn_module is forward (i.e., unidirectional) and indexing is zero-based, so the last of the S timesteps is stored at index S - 1.
Bidirectional:
rnn_module = nn.RECURRENT_MODULE(input_size = E, hidden_size = H, num_layers = X, batch_first = True, bidirectional = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 2, B, H)
Comparison:
torch.equal(output[:, S - 1, :H], hn_conceptual_view[-1, 0, :, :])
True
Above is the forward network comparison; we used :H because the forward network stores its hidden vector in the first H elements of each timestep.
For the backward network:
torch.equal(output[:, 0, H:], hn_conceptual_view[-1, 1, :, :])
True
We changed the direction in hn_conceptual_view to 1 to get the hidden vectors for the backward network.
For all examples we used hn_conceptual_view[-1, ...] because we are only interested in the last layer.
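The general recipe above could be wrapped in a small helper and checked against both GRU and LSTM. This is only a sketch: final_states_from_output is a hypothetical name, not part of PyTorch, and it assumes batch_first = True and a plain (unpacked) input tensor where all sequences have the same length:

```python
import torch
import torch.nn as nn


def final_states_from_output(output, hidden_size, bidirectional):
    """Hypothetical helper: rebuild the last layer's final hidden state(s)
    from a batch_first output of shape (B, S, num_directions * H)."""
    forward = output[:, -1, :hidden_size]      # forward direction ends at the last timestep
    if not bidirectional:
        return forward.unsqueeze(0)            # (1, B, H)
    backward = output[:, 0, hidden_size:]      # backward direction ends at the first timestep
    return torch.stack([forward, backward])    # (2, B, H)


torch.manual_seed(0)
B, S, E, H, X = 4, 7, 8, 5, 3  # arbitrary illustrative sizes
inp = torch.randn(B, S, E)

results = {}
for bidirectional in (False, True):
    d = 2 if bidirectional else 1

    gru = nn.GRU(E, H, num_layers=X, batch_first=True, bidirectional=bidirectional)
    out, hn = gru(inp)
    # hn[-d:] holds the last layer's state for each direction.
    results[("GRU", bidirectional)] = torch.allclose(
        final_states_from_output(out, H, bidirectional), hn[-d:])

    lstm = nn.LSTM(E, H, num_layers=X, batch_first=True, bidirectional=bidirectional)
    out, (hn, cn) = lstm(inp)  # LSTM returns a tuple (h_n, c_n) as its second value
    results[("LSTM", bidirectional)] = torch.allclose(
        final_states_from_output(out, H, bidirectional), hn[-d:])

print(results)
```

Note that for the LSTM only h_n can be recovered from output this way; the cell state c_n is not part of output.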
Answer 2:
It is not the transpose. You get rnn_output = hidden[-1] when the LSTM has one layer.
hidden is an output of every cell in every layer; it should be a 2D array for a specific input time step, but the LSTM returns all the time steps, so the output of a layer should be hidden[-1].
This discussion assumes a batch size of 1; otherwise output and hidden each need one more dimension.
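The tensors printed in the question are consistent with a setup like the one sketched below. The configuration is an assumption, since the question does not show the model: a 2-layer GRU with the default batch_first = False, fed a sequence of length 1. The sizes N, E and H are made up for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed configuration, inferred only from the shapes printed in the question.
N, E, H = 6, 8, 5  # batch size, input size, hidden size (arbitrary values)

gru = nn.GRU(input_size=E, hidden_size=H, num_layers=2)  # batch_first=False by default
inp = torch.randn(1, N, E)                               # (seq_len=1, batch=N, input_size)
rnn_output, hidden = gru(inp)

print(rnn_output.shape)  # torch.Size([1, 6, 5]) -> (seq_len, batch, hidden)
print(hidden.shape)      # torch.Size([2, 6, 5]) -> (num_layers, batch, hidden)

# With a single timestep, the whole output is the last layer's final state.
same_as_last_layer = torch.allclose(rnn_output[0], hidden[-1])
print(same_as_last_layer)
```

Under that assumption the second block of hidden repeats rnn_output because it is the last layer's state, not because one tensor is a transpose of the other.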
Source: https://stackoverflow.com/questions/56677052/is-hidden-and-output-the-same-for-a-gru-unit-in-pytorch