seq2seq-Attention peeping into the encoder-states bypasses last encoder-hidden-state

主宰稳场 Submitted on 2019-12-25 07:14:09

Question


In the seq2seq-Model I want to use the hidden state at end of encoding to read out further info from the input sequence.

So I return the hidden state and build a new sub-net on top of it. That works decently well. However, I have a doubt: this is supposed to become more complex, so I am effectively relying on ALL the information necessary for the additional task being encoded in that hidden state.

If, however, the seq2seq-decoder uses the attention mechanism, it basically peeps into the encoder side, effectively bypassing the hidden state at end of encoding. Thus NOT ALL the info the seq2seq-network relies on is encoded in the hidden state at end of encoding.

Does that, in theory, mean that I have to forgo the attention mechanism and go with plain-vanilla seq2seq in order to get the maximum out of the hidden state at end of encoding? That would obviously sacrifice a big part of the effectiveness on the seq2seq task itself.

Just trying to get a doubt I am having confirmed. Basically: normally the last encoder hidden state in the seq2seq model would contain ALL the info relevant for decoding. But with attention this is no longer the case, right?

And on a more speculative note, do you agree with these possible solutions:

  - Create an additional attention mechanism for the new sub-net?
  - Or, alternatively, use a convolution over all the hidden states of the encoder side as additional input to the new sub-net?

Any thoughts? Easier fixes?

Thx


Answer 1:


Bottom line: you should try different approaches and see what model works best for your data. Without knowing anything about your data or running some tests, it is impossible to speculate on whether an attention mechanism, a CNN, etc. provides any benefit.
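As a rough illustration of the convolution idea from the question, a 1-D convolution over the stacked encoder hidden states, max-pooled over time, would give the new sub-net a fixed-size summary of ALL encoder states rather than just the final one. This is a minimal NumPy sketch; the shapes, the valid-padding convolution, and the max-pooling step are my assumptions, not anything from the TensorFlow library:

```python
import numpy as np

def conv_over_encoder_states(states, kernel, bias):
    """Slide a 1-D convolution over the time axis of the encoder states.

    states: (T, H)    -- one hidden state per encoder time step
    kernel: (K, H, F) -- width-K filters mapping H features to F channels
    bias:   (F,)
    Returns a (T - K + 1, F) feature map for the extra sub-net.
    """
    T, H = states.shape
    K, _, F = kernel.shape
    out = np.empty((T - K + 1, F))
    for t in range(T - K + 1):
        window = states[t:t + K]  # (K, H) slice of consecutive states
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
states = rng.normal(size=(10, 8))    # 10 time steps, hidden size 8
kernel = rng.normal(size=(3, 8, 4))  # width-3 filters, 4 output channels
bias = np.zeros(4)
features = conv_over_encoder_states(states, kernel, bias)
# Max-pool over time to get one fixed-size vector for the sub-net,
# so the sub-net input does not depend on the sequence length.
pooled = features.max(axis=0)
print(features.shape, pooled.shape)  # (8, 4) (4,)
```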

However, if you are using the TensorFlow seq2seq models available in tensorflow/tensorflow/python/ops/seq2seq.py, let me share some observations about the attention mechanism as implemented in embedding_attention_seq2seq() and attention_decoder() that relate to your question(s):

  1. The hidden state of the decoder is initialized with the final state of the encoder... so attention does not "effectively bypass the hidden state at end of encoding", IMHO.

The following code in embedding_attention_seq2seq() passes in the last time step encoder_state as the initial_state in the 2nd argument:

  return embedding_attention_decoder(
      decoder_inputs, encoder_state, attention_states, cell,
      num_decoder_symbols, embedding_size, num_heads=num_heads,
      output_size=output_size, output_projection=output_projection,
      feed_previous=feed_previous,
      initial_state_attention=initial_state_attention)

And you can see that initial_state is used directly in attention_decoder() without going through any kind of attention states:

state = initial_state

...

for i, inp in enumerate(decoder_inputs):
  if i > 0:
    variable_scope.get_variable_scope().reuse_variables()
  # If loop_function is set, we use it instead of decoder_inputs.
  if loop_function is not None and prev is not None:
    with variable_scope.variable_scope("loop_function", reuse=True):
      inp = loop_function(prev, i)
  # Merge input and previous attentions into one vector of the right size.
  input_size = inp.get_shape().with_rank(2)[1]
  if input_size.value is None:
    raise ValueError("Could not infer input size from input: %s" % inp.name)
  x = linear([inp] + attns, input_size, True)
  # Run the RNN.
  cell_output, state = cell(x, state)
  ...
  2. Attention states are combined with the decoder inputs via a learned linear combination.

    x = linear([inp] + attns, input_size, True)
    # Run the RNN.
    cell_output, state = cell(x, state)

...linear() does the Wx + b matrix operations to project the concatenated input + attentions down to the decoder's input_size. The model learns the values of W and b during training.
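Shape-wise, that down-projection can be sketched as follows. This is a NumPy toy, not the actual TF linear() (which creates the W and b variables internally); the dimensions are made up to show how the attention vector gets folded back into the cell's input width:

```python
import numpy as np

def linear(args, W, b):
    """Toy stand-in for TF's linear(): concatenate the args along the
    feature axis, then apply a single learned affine map W, b."""
    x = np.concatenate(args, axis=-1)  # (batch, input_size + attn_size)
    return x @ W + b                   # (batch, output_size)

batch, input_size, attn_size = 2, 16, 32
inp = np.ones((batch, input_size))
attns = [np.ones((batch, attn_size))]  # one attention head
# W and b are learned in the real model; zeros here just check the shapes.
W = np.zeros((input_size + attn_size, input_size))
b = np.zeros(input_size)
x = linear([inp] + attns, W, b)
print(x.shape)  # (2, 16): the cell sees the same width it would without attention
```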

Summary: the attention states are combined with inputs into the decoder, but the last hidden state of the encoder is fed in as the initial hidden state of the decoder without attention.

Finally, the attention mechanism still has the last encoding state at its disposal and would only "bypass" it if it learned during training that doing so was the best thing to do.



Source: https://stackoverflow.com/questions/37309086/seq2seq-attention-peeping-into-the-encoder-states-bypasses-last-encoder-hidden-s
