Batch-major vs time-major LSTM

Question

Do RNNs learn different dependency patterns when the input is batch-major as opposed to time-major?

Answer 1:

(Edit: sorry, my initial argument was about why the default makes sense, but I realized that it doesn't, so this is a little off-topic.)

I haven't found the TensorFlow team's reasoning behind this, but it does not make computational sense, as the ops are written in C++.

Intuitively, we want to combine (multiply, add, etc.) different features from the same sequence at the same timestep. Different timesteps can't be processed in parallel, while different sequences in a batch can, so the preferred order of contiguity is feature > batch/sequence > timestep.

By default, NumPy and C++ use a row-major (C-like) memory layout, so

[[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]

is laid out as [0,1,2,3,4,5,6,7,8] in memory. This means that if we have

x = np.zeros([time,batch,feature])

(time_major=True in TensorFlow)

in row-major memory we get the layout x[0,0,0], x[0,0,1], x[0,0,2], …, x[0,1,0], ... so, for example, the dot product of the weights with the vector for one sequence and timestep (w*x[t,b,:]) is the most contiguous operation, followed by the next sequence w*x[t,b+1,:], etc. This is what we want during training.

With time_major=False, which is the default, we have [batch,time,feature], so features from the same sequence but different timesteps are more contiguous, i.e. w*x[batch,t,:] is followed by w*x[batch,t+1,:], etc. This might be faster for predicting one sequence at a time if the RNN is unrolled, but this is speculation.
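The two layouts above can be compared by looking at NumPy strides, i.e. the number of bytes skipped when moving one step along each axis. This is only a small sketch: the sizes are made up for illustration, and it describes NumPy arrays, not necessarily how TensorFlow lays out its tensors internally.

```python
import numpy as np

time, batch, feature = 4, 2, 3  # illustrative sizes (assumed)

time_major = np.zeros([time, batch, feature], dtype=np.float32)
batch_major = np.zeros([batch, time, feature], dtype=np.float32)

# Strides: bytes to skip to advance one index along each axis.
print(time_major.strides)   # (24, 12, 4)
print(batch_major.strides)  # (48, 12, 4)
```

In the time-major array the next sequence at the same timestep is 12 bytes away and the next timestep 24 bytes away. In the batch-major array the next timestep of the same sequence is 12 bytes away, while the next sequence is 48 bytes away.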

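If you came to this question for the same reason I did: I learned to be careful with NumPy indexing, which is slightly unintuitive here because it is meant to be Pythonic and does not address the flat row-major buffer. Look at this. As expected:

import numpy as np
x = np.zeros([3,3])
x.flat = np.arange(9)  # fill the array in row-major order
print(x)
>   [[ 0.  1.  2.]
>    [ 3.  4.  5.]
>    [ 6.  7.  8.]]

One might also expect x[1] == x[0,1], but a single index selects a whole row along axis 0, and a boolean mask longer than axis 0 is rejected (the exact error message depends on the NumPy version):

print(x[1])
> [ 3.  4.  5.]

print(x[np.arange(10)<=4])
> IndexError: index 3 is out of bounds for axis 0 with size 3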
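If flat, row-major addressing is what you actually want, NumPy offers explicit ways to get it. The following is a minimal sketch using `ndarray.flat`, `np.unravel_index`, and `ravel`; the variable names are chosen only for illustration.

```python
import numpy as np

x = np.arange(9.0).reshape(3, 3)

row = x[1]                 # whole second row: [3., 4., 5.]
flat_elem = x.flat[1]      # element at flat position 1, i.e. x[0, 1]

# Convert a flat (row-major) position into a (row, column) index.
r, c = np.unravel_index(4, x.shape)   # flat position 4 -> (1, 1)

# A boolean mask over the flat buffer must match the flattened length.
mask = np.arange(9) <= 4
selected = x.ravel()[mask]            # [0., 1., 2., 3., 4.]
```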

Answer 2:

There is no difference in what the model learns.

At timestep t, an RNN needs the results from t-1, so the computation has to proceed time-major. If time_major=False, TensorFlow transposes the batch of sequences from (batch_size, max_sequence_length) to (max_sequence_length, batch_size)*. It then processes the transposed batch one row at a time: at t=0, the first element of each sequence is processed and the hidden states and outputs are calculated; at t=max_sequence_length-1, the last element of each sequence is processed.
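The processing order described above can be sketched with a toy tanh RNN in NumPy. This is not TensorFlow's implementation: `run_rnn`, the weight names, and the sizes are hypothetical and only illustrate that a batch-major input is transposed, iterated over time, and yields the same hidden states as the time-major path.

```python
import numpy as np

def run_rnn(x, w_x, w_h, time_major=False):
    """Toy tanh RNN; returns all hidden states in the input's layout."""
    if not time_major:
        # [batch, time, feature] -> [time, batch, feature]
        x = np.transpose(x, (1, 0, 2))
    steps, batch, _ = x.shape
    h = np.zeros((batch, w_h.shape[0]))
    outputs = []
    for t in range(steps):
        # Step t needs h from step t-1, so time is the outer loop.
        h = np.tanh(x[t] @ w_x + h @ w_h)
        outputs.append(h)
    out = np.stack(outputs)  # [time, batch, hidden]
    return out if time_major else np.transpose(out, (1, 0, 2))

rng = np.random.default_rng(0)
x_bm = rng.normal(size=(2, 5, 3))   # batch-major input (assumed sizes)
w_x = rng.normal(size=(3, 4))
w_h = rng.normal(size=(4, 4))

out_bm = run_rnn(x_bm, w_x, w_h, time_major=False)
out_tm = run_rnn(np.transpose(x_bm, (1, 0, 2)), w_x, w_h, time_major=True)

# Same hidden states either way; only the axis order differs.
same = np.allclose(out_bm, np.transpose(out_tm, (1, 0, 2)))
```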

So if your data is already time-major, use time_major=True, which avoids a transpose. But there isn't much point in manually transposing your data before feeding it to TensorFlow.

*If you have multidimensional inputs (e.g. sequences of word embeddings with shape (batch_size, max_sequence_length, embedding_size)), axes 0 and 1 are transposed, giving (max_sequence_length, batch_size, embedding_size).
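The axis swap from the footnote can be illustrated in NumPy as follows. The sizes are assumed for the example; only axes 0 and 1 are exchanged and the embedding axis stays last.

```python
import numpy as np

batch_size, max_sequence_length, embedding_size = 2, 4, 3  # assumed sizes

x_bm = np.arange(batch_size * max_sequence_length * embedding_size).reshape(
    batch_size, max_sequence_length, embedding_size)

# Swap axes 0 and 1 only: [batch, time, embedding] -> [time, batch, embedding].
x_tm = np.transpose(x_bm, (1, 0, 2))

print(x_tm.shape)  # (4, 2, 3)
# The embedding of sequence 1 at timestep 3 is unchanged, only re-addressed.
print(np.array_equal(x_tm[3, 1], x_bm[1, 3]))  # True
```

In TensorFlow the same permutation would be written as tf.transpose(x, perm=[1, 0, 2]).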

Source: https://stackoverflow.com/questions/42130491/batch-major-vs-time-major-lstm

Tags

python

tensorflow

deep-learning

lstm

recurrent-neural-network