How to handle variable-length data for LSTM


Question


From what I know, the general steps to preprocess data for an LSTM are as follows:

from tensorflow import keras

vocab_size = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review

(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")

# Pad/truncate every review to exactly `maxlen` tokens
x_train0 = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val0 = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)

Here, x_train consists of 25,000 samples of variable length. After sequence padding, any review longer than 200 tokens is truncated to length 200, and any review shorter than 200 is padded with 0s.
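
As a quick toy illustration of that truncate/pad behaviour (my own example, not part of the original question):

seqs = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(keras.preprocessing.sequence.pad_sequences(seqs, maxlen=4))
# [[0 1 2 3]   <- shorter sequence is left-padded with zeros
#  [5 6 7 8]]  <- longer sequence is truncated (padding and truncating both default to 'pre')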

This isn't a big problem if your sequences are all around length 200 +/- 50, or even range from 90 to 500.

But how do you tackle a problem where sequence lengths range from 100 to 60,000?


Answer 1:


**There is a way to handle this in the LSTM architecture:**

  1. In your LSTM, set the timestep component of the input_shape argument to None; this lets the layer accept sequences of variable length.

  2. One problem remains: you have to fit the inputs into a NumPy array, which has a strict structure (every row the same length). So what I do is group the inputs into batches of the same length, make an array of each batch, and feed those to the network (see the sketch below).

Ex-

# timesteps = None -> the layer accepts sequences of any length;
# vocab_len features per step (e.g. one-hot encoded tokens)
lstm = tf.keras.layers.LSTM(latent_dim, input_shape=(None, vocab_len))

# pseudocode: iterate over groups of equal-length inputs
for ip in group_by_length(inputs):
    model.fit(ip, outputs_for(ip), epochs=100)
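
For completeness, here is a minimal runnable sketch of that idea (my own illustrative code, not from the original answer): it reuses vocab_size and the unpadded (x_train, y_train) from the question, and adds an Embedding layer so the integer-encoded reviews can feed the LSTM directly.

import numpy as np
from tensorflow import keras

latent_dim = 64
model = keras.Sequential([
    # No fixed input length: the time dimension stays None,
    # so the model accepts sequences of any length
    keras.layers.Embedding(vocab_size, 128),
    keras.layers.LSTM(latent_dim),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Group reviews by exact length so each group packs into a
# rectangular numpy array with no padding at all
by_length = {}
for seq, label in zip(x_train, y_train):
    by_length.setdefault(len(seq), []).append((seq, label))

for length, pairs in by_length.items():
    xs = np.array([s for s, _ in pairs])  # shape: (n, length)
    ys = np.array([l for _, l in pairs])
    model.fit(xs, ys, epochs=1, verbose=0)

In practice, grouping by exact length can leave many tiny batches, so a common refinement is to bucket into length ranges instead and pad only within each bucket.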

Please let me know if this works for your case; it works for me.



Source: https://stackoverflow.com/questions/63663399/how-to-handle-variable-length-data-for-lstm
