How to have an LSTM autoencoder model predict over the whole vocabulary while representing words as embeddings

Submitted by 社会主义新天地 on 2019-12-13 20:50:15

Question


So I have been working on an LSTM autoencoder model, and I have created various versions of it.

1. Create the model using already-trained word embeddings: in this scenario, I used the weights of pre-trained GloVe vectors as the weights of the features (the text data). This is the structure:

    # Imports shared by the three snippets below
    import os
    from keras.models import Model
    from keras.layers import Input, Embedding, LSTM, Bidirectional, RepeatVector, Lambda
    from keras.callbacks import ModelCheckpoint
    from keras import optimizers

    # Inputs are sequences of pre-computed GloVe vectors: (SEQUENCE_LEN, EMBED_SIZE)
    inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
    encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
    encoded = Lambda(rev_entropy)(encoded)  # rev_entropy is my custom function
    decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
    decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
    autoencoder = Model(inputs, decoded)
    autoencoder.compile(optimizer="sgd", loss='mse')
    autoencoder.summary()
    checkpoint = ModelCheckpoint(filepath='checkpoint/{epoch}.hdf5')
    history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, validation_data=test_gen, validation_steps=num_test_steps, callbacks=[checkpoint])
2. In the second scenario, I implemented the word-embedding layer inside the model itself.

This is the structure:

    inputs = Input(shape=(SEQUENCE_LEN, ), name="input")
    embedding = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE, input_length=SEQUENCE_LEN, trainable=False)(inputs)
    encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedding)
    decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
    decoded = LSTM(EMBED_SIZE, return_sequences=True)(decoded)
    autoencoder = Model(inputs, decoded)
    autoencoder.compile(optimizer="sgd", loss='categorical_crossentropy')
    autoencoder.summary()
    checkpoint = ModelCheckpoint(filepath=os.path.join('Data/', "simple_ae_to_compare"))
    history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, validation_steps=num_test_steps)
3. In the third scenario, I did not use any embedding technique but used one-hot encoding for the features. This is the structure of the model:

    inputs = Input(shape=(SEQUENCE_LEN, VOCAB_SIZE), name="input")
    encoded = Bidirectional(LSTM(LATENT_SIZE, kernel_initializer="glorot_normal"), merge_mode="sum", name="encoder_lstm")(inputs)
    encoded = Lambda(score_cooccurance, name='Modified_layer')(encoded)  # score_cooccurance is my custom function
    decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
    decoded = LSTM(VOCAB_SIZE, return_sequences=True)(decoded)
    autoencoder = Model(inputs, decoded)
    sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
    autoencoder.compile(optimizer=sgd, loss='categorical_crossentropy')
    autoencoder.summary()
    checkpoint = ModelCheckpoint(filepath='checkpoint/50/{epoch}.hdf5')
    history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, callbacks=[checkpoint])
    

    As you can see, in the first and second models Embed_size in the decoder is the number of neurons in that layer, which causes the output shape of the encoder layer to become [Latent_size, Embed_size].

    In the third model, the output shape of the encoder is [Latent_size, Vocab_size].

Now my question:

Is it doable to change the structure of the model so that I have embeddings to represent my words to the model, while at the same time having vocab_size in the decoder layer?

I need the output_shape of the encoder layer to be [Latent_size, Vocab_size], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

I would appreciate it if you could share your ideas with me. One idea could be adding more layers; bear in mind that whatever the cost, I don't want Embed_size in the last layer.


Answer 1:


Your questions:

Is it doable to change the structure of the model so that I have embeddings to represent my words to the model, while at the same time having vocab_size in the decoder layer?

I like to use the TensorFlow Transformer model as a reference: https://github.com/tensorflow/models/tree/master/official/transformer

In language-translation tasks the model input tends to be a token index, which is then subject to an embedding lookup, resulting in a shape of (sequence_length, embedding_dims); the encoder itself works on this shape. The decoder output also tends to be of shape (sequence_length, embedding_dims). The model above, for instance, then transforms the decoder output into logits by taking a dot product between the output and the embedding vectors. This is the transformation they use: https://github.com/tensorflow/models/blob/master/official/transformer/model/embedding_layer.py#L94
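
A minimal sketch of that shared-embedding projection (this is an illustration, not the transformer's own code; embedding_matrix is assumed to be the same (vocab_size, embedding_dims) matrix used for the input lookup):

    import tensorflow as tf

    def embeddings_to_logits(decoder_out, embedding_matrix):
        # decoder_out:      (batch, sequence_length, embedding_dims)
        # embedding_matrix: (vocab_size, embedding_dims)
        batch = tf.shape(decoder_out)[0]
        seq_len = tf.shape(decoder_out)[1]
        embed_dims = tf.shape(embedding_matrix)[1]
        vocab_size = tf.shape(embedding_matrix)[0]
        flat = tf.reshape(decoder_out, [-1, embed_dims])               # (batch*seq_len, embed)
        logits = tf.matmul(flat, embedding_matrix, transpose_b=True)   # (batch*seq_len, vocab)
        return tf.reshape(logits, [batch, seq_len, vocab_size])        # (batch, seq_len, vocab)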

I would recommend an approach similar to the language translation models:

  • pre-stage:
    • input_shape=(sequence_length, 1), i.e. a token_index in [0, vocab_size)
  • encoder:
    • input_shape=(sequence_length, embedding_dims)
    • output_shape=(latent_dims)
  • decoder:
    • input_shape=(latent_dims)
    • output_shape=(sequence_length, embedding_dims)

Pre-processing converts token indexes into embedding vectors of embedding_dims. This can be used to generate both the encoder input and the decoder targets.

Post-processing converts embedding_dims back to logits (in the vocab_index space).
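
A minimal Keras sketch of this layout (the sizes are placeholders, and a plain Dense projection stands in here for the tied-embedding dot product shown above):

    from keras.models import Model
    from keras.layers import Input, Embedding, Bidirectional, LSTM, RepeatVector, Dense

    VOCAB_SIZE, EMBED_SIZE, SEQUENCE_LEN, LATENT_SIZE = 5000, 100, 20, 64  # placeholder sizes

    inputs = Input(shape=(SEQUENCE_LEN,), name="input")                     # token indices
    embedded = Embedding(VOCAB_SIZE, EMBED_SIZE, name="embedding")(inputs)  # (seq_len, embed)
    encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedded)  # (latent,)
    decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
    decoded = LSTM(EMBED_SIZE, return_sequences=True, name="decoder_lstm")(decoded)              # (seq_len, embed)
    outputs = Dense(VOCAB_SIZE, activation="softmax", name="to_vocab")(decoded)                  # (seq_len, vocab)

    autoencoder = Model(inputs, outputs)
    # With a sparse loss the targets stay as integer token indices
    # (shape (batch, SEQUENCE_LEN) or (batch, SEQUENCE_LEN, 1)); no one-hot encoding of the vocabulary is needed.
    autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    autoencoder.summary()

Training then uses the same integer sequences as both input and reconstruction target.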

I need the output_shape of the encoder layer to be [Latent_size, Vocab_size], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

That doesn't sound right. Typically what one tries to achieve with an autoencoder is to have an embedding vector for the sentence, so the output of the encoder is typically [latent_dims]. The output of the decoder needs to be translatable into [sequence_length, vocab_index (1)], which is typically done by converting from embedding space to logits and then taking the argmax to convert to a token index.
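
For example, with the sketch above (x_batch is a hypothetical array of shape (batch, SEQUENCE_LEN) holding token indices):

    import numpy as np

    preds = autoencoder.predict(x_batch)   # (batch, SEQUENCE_LEN, VOCAB_SIZE) of per-token probabilities
    token_ids = np.argmax(preds, axis=-1)  # (batch, SEQUENCE_LEN) of predicted token indices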



Source: https://stackoverflow.com/questions/56938158/how-to-have-a-lstm-autoencoder-model-over-the-whole-vocab-prediction-while-prese
