I am trying to re-train a word2vec model in Keras 2 with the TensorFlow backend by using pre-trained embeddings and a custom corpus.
This is how I initialize the embeddings layer:
Instead of using the embeddings_initializer
argument of the Embedding layer, you can load pre-trained weights for your embedding layer with the weights
argument. This way you should be able to hand over pre-trained embeddings larger than 2 GB.
Here is a short example:
from keras.layers import Embedding

embedding_layer = Embedding(vocab_size,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
Here embedding_matrix
is just a regular NumPy matrix of shape (vocab_size, EMBEDDING_DIM) containing your pre-trained weights.
For further examples you can also take a look here:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
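As an illustration, here is a minimal sketch of how such an embedding_matrix could be built from pre-trained word2vec vectors. The gensim KeyedVectors file path and the word_index dict (normally tokenizer.word_index from a Keras Tokenizer) are assumptions for the example, not values from the question:

import numpy as np
from gensim.models import KeyedVectors

# hypothetical word index; normally tokenizer.word_index from keras.preprocessing.text.Tokenizer
word_index = {'hello': 1, 'world': 2}

# load pre-trained word2vec vectors with gensim (the path is only an example)
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
EMBEDDING_DIM = word_vectors.vector_size

vocab_size = len(word_index) + 1  # +1 because Keras reserves index 0 for padding
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word_vectors:
        embedding_matrix[i] = word_vectors[word]
    # words missing from the pre-trained vectors keep all-zero rows

The resulting embedding_matrix can then be passed to the Embedding layer exactly as in the snippet above.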
Edit:
As @PavlinMavrodiev (see end of question) correctly pointed out, the weights
argument is deprecated. He instead used the layer method set_weights to set the weights:
layer.set_weights(weights)
: sets the weights of the layer from a list of NumPy arrays (with the same shapes as the output of get_weights
).
To retrieve the trained weights, get_weights
can be used:
layer.get_weights()
: returns the weights of the layer as a list of NumPy arrays.
Both are methods of the Keras Layer base class and can be used for all Keras layers, including the embedding layer.
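As a minimal sketch of that approach (the sizes and the random embedding_matrix below are placeholders, not values from the question): build the Embedding layer without the weights argument, add it to a model so its weight variables exist, and then overwrite them with set_weights:

import numpy as np
from keras.layers import Embedding
from keras.models import Sequential

vocab_size = 1000              # placeholder vocabulary size
EMBEDDING_DIM = 100            # placeholder embedding dimension
MAX_SEQUENCE_LENGTH = 50       # placeholder sequence length
# stand-in for your real pre-trained matrix of shape (vocab_size, EMBEDDING_DIM)
embedding_matrix = np.random.rand(vocab_size, EMBEDDING_DIM)

embedding_layer = Embedding(vocab_size,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

model = Sequential()
model.add(embedding_layer)     # adding the layer builds it, so its weight variables now exist

# overwrite the randomly initialized weights with the pre-trained matrix
embedding_layer.set_weights([embedding_matrix])

# get_weights returns the same list of NumPy arrays
assert np.allclose(embedding_layer.get_weights()[0], embedding_matrix)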