How do I create a Keras Embedding layer from a pre-trained word embedding dataset?

Submitted by 孤者浪人 on 2019-12-04 16:41:29

You will need to pass an embeddingMatrix to the Embedding layer as follows:

Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

  • vocabLen: number of tokens in your vocabulary (plus one if index 0 is reserved for padding/masking)
  • embDim: dimension of the embedding vectors (50 in your example)
  • embeddingMatrix: embedding matrix built from glove.6B.50d.txt, with one row per token index
  • isTrainable: whether the embeddings should be trainable (True) or the layer should be frozen (False)
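
To make these arguments concrete, here is a minimal NumPy-only sketch of the lookup that an embedding layer performs: each integer index selects one row of the embedding matrix. The three-word vocabulary, the 4-dimensional vectors, and the name toyWordToIndex are made up for illustration and do not come from GloVe.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is left unused (reserved for padding).
toyWordToIndex = {"cat": 1, "dog": 2, "fish": 3}
vocabLen = len(toyWordToIndex) + 1  # one row per word, plus row 0
embDim = 4                          # made-up dimension for illustration

# Row i holds the vector of the word whose index is i; row 0 stays all zeros.
embeddingMatrix = np.zeros((vocabLen, embDim))
embeddingMatrix[1] = [0.1, 0.2, 0.3, 0.4]
embeddingMatrix[2] = [0.5, 0.6, 0.7, 0.8]
embeddingMatrix[3] = [0.9, 1.0, 1.1, 1.2]

# An embedding lookup is row selection by integer index.
sequence = np.array([2, 1, 0])        # "dog", "cat", padding
embedded = embeddingMatrix[sequence]  # shape: (3, embDim)
```

With weights=[embeddingMatrix], the layer starts from exactly these rows; trainable only decides whether they are updated during training.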

glove.6B.50d.txt is a plain-text file in which every line holds whitespace-separated values: a word token followed by its 50 embedding values, e.g. the 0.418 0.24968 -0.41242 ...
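
As a small illustration of that line format, the sketch below parses a single line into a token and a NumPy vector. The helper name parseGloveLine is hypothetical, and the sample line is truncated to three values; a real line in glove.6B.50d.txt carries 50.

```python
import numpy as np

def parseGloveLine(line):
    """Split one GloVe text line into (token, vector)."""
    record = line.strip().split()
    token = record[0]                                 # first field is the word
    vector = np.array(record[1:], dtype=np.float64)   # remaining fields are floats
    return token, vector

# Truncated sample line for illustration only.
token, vector = parseGloveLine("the 0.418 0.24968 -0.41242\n")
```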

To create a pretrained embedding layer from a GloVe file:

# Prepare Glove File
import numpy as np
from keras.layers import Embedding
from keras.models import Sequential

def readGloveFile(gloveFile):
    # GloVe files are UTF-8 text; state the encoding so the platform default is not used
    with open(gloveFile, 'r', encoding='utf-8') as f:
        wordToGlove = {}  # map from a token (word) to a Glove embedding vector
        wordToIndex = {}  # map from a token to an index
        indexToWord = {}  # map from an index to a token

        for line in f:
            record = line.strip().split()
            token = record[0]  # take the token (word) from the text line
            wordToGlove[token] = np.array(record[1:], dtype=np.float64)  # associate the Glove embedding vector with that token (word)

        tokens = sorted(wordToGlove.keys())
        for idx, tok in enumerate(tokens):
            kerasIdx = idx + 1  # start at 1: index 0 is reserved for padding/masking in Keras
            wordToIndex[tok] = kerasIdx  # associate an index with a token (word)
            indexToWord[kerasIdx] = tok  # associate a token (word) with an index. Note: inverse of dictionary above

    return wordToIndex, indexToWord, wordToGlove

# Create Pretrained Keras Embedding Layer
def createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, isTrainable):
    vocabLen = len(wordToIndex) + 1  # adding 1 to account for masking
    embDim = next(iter(wordToGlove.values())).shape[0]  # works with any glove dimensions (e.g. 50)

    embeddingMatrix = np.zeros((vocabLen, embDim))  # initialize with zeros
    for word, index in wordToIndex.items():
        embeddingMatrix[index, :] = wordToGlove[word] # create embedding: word index to Glove word embedding

    embeddingLayer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
    return embeddingLayer

# usage
wordToIndex, indexToWord, wordToGlove = readGloveFile("/path/to/glove.6B.50d.txt")
pretrainedEmbeddingLayer = createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, False)
model = Sequential()
model.add(pretrainedEmbeddingLayer)
...
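
The layer expects sequences of integer indices rather than raw text, so the input still has to be converted with wordToIndex. The following is only a sketch under stated assumptions: the helper sentenceToIndices, the maxLen parameter, lowercase whitespace tokenization, and mapping unknown words to 0 (the same index used for padding) are illustrative choices, not part of the answer above. A toy dictionary stands in for the real wordToIndex so the snippet runs on its own.

```python
import numpy as np

def sentenceToIndices(sentence, wordToIndex, maxLen):
    """Convert a sentence to a fixed-length array of token indices."""
    indices = np.zeros(maxLen, dtype=np.int64)  # 0 = padding
    tokens = sentence.lower().split()           # assumption: lowercase, whitespace-split tokens
    for pos, tok in enumerate(tokens[:maxLen]): # truncate sentences longer than maxLen
        indices[pos] = wordToIndex.get(tok, 0)  # assumption: unknown words share index 0
    return indices

# Hypothetical toy mapping standing in for the dictionary from readGloveFile.
toyWordToIndex = {"cat": 1, "sat": 2, "the": 3}
encoded = sentenceToIndices("The cat sat down", toyWordToIndex, maxLen=6)
# encoded -> [3, 1, 2, 0, 0, 0]
```

Sharing index 0 between padding and unknown words keeps the sketch short; a dedicated unknown-word index with its own row in the embedding matrix is a common alternative.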

There is a great blog post describing how to create an embedding layer from pre-trained word embeddings:

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

Code for the above article can be found here:

https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py

Another good blog post on the same topic: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
