Load pretrained GloVe vectors in Python

眼角桃花 2021-01-29 22:12

I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word vector binary file u

10 answers
  • 2021-01-29 22:32

    GloVe model files are in a word-vector format: each line holds a word followed by the components of its vector, separated by spaces. You can open the text file to verify this. Here is a small snippet of code you can use to load a pretrained GloVe file:

    import numpy as np

    def loadGloveModel(File):
        """Return a dict mapping each word to its embedding (NumPy array)."""
        print("Loading Glove Model")
        gloveModel = {}
        # GloVe files are UTF-8 text; `with` closes the file when done.
        with open(File, 'r', encoding='utf-8') as f:
            for line in f:
                splitLines = line.split()
                word = splitLines[0]
                wordEmbedding = np.array([float(value) for value in splitLines[1:]])
                gloveModel[word] = wordEmbedding
        print(len(gloveModel), " words loaded!")
        return gloveModel
    

    The function returns a dictionary, so you can look up a word vector by key once you have assigned the result to a variable such as gloveModel:

    print(gloveModel['hello'])
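
To try the word-vector format without downloading a full GloVe file, the sketch below writes a tiny made-up sample in the same layout and parses it. The names `load_glove`, `sample`, and `sample_path` are illustrative assumptions and not part of the answer above; real GloVe files have 50 to 300 dimensions rather than 3.

```python
import os
import tempfile

import numpy as np


def load_glove(path):
    """Parse a GloVe-style text file into a {word: vector} dict."""
    vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # First field is the word, the remaining fields are the vector.
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors


# Made-up 3-dimensional sample in the same "word v1 v2 ..." layout.
sample = "hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample)
    demo = load_glove(sample_path)

print(demo["hello"])         # the vector stored for "hello"
print(demo.get("missing"))   # .get() avoids a KeyError for unknown words
```

Indexing with `demo["missing"]` would raise a `KeyError`, so `.get()` is usually the safer lookup for words that may be outside the vocabulary.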

  • 2021-01-29 22:38

    Python 3 version that also handles bigrams and trigrams (entries whose word part contains spaces):

    import numpy as np
    
    
    def load_glove_model(glove_file, vector_size=300):
        print("Loading Glove Model")
        model = {}
        with open(glove_file, 'r', encoding='utf-8') as f:
            for line in f:
                split_line = line.split()
                # Everything before the last vector_size fields is the
                # (possibly multi-token) word.
                word = " ".join(split_line[:-vector_size])
                embedding = np.array([float(val) for val in split_line[-vector_size:]])
                model[word] = embedding
        print("Done.\n" + str(len(model)) + " words loaded!")
        return model
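
The key idea in the loader above is to count fields from the right: the last `vector_size` fields are the numbers, and whatever precedes them is the word, even if it contains spaces. A minimal sketch of that step on a single made-up line follows; `parse_glove_line` is a hypothetical helper name, and 3 dimensions are used only to keep the example short.

```python
import numpy as np


def parse_glove_line(line, vector_size):
    """Split one line into (word, vector), keeping spaces inside the word."""
    parts = line.rstrip("\n").split(" ")
    # The last vector_size fields are numbers; the rest is the word.
    word = " ".join(parts[:-vector_size])
    vector = np.array(parts[-vector_size:], dtype=np.float32)
    return word, vector


# Made-up bigram entry with a 3-dimensional vector.
word, vector = parse_glove_line("new york 0.1 0.2 0.3\n", vector_size=3)
print(repr(word), vector.shape)
```

This only works if `vector_size` matches the file, so a 100-dimensional file parsed with the default of 300 would produce wrong keys rather than an error.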
    
  • 2021-01-29 22:38

    Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes a while (147.2s on my machine).

    What helps is first converting the text file into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file that contains the embedding vectors as a NumPy array (e.g. embeddings.npy).

    Once converted, it takes me only 4.96s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as loading from the text file. Access is just as fast and no additional frameworks are required, but loading is a lot faster.

    With this code you convert your embedding text file to the two new files:

    import codecs
    import numpy as np

    def convert_to_binary(embedding_path):
        wv = []

        # Write the words to .vocab and collect the vectors for .npy.
        with codecs.open(embedding_path + ".txt", 'r', encoding='utf-8') as f, \
                codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
            for line in f:
                splitlines = line.split()
                vocab_write.write(splitlines[0].strip())
                vocab_write.write("\n")
                wv.append([float(val) for val in splitlines[1:]])

        np.save(embedding_path + ".npy", np.array(wv))
    

    And with this method you load it efficiently into your memory:

    import codecs
    import numpy as np

    def load_word_emb_binary(embedding_file_name_w_o_suffix):
        print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))
    
        with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
            index2word = [line.strip() for line in f_in]
    
        wv = np.load(embedding_file_name_w_o_suffix + '.npy')
        word_embedding_map = {}
        for i, w in enumerate(index2word):
            word_embedding_map[w] = wv[i]
    
        return word_embedding_map
    

    Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.
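
As a quick sanity check of the vocab-plus-npy idea, the sketch below round-trips a tiny made-up embedding table through a temporary directory and rebuilds the dictionary. It is an independent illustration and not the code from the linked post; the base name `embeddings` and the toy values are assumptions.

```python
import os
import tempfile

import numpy as np

# Made-up toy embeddings standing in for a real GloVe file.
words = ["hello", "world"]
matrix = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "embeddings")  # hypothetical base name

    # Write: one word per line, plus the vectors as a single .npy array.
    with open(base + ".vocab", "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")
    np.save(base + ".npy", matrix)

    # Read back: row i of the array belongs to line i of the vocab file.
    with open(base + ".vocab", "r", encoding="utf-8") as f:
        loaded_words = [line.rstrip("\n") for line in f]
    loaded_matrix = np.load(base + ".npy")

restored = dict(zip(loaded_words, loaded_matrix))
print(restored["world"])
```

The speed-up comes from `np.load` reading the numbers as raw binary data instead of parsing every float from text; the mapping from words to rows relies only on both files keeping the same order.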

  • Here's a one-liner if all you want is the embedding matrix:

    np.loadtxt(path, usecols=range(1, dim+1), comments=None)

    where path is the path to your downloaded GloVe file and dim is the dimension of the word embedding.

    If you want both the words and corresponding vectors you can do

    glove = np.loadtxt(path, dtype='str', comments=None)

    and separate the words and vectors as follows

    words = glove[:, 0]
    vectors = glove[:, 1:].astype('float')
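
Both `np.loadtxt` calls can be tried on an in-memory sample, since `loadtxt` also accepts file-like objects. The sketch below uses a made-up 3-dimensional sample that deliberately contains `#` as a word, which is why `comments=None` matters: with the default setting, `#` would start a comment. The variable names are illustrative assumptions.

```python
import io

import numpy as np

# Made-up 3-dimensional sample; "#" is included as a word on purpose.
sample = "hello 0.1 0.2 0.3\n# 0.4 0.5 0.6\n"
dim = 3

# Vectors only: skip column 0 (the word) and parse the rest as floats.
matrix = np.loadtxt(io.StringIO(sample), usecols=range(1, dim + 1), comments=None)

# Words and vectors: read everything as strings, then convert the numbers.
table = np.loadtxt(io.StringIO(sample), dtype=str, comments=None)
words = table[:, 0]
vectors = table[:, 1:].astype(float)

print(matrix.shape, list(words))
```

Reading the whole file as strings keeps words and numbers in one array, at the cost of holding every number as text before the conversion, which can use a lot of memory on the larger GloVe files.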
    
  • 2021-01-29 22:41

    I found this approach faster.

    import pandas as pd
    
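    # keep_default_na=False stops pandas from turning words such as "null" or "NA" into NaN
    df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0, keep_default_na=False)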
    glove = {key: val.values for key, val in df.T.items()}
    

    Save the dictionary:

    import pickle
    with open('glove.840B.300d.pkl', 'wb') as fp:
        pickle.dump(glove, fp)
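
Loading the saved dictionary back is the mirror image of the dump. The sketch below round-trips a tiny made-up dictionary through a temporary file; the file name `glove_demo.pkl` and the toy values are assumptions used only for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Made-up toy dictionary standing in for the real `glove` dict.
glove = {
    "hello": np.array([0.1, 0.2, 0.3]),
    "world": np.array([0.4, 0.5, 0.6]),
}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "glove_demo.pkl")  # hypothetical file name

    # Save in binary mode with the newest pickle protocol available.
    with open(pkl_path, "wb") as fp:
        pickle.dump(glove, fp, protocol=pickle.HIGHEST_PROTOCOL)

    # Load it back; the file must also be opened in binary mode.
    with open(pkl_path, "rb") as fp:
        reloaded = pickle.load(fp)

print(reloaded["hello"])
```

Unpickling can execute arbitrary code, so this is best reserved for pickle files you created yourself.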
    
  • 2021-01-29 22:42

    import os
    import numpy as np

    # Assumed value for illustration: glove.6B ships 50d, 100d, 200d and 300d files.
    EMBEDDING_DIM = 100

    # store all the pre-trained word vectors
    print('Loading word vectors...')
    word2vec = {}
    # enter the path where you unzipped the glove file
    with open(os.path.join('glove', 'glove.6B.%sd.txt' % EMBEDDING_DIM), encoding='utf-8') as f:
        # the file is just a space-separated text file in the format:
        # word vec[0] vec[1] vec[2] ...
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            word2vec[word] = vec
    print('Found %s word vectors.' % len(word2vec))
    