I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word vector binary file using gensim, but I don't know how to do it when the model is in text file format.
GloVe model files are in a word-vector text format: each line contains a word followed by its vector components, separated by spaces. You can open the text file to verify this. Here is a small snippet of code you can use to load a pretrained GloVe file:
import numpy as np

def loadGloveModel(File):
    print("Loading GloVe Model")
    gloveModel = {}
    with open(File, 'r', encoding='utf-8') as f:
        for line in f:
            splitLines = line.split()
            # the first token is the word; the rest are the vector components
            word = splitLines[0]
            wordEmbedding = np.array([float(value) for value in splitLines[1:]])
            gloveModel[word] = wordEmbedding
    print(len(gloveModel), "words loaded!")
    return gloveModel
You can then access a word vector by simply indexing into the returned dictionary:

print(gloveModel['hello'])
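For example, a minimal sketch (assuming the dictionary above has been loaded and both words exist in the vocabulary) that compares two words by cosine similarity:

import numpy as np

# assumes gloveModel was loaded with loadGloveModel(...) as above
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(gloveModel['king'], gloveModel['queen']))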
Python 3 version, which also handles multi-word tokens such as bigrams and trigrams (in some files the token itself contains spaces, so the word is taken to be everything except the last vector_size values on the line):
import numpy as np

def load_glove_model(glove_file):
    print("Loading GloVe Model")
    model = {}
    vector_size = 300  # dimensionality of the vectors in your file
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            # everything except the last vector_size tokens is the (possibly multi-word) token
            word = " ".join(split_line[0:len(split_line) - vector_size])
            embedding = np.array([float(val) for val in split_line[-vector_size:]])
            model[word] = embedding
    print("Done.\n" + str(len(model)) + " words loaded!")
    return model
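A quick usage sketch (the file name is just an example; use whichever 300-dimensional GloVe file you downloaded, and adjust vector_size if yours differs):

model = load_glove_model('glove.840B.300d.txt')
print(model['hello'].shape)  # (300,)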
Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes quite long (147.2 s on my machine).
What helps is converting the text file first into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file which contains the embedding vectors as a numpy structure (e.g. embeddings.npy).
Once converted, it takes me only 4.96 s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as if you loaded it from the text file. It is just as efficient in access time and does not require any additional frameworks, but is a lot faster to load.
With this code you can convert your embedding text file into the two new files:
import codecs
import numpy as np

def convert_to_binary(embedding_path):
    wv = []
    with codecs.open(embedding_path + ".txt", 'r', encoding='utf-8') as f, \
         codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
        for line in f:
            splitlines = line.split()
            # one word per line in the .vocab file ...
            vocab_write.write(splitlines[0].strip())
            vocab_write.write("\n")
            # ... and the corresponding vector in the same row of the .npy matrix
            wv.append([float(val) for val in splitlines[1:]])
    np.save(embedding_path + ".npy", np.array(wv))
And with this method you can load them efficiently back into memory:
def load_word_emb_binary(embedding_file_name_w_o_suffix):
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))

    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]

    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    # rebuild the word -> vector dictionary
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]
    return word_embedding_map
Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.
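Putting the two functions together, a minimal usage sketch (the base path glove.42B.300d is an assumption; pass your own path without the .txt suffix):

# one-time conversion: reads glove.42B.300d.txt, writes glove.42B.300d.vocab and glove.42B.300d.npy
convert_to_binary('glove.42B.300d')

# fast loading on every subsequent run
glove = load_word_emb_binary('glove.42B.300d')
print(len(glove), 'words loaded')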
Here's a one-liner if all you want is the embedding matrix:

np.loadtxt(path, usecols=range(1, dim + 1), comments=None)

where path is the path to your downloaded GloVe file and dim is the dimensionality of the word embeddings.
If you want both the words and the corresponding vectors, you can do

glove = np.loadtxt(path, dtype='str', comments=None)

and separate the words and vectors as follows:

words = glove[:, 0]
vectors = glove[:, 1:].astype('float')
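If you then want a word-to-vector dictionary like in the other answers, a one-line sketch building on the two arrays above:

glove_dict = dict(zip(words, vectors))
print(glove_dict['hello'])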
I found this approach faster.
import pandas as pd

# quoting=3 (csv.QUOTE_NONE) treats quote characters as ordinary tokens
df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in df.T.items()}
Save the dictionary:
import pickle

with open('glove.840B.300d.pkl', 'wb') as fp:
    pickle.dump(glove, fp)
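To load the dictionary back later (a sketch, assuming the pickle file written above):

import pickle

with open('glove.840B.300d.pkl', 'rb') as fp:
    glove = pickle.load(fp)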
import os
import numpy as np

EMBEDDING_DIM = 100  # set this to the dimensionality of the file you downloaded (50, 100, 200 or 300)

# store all the pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
with open(os.path.join('glove/glove.6B.%sd.txt' % EMBEDDING_DIM)) as f:  # enter the path where you unzipped the glove file
    # it is just a space-separated text file in the format:
    # word vec[0] vec[1] vec[2] ...
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))
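Once word2vec is loaded, a small usage sketch (most_similar is a hypothetical helper, not part of the answer above) for out-of-vocabulary-safe lookup and a brute-force nearest-neighbour search by cosine similarity:

import numpy as np

vec = word2vec.get('hello')  # .get returns None instead of raising KeyError for OOV words

def most_similar(word, word2vec, topn=5):
    # hypothetical helper: brute-force cosine similarity over the whole vocabulary
    query = word2vec[word]
    sims = {}
    for other, v in word2vec.items():
        if other == word:
            continue
        sims[other] = np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:topn]

print(most_similar('king', word2vec))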