I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file, and I am unable to load and access it. Loading and accessing a word vector binary file is easy, but I don't know how to do it when the file is in text format.
import numpy as np

EMBEDDING_FILE = 'path/to/your/glove.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# Each line of a GloVe .txt file is: token value_1 value_2 ... value_n
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))

all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# tokenizer, max_features and embed_size are assumed to come from your own
# preprocessing (e.g. a fitted Keras Tokenizer).
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

# Initialise out-of-vocabulary rows from the same distribution as the known vectors.
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
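The parsing step above can be sanity-checked on a tiny hand-written GloVe-style file; the file name, tokens and values below are made up purely for illustration:

```python
import numpy as np

# A tiny, fabricated GloVe-style file: one token followed by its vector per line.
with open("tiny_glove.txt", "w") as f:
    f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open("tiny_glove.txt"))

print(len(embeddings_index))          # 2 tokens parsed
print(embeddings_index["cat"].shape)  # (3,) -> each vector has 3 dimensions
```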
I suggest using gensim to do everything. It can read the file, and you also benefit from the many methods already implemented in this great package.
Suppose you generated GloVe vectors using the C program and that your "-save-file" parameter is "vectors". The GloVe executable will then generate two files, "vectors.bin" and "vectors.txt".
Use glove2word2vec to convert GloVe vectors in text format into the word2vec text format (in gensim 4.0+ you can skip this step and pass no_header=True to KeyedVectors.load_word2vec_format instead):
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="vectors.txt", word2vec_output_file="gensim_glove_vectors.txt")
Finally, read the word2vec txt into a gensim model using KeyedVectors:
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)
Now you can use gensim word2vec methods (for example, similarity) as you'd like.
This code takes some time to store the GloVe embeddings on a shelf, but loading them afterwards is much faster than with the other approaches.
import numpy as np
from contextlib import closing
import shelve

def store_glove_to_shelf(glove_file_path, shelf):
    print('Loading GloVe')
    with open(glove_file_path) as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            shelf[word] = vec

shelf_file_name = "glove_embeddings"
glove_file_path = "glove/glove.840B.300d.txt"

# Storing glove embeddings to shelf for faster load
with closing(shelve.open(shelf_file_name + '.shelf', 'c')) as shelf:
    store_glove_to_shelf(glove_file_path, shelf)
    print("Stored glove embeddings from {} to {}".format(glove_file_path, shelf_file_name + '.shelf'))

# To reuse the glove embeddings stored in shelf
with closing(shelve.open(shelf_file_name + '.shelf')) as embeddings_index:
    # USE embeddings_index here, which behaves like a dictionary
    print("Loaded glove embeddings from {}".format(shelf_file_name + '.shelf'))
    print("Found glove embeddings with {} words".format(len(embeddings_index)))
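The store-then-load round trip can be verified with a throwaway two-word file; the file names and values below are invented for the demo:

```python
import shelve
import numpy as np
from contextlib import closing

# Write a tiny fake GloVe file, store it to a shelf, then read it back.
with open("demo_glove.txt", "w") as f:
    f.write("hello 1.0 2.0\nworld 3.0 4.0\n")

with closing(shelve.open("demo_embeddings.shelf", "c")) as shelf:
    for line in open("demo_glove.txt"):
        values = line.split()
        shelf[values[0]] = np.asarray(values[1:], dtype="float32")

# Reopen and look vectors up without re-parsing the text file.
with closing(shelve.open("demo_embeddings.shelf")) as embeddings_index:
    print(len(embeddings_index))               # 2
    print(embeddings_index["world"].tolist())  # [3.0, 4.0]
```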
You can do it much faster with pandas:
import pandas as pd
import csv
words = pd.read_table(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
Then to get the vector for a word (note that DataFrame.as_matrix was removed in pandas 1.0; use to_numpy instead):
def vec(w):
    return words.loc[w].to_numpy()
And to find the closest word to a vector:
words_matrix = words.to_numpy()

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)
    i = np.argmin(delta)
    return words.iloc[i].name
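Putting the pandas pieces together, here is a self-contained sketch on a tiny made-up GloVe file (to_numpy replaces as_matrix, which was removed in pandas 1.0):

```python
import csv
import numpy as np
import pandas as pd

# A small, fabricated GloVe-style file to exercise the pandas loader.
with open("pandas_glove.txt", "w") as f:
    f.write("red 1.0 0.0\ngreen 0.0 1.0\nblue 0.1 0.9\n")

words = pd.read_table("pandas_glove.txt", sep=" ", index_col=0,
                      header=None, quoting=csv.QUOTE_NONE)

def vec(w):
    return words.loc[w].to_numpy()

words_matrix = words.to_numpy()

def find_closest_word(v):
    # Squared Euclidean distance from v to every row, then take the argmin.
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)
    i = np.argmin(delta)
    return words.iloc[i].name

print(vec("red"))                                 # [1. 0.]
print(find_closest_word(np.array([0.12, 0.88])))  # blue
```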