Gensim Word2Vec select minor set of word vectors from pretrained model


Question


I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.

The problem is that the embedding size is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer.

Is there a way to keep only the desired word vectors (including the corresponding indices!), based on a whitelist of words?


Answer 1:


Thanks to this answer (I've changed the code a little bit to make it better), you can use the following code to solve your problem.

Put your minor set of words in restricted_word_set (it can be either a list or a set), and let w2v be your model; here is the function:

import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    # Rebuilds every word-related attribute of a gensim 3.x
    # Word2VecKeyedVectors in place, keeping only whitelisted words.
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)  # re-point the entry at its new index
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)

WARNING: when you first create or load the model, vectors_norm is None, so the function will raise an error if you call it right away. vectors_norm only becomes a numpy.ndarray after the first norm-using operation, so run something like w2v.most_similar("cat") (or w2v.init_sims()) beforehand, so that vectors_norm is no longer None.

It rewrites, in place, all of the word-related attributes of the Word2VecKeyedVectors instance.

Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]

The same trick can also be used the other way around, to remove specific words: just whitelist everything you want to keep.




Answer 2:


There's no built-in feature that does exactly that, but it shouldn't require much code, and could be modeled on existing gensim code. A few possible alternative strategies:

  1. Load the full vectors, then save them in an easy-to-parse format, such as via .save_word2vec_format(..., binary=False). This format is nearly self-explanatory; write your own code to drop all lines from this file that aren't on your whitelist (being sure to update the leading-line declaration of entry-count). The existing source code for load_word2vec_format() & save_word2vec_format() may be instructive. You'll then have a subset file; a sketch follows after this list.

  2. Or, pretend you were going to train a new Word2Vec model, using your corpus-of-interest (with just the interesting words). But only create the model and do the build_vocab() step. Now you have an untrained model with random vectors, but just the right vocabulary. Grab the model's wv property, a KeyedVectors instance with that right vocabulary. Then separately load the oversized vector set, and for each word in the right-sized KeyedVectors, copy over the actual vector from the larger set. Then save the right-sized subset; see the second sketch below.

  3. Or, look at the (possibly broken since gensim 3.4) Word2Vec method intersect_word2vec_format(). It more or less tries to do what's described in (2) above: with an in-memory model that has the vocabulary you want, merge in just the overlapping words from another word2vec-format set on disk. It'll either work or provide the template for what you'd want to do; see the final sketch below.
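
Here's a minimal sketch of strategy (1). The file names and the whitelist are placeholders, and it assumes the vectors have been saved in the plain-text word2vec format:

from gensim.models import KeyedVectors

def filter_word2vec_text(in_path, out_path, whitelist):
    kept = []
    with open(in_path, encoding="utf8") as f:
        count, dims = f.readline().split()  # header line: "<entry-count> <dimensions>"
        for line in f:
            if line.split(" ", 1)[0] in whitelist:  # token before first space is the word
                kept.append(line)
    with open(out_path, "w", encoding="utf8") as f:
        f.write("%d %s\n" % (len(kept), dims))  # updated entry-count
        f.writelines(kept)

w2v.save_word2vec_format("full_vectors.txt", binary=False)
filter_word2vec_text("full_vectors.txt", "subset_vectors.txt", {"beer", "wine", "lagers"})
subset = KeyedVectors.load_word2vec_format("subset_vectors.txt", binary=False)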

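A sketch of strategy (2), assuming gensim 3.x attribute and parameter names; corpus_of_interest is a stand-in for your own corpus of token lists:

from gensim.models import Word2Vec, KeyedVectors

corpus_of_interest = [["beer", "wine"], ["computer", "python", "bash"]]  # placeholder corpus

model = Word2Vec(size=300, min_count=1)  # size must match the pretrained vectors
model.build_vocab(corpus_of_interest)    # right vocabulary, still-random vectors

big = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
for word in model.wv.vocab:
    if word in big.vocab:
        model.wv.vectors[model.wv.vocab[word].index] = big[word]  # copy the real vector

model.wv.save_word2vec_format("subset.bin", binary=True)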

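And strategy (3) in sketch form, using the gensim 3.x signature of intersect_word2vec_format(); per the caveat above, it may not work on newer releases:

from gensim.models import Word2Vec

corpus_of_interest = [["beer", "wine"], ["computer", "python", "bash"]]  # placeholder corpus

model = Word2Vec(size=300, min_count=1)
model.build_vocab(corpus_of_interest)
# Overwrites the random vectors of overlapping words with the pretrained
# ones; lockf=0.0 keeps them frozen during any later training.
model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin.gz",
                                binary=True, lockf=0.0)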

Source: https://stackoverflow.com/questions/50914729/gensim-word2vec-select-minor-set-of-word-vectors-from-pretrained-model
