word-embedding

Ensure that gensim generates the same Word2Vec model for different runs on the same data

爷,独闯天下 submitted on 2019-11-30 04:46:35
Question: The LDA model generates different topics every time I train on the same corpus; by setting np.random.seed(0), the LDA model will always be initialized and trained in exactly the same way. Is the same true for the Word2Vec models from gensim? By setting the random seed to a constant, would different runs on the same dataset produce the same model? Strangely, it already gives me the same vectors across different instances.
>>> from nltk.corpus import brown
>>> from gensim.models import
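For reference, a minimal sketch of forcing reproducible Word2Vec training in gensim (assuming gensim 4.x, where the size parameter is called vector_size; the Brown-corpus input mirrors the imports above). Full determinism also requires single-threaded training and a fixed PYTHONHASHSEED, so treat this as an illustration rather than the thread's definitive answer:

from nltk.corpus import brown
from gensim.models import Word2Vec

# seed fixes the RNG used to initialize the embedding matrix;
# workers=1 removes ordering differences introduced by multi-threaded training.
# Note: PYTHONHASHSEED must also be set in the shell before starting Python,
# because word hashes feed into the per-word seeding.
model_a = Word2Vec(brown.sents(), vector_size=100, seed=42, workers=1)
model_b = Word2Vec(brown.sents(), vector_size=100, seed=42, workers=1)

print((model_a.wv["the"] == model_b.wv["the"]).all())  # expected: True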

What is “unk” in the pretrained GloVe vector files (e.g. glove.6B.50d.txt)?

你说的曾经没有我的故事 submitted on 2019-11-29 06:03:47
I found "unk" token in the glove vector file glove.6B.50d.txt downloaded from https://nlp.stanford.edu/projects/glove/ . Its value is as follows: unk -0.79149 0.86617 0.11998 0.00092287 0.2776 -0.49185 0.50195 0.00060792 -0.25845 0.17865 0.2535 0.76572 0.50664 0.4025 -0.0021388 -0.28397 -0.50324 0.30449 0.51779 0.01509 -0.35031 -1.1278 0.33253 -0.3525 0.041326 1.0863 0.03391 0.33564 0.49745 -0.070131 -1.2192 -0.48512 -0.038512 -0.13554 -0.1638 0.52321 -0.31318 -0.1655 0.11909 -0.15115 -0.15621 -0.62655 -0.62336 -0.4215 0.41873 -0.92472 1.1049 -0.29996 -0.0063003 0.3954 Is it a token to be used

How to get word vectors from Keras Embedding Layer

若如初见. submitted on 2019-11-29 03:58:07
I'm currently working with a Keras model which has an embedding layer as its first layer. In order to visualize the relationships and similarity of words to each other, I need a function that returns the mapping of words to vectors for every element in the vocabulary (e.g. 'love' - [0.21, 0.56, ..., 0.65, 0.10]). Is there any way to do it? You can get the word embeddings by using the get_weights() method of the embedding layer (i.e. essentially the weights of an embedding layer are the embedding vectors):
# if you have access to the embedding layer explicitly
embeddings = embedding_layer.get
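To illustrate the idea end to end, a minimal sketch (the layer sizes, the build shape, and the word_index dictionary are all assumptions for the example; a real model would use its own tokenizer's word index):

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=10000, output_dim=50),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, 100))

embedding_layer = model.layers[0]
weights = embedding_layer.get_weights()[0]   # shape: (vocab_size, embedding_dim)

# word_index maps each word to its integer id, e.g. from a keras Tokenizer
word_index = {"love": 42, "hate": 17}        # hypothetical entries
word_vectors = {w: weights[i] for w, i in word_index.items() if i < weights.shape[0]}
print(word_vectors["love"].shape)            # (50,)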

Why are multiple model files created in gensim word2vec?

我与影子孤独终老i submitted on 2019-11-28 12:06:12
When I try to create a word2vec model (skip-gram with negative sampling) I receive 3 files as output, as follows: word2vec (file), word2vec.syn1neg.npy (NPY file), word2vec.wv.syn0.npy (NPY file). I am just wondering why this happens, as in my previous word2vec test examples I only received one model file (no .npy files). Please help me. Answer 1: Models with larger internal vector arrays can't be saved via Python 'pickle' to a single file, so beyond a certain threshold, the gensim save() method will store subsidiary arrays in separate files, using the more efficient raw format of numpy arrays (the .npy format). You
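For illustration, a minimal sketch of the save/load round trip (assuming gensim 4.x; the toy corpus, sizes, and filename are made up for the example). The companion .npy files are found automatically on load as long as they stay next to the main file under their original names:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["word", "embeddings", "are", "useful"]]
# sg=1 selects skip-gram, negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=100, sg=1, negative=5, min_count=1)

# save() may write companion .npy files next to the main file once the
# internal arrays grow past the size gensim is willing to pickle inline
model.save("word2vec")

# load() picks the companion files up again automatically
reloaded = Word2Vec.load("word2vec")
print(reloaded.wv["hello"].shape)  # (100,)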

What does the tf.nn.embedding_lookup function do?

时光毁灭记忆、已成空白 submitted on 2019-11-27 16:39:22
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)
I cannot understand the purpose of this function. Is it like a lookup table, i.e. does it return the parameters corresponding to each id (in ids)? For instance, in the skip-gram model, if we use tf.nn.embedding_lookup(embeddings, train_inputs), does it find the corresponding embedding for each train_input? Rafał Józefowicz: The embedding_lookup function retrieves rows of the params tensor. The behavior is similar to using indexing with arrays in numpy, e.g.
matrix = np.random.random([1024, 64])  # 64-dimensional embeddings
ids =
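To make the comparison concrete, a minimal sketch that runs the lookup side by side with plain row indexing (TF 2.x eager mode assumed; the 1024x64 shape mirrors the answer's example):

import numpy as np
import tensorflow as tf

matrix = np.random.random([1024, 64]).astype("float32")  # 1024 embeddings, 64-dimensional
ids = np.array([0, 5, 17, 33])

looked_up = tf.nn.embedding_lookup(matrix, ids)  # rows 0, 5, 17 and 33 of the matrix
indexed = matrix[ids]                            # the same rows via numpy indexing

print(looked_up.shape)                           # (4, 64)
print(np.allclose(looked_up.numpy(), indexed))   # True: both select the same rows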

How does a Keras 1D convolution layer work with word embeddings - text classification problem? (Filters, kernel size, and all hyperparameters)

倾然丶 夕夏残阳落幕 submitted on 2019-11-27 14:33:39
I am currently developing a text classification tool using Keras. It works (it works fine and I got up to 98.7% validation accuracy) but I can't wrap my head around how exactly a 1D convolution layer works with text data. What hyperparameters should I use? I have the following input data:
Maximum words in a sentence: 951 (if a sentence is shorter, padding is added)
Vocabulary size: ~32,000
Number of sentences (for training): 9,800
embedding_vecor_length: 32 (the dimensionality of each word's embedding vector)
batch_size: 37 (it doesn't matter for this question)
Number of labels
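As an illustration of how such a network is typically assembled, a minimal sketch (the Embedding dimensions mirror the numbers in the question; the filter count, kernel size, pooling, and the binary output layer are illustrative assumptions, not the asker's actual model):

from tensorflow import keras

max_len = 951                  # maximum words in a sentence (shorter ones are padded)
vocab_size = 32000
embedding_vecor_length = 32

model = keras.Sequential([
    keras.Input(shape=(max_len,)),
    # each word id becomes a 32-dimensional vector -> output shape (batch, 951, 32)
    keras.layers.Embedding(vocab_size, embedding_vecor_length),
    # 64 filters, each sliding over 5 consecutive word vectors at a time
    keras.layers.Conv1D(filters=64, kernel_size=5, activation="relu"),
    # keep only each filter's strongest response across the whole sentence
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),  # binary labels assumed here
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()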

Update only part of the word embedding matrix in Tensorflow

拥有回忆 submitted on 2019-11-27 07:08:55
Assuming that I want to update a pre-trained word-embedding matrix during training, is there a way to update only a subset of the word embedding matrix? I have looked into the TensorFlow API page and found this:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads
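One common workaround, sketched below as an illustration rather than as the thread's accepted answer: keep the pre-trained rows in a non-trainable variable, keep the rows you want to fine-tune in a trainable one, and concatenate them before the lookup, so the optimizer only ever touches the trainable part. TF 1.x graph-mode style to match the snippet above; the shapes, split point, and the toy loss are assumptions:

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

pretrained = np.random.randn(9000, 50).astype("float32")  # stand-in for loaded GloVe/word2vec rows

frozen_part = tf.get_variable(
    "frozen_embeddings", initializer=pretrained, trainable=False)
tuned_part = tf.get_variable(
    "tuned_embeddings", shape=[1000, 50],  # e.g. new or rare words to fine-tune
    initializer=tf.random_uniform_initializer(-0.05, 0.05), trainable=True)

embedding_matrix = tf.concat([frozen_part, tuned_part], axis=0)  # (10000, 50)

word_ids = tf.placeholder(tf.int32, shape=[None])
vectors = tf.nn.embedding_lookup(embedding_matrix, word_ids)

# any optimizer's minimize() now updates only `tuned_part`, because the frozen
# rows are excluded from the trainable-variables collection
loss = tf.reduce_sum(vectors)  # placeholder loss just to make the example runnable
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)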

How does mask_zero in Keras Embedding layer work?

╄→гoц情女王★ submitted on 2019-11-26 23:03:16
Question: I thought mask_zero=True would output 0's when the input value is 0, so that the following layers could skip computation or something. How does mask_zero work? Example:
data_in = np.array([[1, 2, 0, 0]])
data_in.shape
>>> (1, 4)

# model
x = Input(shape=(4,))
e = Embedding(5, 5, mask_zero=True)(x)
m = Model(inputs=x, outputs=e)
p = m.predict(data_in)
print(p.shape)
print(p)
The actual output is (the numbers are random):
(1, 4, 5)
[[[ 0.02499047 0.04617121 0.01586803 0.0338897 0.009652 ]
[ 0
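To illustrate what the flag actually does, a minimal sketch based on the Keras masking documentation (an assumption, not the thread's answer): the embedding output for id 0 is not zeroed out; instead a boolean mask is attached to the output and consumed by mask-aware downstream layers such as LSTM, which skip the masked timesteps:

import numpy as np
from tensorflow import keras

data_in = np.array([[1, 2, 0, 0]])

inputs = keras.Input(shape=(4,))
embedding_layer = keras.layers.Embedding(5, 5, mask_zero=True)
embedded = embedding_layer(inputs)
model = keras.Model(inputs, embedded)

print(model.predict(data_in)[0, 2])           # id 0 still maps to a (random) non-zero vector
print(embedding_layer.compute_mask(data_in))  # [[ True  True False False]]

# a mask-aware layer placed after the embedding receives the mask automatically
# and ignores the masked positions when it processes the sequence
lstm_out = keras.layers.LSTM(3)(embedded)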