word-embedding

Ensure that gensim generates the same Word2Vec model for different runs on the same data

爷,独闯天下 submitted on 2019-11-30 04:46:35
Question: The LDA model generates different topics every time I train on the same corpus; by setting np.random.seed(0), the LDA model will always be initialized and trained in exactly the same way. Is the same true for the Word2Vec models from gensim? By setting the random seed to a constant, would different runs on the same dataset produce the same model? Strangely, it already gives me the same vectors across different instances.
>>> from nltk.corpus import brown
>>> from gensim.models import
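For reference, a minimal sketch of forcing reproducible Word2Vec training in gensim (assuming gensim 4.x, where the size parameter is called vector_size; the Brown-corpus input mirrors the imports above). Full determinism also requires single-threaded training and a fixed PYTHONHASHSEED, so treat this as an illustration rather than the thread's definitive answer:

from nltk.corpus import brown
from gensim.models import Word2Vec

# seed fixes the RNG used to initialize the embedding matrix;
# workers=1 removes ordering differences introduced by multi-threaded training.
# Note: PYTHONHASHSEED must also be set in the shell before starting Python,
# because word hashes feed into the per-word seeding.
model_a = Word2Vec(brown.sents(), vector_size=100, seed=42, workers=1)
model_b = Word2Vec(brown.sents(), vector_size=100, seed=42, workers=1)

print((model_a.wv["the"] == model_b.wv["the"]).all())  # expected: True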

What is “unk” in the pretrained GloVe vector files (e.g. glove.6B.50d.txt)?

你说的曾经没有我的故事 submitted on 2019-11-29 06:03:47
I found "unk" token in the glove vector file glove.6B.50d.txt downloaded from https://nlp.stanford.edu/projects/glove/ . Its value is as follows: unk -0.79149 0.86617 0.11998 0.00092287 0.2776 -0.49185 0.50195 0.00060792 -0.25845 0.17865 0.2535 0.76572 0.50664 0.4025 -0.0021388 -0.28397 -0.50324 0.30449 0.51779 0.01509 -0.35031 -1.1278 0.33253 -0.3525 0.041326 1.0863 0.03391 0.33564 0.49745 -0.070131 -1.2192 -0.48512 -0.038512 -0.13554 -0.1638 0.52321 -0.31318 -0.1655 0.11909 -0.15115 -0.15621 -0.62655 -0.62336 -0.4215 0.41873 -0.92472 1.1049 -0.29996 -0.0063003 0.3954 Is it a token to be used

How to get word vectors from Keras Embedding Layer

若如初见. submitted on 2019-11-29 03:58:07
I'm currently working with a Keras model which has an embedding layer as its first layer. In order to visualize the relationships and similarity of words to each other, I need a function that returns the mapping of words to vectors for every element in the vocabulary (e.g. 'love' - [0.21, 0.56, ..., 0.65, 0.10]). Is there any way to do it? You can get the word embeddings by using the get_weights() method of the embedding layer (i.e. essentially the weights of an embedding layer are the embedding vectors):
# if you have access to the embedding layer explicitly
embeddings = embedding_layer.get
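To illustrate the idea end to end, a minimal sketch (the layer sizes, the build shape, and the word_index dictionary are all assumptions for the example; a real model would use its own tokenizer's word index):

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=10000, output_dim=50),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, 100))

embedding_layer = model.layers[0]
weights = embedding_layer.get_weights()[0]   # shape: (vocab_size, embedding_dim)

# word_index maps each word to its integer id, e.g. from a keras Tokenizer
word_index = {"love": 42, "hate": 17}        # hypothetical entries
word_vectors = {w: weights[i] for w, i in word_index.items() if i < weights.shape[0]}
print(word_vectors["love"].shape)            # (50,)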

Why are multiple model files created in gensim word2vec?

我与影子孤独终老i submitted on 2019-11-28 12:06:12
When I try to create a word2vec model (skip-gram with negative sampling) I receive 3 files as output, as follows: word2vec (file), word2vec.syn1neg.npy (NPY file), word2vec.wv.syn0.npy (NPY file). I am just wondering why this happens, as in my previous word2vec test examples I only received one model file (no .npy files). Please help me. Answer 1: Models with larger internal vector arrays can't be saved via Python 'pickle' to a single file, so beyond a certain threshold, the gensim save() method will store subsidiary arrays in separate files, using the more efficient raw format of numpy arrays (the .npy format). You
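For illustration, a minimal sketch of the save/load round trip (assuming gensim 4.x; the toy corpus, sizes, and filename are made up for the example). The companion .npy files are found automatically on load as long as they stay next to the main file under their original names:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["word", "embeddings", "are", "useful"]]
# sg=1 selects skip-gram, negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=100, sg=1, negative=5, min_count=1)

# save() may write companion .npy files next to the main file once the
# internal arrays grow past the size gensim is willing to pickle inline
model.save("word2vec")

# load() picks the companion files up again automatically
reloaded = Word2Vec.load("word2vec")
print(reloaded.wv["hello"].shape)  # (100,)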

What does the tf.nn.embedding_lookup function do?

时光毁灭记忆、已成空白 submitted on 2019-11-27 16:39:22
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)
I cannot understand the purpose of this function. Is it like a lookup table, i.e. does it return the parameters corresponding to each id (in ids)? For instance, in the skip-gram model, if we use tf.nn.embedding_lookup(embeddings, train_inputs), does it find the corresponding embedding for each train_input? Rafał Józefowicz: The embedding_lookup function retrieves rows of the params tensor. The behavior is similar to using indexing with arrays in numpy, e.g.
matrix = np.random.random([1024, 64])  # 64-dimensional embeddings
ids =
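To make the comparison concrete, a minimal sketch that runs the lookup side by side with plain row indexing (TF 2.x eager mode assumed; the 1024x64 shape mirrors the answer's example):

import numpy as np
import tensorflow as tf

matrix = np.random.random([1024, 64]).astype("float32")  # 1024 embeddings, 64-dimensional
ids = np.array([0, 5, 17, 33])

looked_up = tf.nn.embedding_lookup(matrix, ids)  # rows 0, 5, 17 and 33 of the matrix
indexed = matrix[ids]                            # the same rows via numpy indexing

print(looked_up.shape)                           # (4, 64)
print(np.allclose(looked_up.numpy(), indexed))   # True: both select the same rows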

How does a Keras 1D convolution layer work with word embeddings - text classification problem? (Filters, kernel size, and all hyperparameters)

倾然丶 夕夏残阳落幕 submitted on 2019-11-27 14:33:39
I am currently developing a text classification tool using Keras. It works (it works fine and I got up to 98.7% validation accuracy) but I can't wrap my head around how exactly a 1D convolution layer works with text data. What hyperparameters should I use? I have the following input data:
Maximum words in a sentence: 951 (if a sentence is shorter, padding is added)
Vocabulary size: ~32,000
Number of sentences (for training): 9,800
embedding_vecor_length: 32 (the dimensionality of each word's embedding vector)
batch_size: 37 (it doesn't matter for this question)
Number of labels
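As an illustration of how such a network is typically assembled, a minimal sketch (the Embedding dimensions mirror the numbers in the question; the filter count, kernel size, pooling, and the binary output layer are illustrative assumptions, not the asker's actual model):

from tensorflow import keras

max_len = 951                  # maximum words in a sentence (shorter ones are padded)
vocab_size = 32000
embedding_vecor_length = 32

model = keras.Sequential([
    keras.Input(shape=(max_len,)),
    # each word id becomes a 32-dimensional vector -> output shape (batch, 951, 32)
    keras.layers.Embedding(vocab_size, embedding_vecor_length),
    # 64 filters, each sliding over 5 consecutive word vectors at a time
    keras.layers.Conv1D(filters=64, kernel_size=5, activation="relu"),
    # keep only each filter's strongest response across the whole sentence
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),  # binary labels assumed here
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()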

Update only part of the word embedding matrix in Tensorflow

拥有回忆 submitted on 2019-11-27 07:08:55
Assuming that I want to update a pre-trained word-embedding matrix during training, is there a way to update only a subset of the word embedding matrix? I have looked into the TensorFlow API page and found this:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads
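One common workaround, sketched below as an illustration rather than as the thread's accepted answer: keep the pre-trained rows in a non-trainable variable, keep the rows you want to fine-tune in a trainable one, and concatenate them before the lookup, so the optimizer only ever touches the trainable part. TF 1.x graph-mode style to match the snippet above; the shapes, split point, and the toy loss are assumptions:

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

pretrained = np.random.randn(9000, 50).astype("float32")  # stand-in for loaded GloVe/word2vec rows

frozen_part = tf.get_variable(
    "frozen_embeddings", initializer=pretrained, trainable=False)
tuned_part = tf.get_variable(
    "tuned_embeddings", shape=[1000, 50],  # e.g. new or rare words to fine-tune
    initializer=tf.random_uniform_initializer(-0.05, 0.05), trainable=True)

embedding_matrix = tf.concat([frozen_part, tuned_part], axis=0)  # (10000, 50)

word_ids = tf.placeholder(tf.int32, shape=[None])
vectors = tf.nn.embedding_lookup(embedding_matrix, word_ids)

# any optimizer's minimize() now updates only `tuned_part`, because the frozen
# rows are excluded from the trainable-variables collection
loss = tf.reduce_sum(vectors)  # placeholder loss just to make the example runnable
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)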

How does mask_zero in Keras Embedding layer work?

╄→гoц情女王★ submitted on 2019-11-26 23:03:16
Question: I thought mask_zero=True would output 0's when the input value is 0, so that the following layers could skip computation or something. How does mask_zero work? Example:
data_in = np.array([[1, 2, 0, 0]])
data_in.shape
>>> (1, 4)

# model
x = Input(shape=(4,))
e = Embedding(5, 5, mask_zero=True)(x)
m = Model(inputs=x, outputs=e)
p = m.predict(data_in)
print(p.shape)
print(p)
The actual output is (the numbers are random):
(1, 4, 5)
[[[ 0.02499047 0.04617121 0.01586803 0.0338897 0.009652 ]
[ 0
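To illustrate what the flag actually does, a minimal sketch based on the Keras masking documentation (an assumption, not the thread's answer): the embedding output for id 0 is not zeroed out; instead a boolean mask is attached to the output and consumed by mask-aware downstream layers such as LSTM, which skip the masked timesteps:

import numpy as np
from tensorflow import keras

data_in = np.array([[1, 2, 0, 0]])

inputs = keras.Input(shape=(4,))
embedding_layer = keras.layers.Embedding(5, 5, mask_zero=True)
embedded = embedding_layer(inputs)
model = keras.Model(inputs, embedded)

print(model.predict(data_in)[0, 2])           # id 0 still maps to a (random) non-zero vector
print(embedding_layer.compute_mask(data_in))  # [[ True  True False False]]

# a mask-aware layer placed after the embedding receives the mask automatically
# and ignores the masked positions when it processes the sequence
lstm_out = keras.layers.LSTM(3)(embedded)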