Initialising seq2seq embedding with pretrained word2vec

野的像风 2021-02-09 03:43

I am interested in initialising the TensorFlow seq2seq implementation with pretrained word2vec embeddings.

I have seen the code. It seems the embedding is initialized with
2 answers
  • 2021-02-09 04:06

    I think you've gotten your answer in the mailing list but I am putting it here for posterity.

    https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/bH6S98NpIJE

    You can initialize it randomly and afterwards do: session.run(embedding.assign(my_word2vec_matrix))

    This will override the init values.

    This seems to work for me. I believe trainable=False is needed to keep the values fixed?

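    import tensorflow as tf  # TensorFlow 1.x API (sessions)
    from gensim.models import KeyedVectors

    # load word2vec model with gensim; FILENAME is the path to a word2vec binary file
    model = KeyedVectors.load_word2vec_format(FILENAME, binary=True)

    # embedding matrix (exposed as model.syn0 in older gensim releases)
    X = model.vectors
    print(type(X)) # numpy.ndarray
    print(X.shape) # (vocab_size, embedding_dim)

    # start interactive session
    sess = tf.InteractiveSession()

    # set embeddings; trainable=False leaves the variable out of the default trainable collection
    embeddings = tf.Variable(tf.random_uniform(X.shape, minval=-0.1, maxval=0.1), trainable=False)

    # initialize (replaces the deprecated tf.initialize_all_variables())
    sess.run(tf.global_variables_initializer())

    # override inits
    sess.run(embeddings.assign(X))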
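
For the lookup to be meaningful, row i of the assigned matrix should hold the vector of the word whose id in the seq2seq vocabulary is i, and word2vec's own row order will generally not match that. Below is a minimal sketch of building such an aligned matrix with NumPy. The function name `build_embedding_matrix`, the toy vectors, and the vocabulary are hypothetical; a plain dict stands in for the pretrained model, on the assumption that the real model object supports the same `in` and `[]` access (gensim's `KeyedVectors` does).

```python
import numpy as np

def build_embedding_matrix(vocab, word_vectors, embedding_dim, seed=0):
    """Return a (len(vocab), embedding_dim) float32 matrix aligned with vocab.

    vocab: list of words; the list position is the model's token id.
    word_vectors: mapping from word to a 1-D vector (a dict here).
    """
    rng = np.random.RandomState(seed)
    # Words missing from word2vec keep a small random vector, using the
    # same uniform range as the random init in the answer above.
    matrix = rng.uniform(-0.1, 0.1, size=(len(vocab), embedding_dim))
    matrix = matrix.astype(np.float32)
    found = 0
    for idx, word in enumerate(vocab):
        if word in word_vectors:
            matrix[idx] = word_vectors[word]
            found += 1
    return matrix, found

# Toy stand-in for a pretrained word2vec model (hypothetical values).
toy_vectors = {
    "hello": np.array([1.0, 2.0, 3.0], dtype=np.float32),
    "world": np.array([4.0, 5.0, 6.0], dtype=np.float32),
}
vocab = ["_PAD", "_UNK", "hello", "world"]
X, found = build_embedding_matrix(vocab, toy_vectors, embedding_dim=3)
```

The resulting `X` can then be passed to `embeddings.assign(X)`, provided the variable was created with the same shape.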
    
  • 2021-02-09 04:06

    You can change the tokenizer in tensorflow/models/rnn/translate/data_utils.py to use a pre-trained word2vec model for tokenizing. Lines 187-190 of data_utils.py:

    if tokenizer:
        words = tokenizer(sentence)
    else:
        words = basic_tokenizer(sentence)
    

    use basic_tokenizer when no tokenizer is passed in. You can write a tokenizer function that uses a pre-trained word2vec model to tokenize the sentences.
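
One way such a tokenizer might look is sketched below. It only relies on what the snippet above shows: the tokenizer is called with a sentence and returns a list of words. Everything else is an assumption made for illustration: the factory name `make_word2vec_tokenizer`, the punctuation pattern, the `_UNK` placeholder, and the use of a plain set in place of the word2vec vocabulary.

```python
import re

# Hypothetical punctuation splitter; not taken from data_utils.py.
_PUNCT_SPLIT = re.compile(r"([.,!?\"':;)(])")

def make_word2vec_tokenizer(known_words, unk_token="_UNK"):
    """Build a tokenizer(sentence) -> list of tokens.

    known_words: any container supporting `in` (here a set standing in
        for the vocabulary of a pretrained word2vec model).
    """
    def tokenizer(sentence):
        tokens = []
        for fragment in sentence.strip().split():
            for piece in _PUNCT_SPLIT.split(fragment):
                if not piece:
                    continue  # re.split yields empty strings at the edges
                if piece in known_words:
                    tokens.append(piece)
                elif piece.lower() in known_words:
                    tokens.append(piece.lower())  # fall back to lowercase
                else:
                    tokens.append(unk_token)  # word has no pretrained vector
        return tokens
    return tokenizer

tokenizer = make_word2vec_tokenizer({"hello", "world", ","})
tokens = tokenizer("Hello, brave world!")
print(tokens)  # ['hello', ',', '_UNK', 'world', '_UNK']
```

A tokenizer like this only controls which tokens appear in the data; the embedding values themselves would still need to be loaded separately, for example with the assign approach from the other answer.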
