问题
I am a bit new to gensim and right now I am trying to solve the problem which involves using the doc2vec embeddings in keras. I wasn't able to find existing implementation of doc2vec in keras - as far as I see in all examples I found so far everyone just uses the gensim to get the document embeddings.
Once I trained my doc2vec model in gensim I need to export embeddings weights from genim into keras somehow and it is not really clear on how to do that. I see that
model.syn0
Supposedly gives the word2vec embedding weights (according to this). But it is unclear how to do the same export for document embeddings. Any advise?
I know that in general I can just get the embeddings for each document directly from gensim model but I want to fine-tune the embedding layer in keras later on, since doc embeddings will be used as a part of a larger task hence they might be fine-tuned a bit.
回答1:
I figured this out.
Assuming you already trained the gensim model and used string tags as document ids:
#get vector of doc
model.docvecs['2017-06-24AEON']
#raw docvectors (all of them)
model.docvecs.doctag_syn0
#docvector names in model
model.docvecs.offset2doctag
You can export this doc vectors into keras embedding layer as below, assuming your DataFrame df has all of the documents out there. Notice that in the embedding matrix you need to pass only integers as inputs. I use raw number in dataframe as the id of the doc for input. Also notice that embedding layer requires to not touch index 0 - it is reserved for masking, so when I pass the doc id as input to my network I need to ensure it is >0
#creating embedding matrix
embedding_matrix = np.zeros((len(df)+1, text_encode_dim))
for i, row in df.iterrows():
embedding = modelDoc2Vec.docvecs[row['docCode']]
embedding_matrix[i+1] = embedding
#input with id of document
doc_input = Input(shape=(1,),dtype='int16', name='doc_input')
#embedding layer intialized with the matrix created earlier
embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1,weights=[embedding_matrix], input_length=1, trainable=False)(doc_input)
UPDATE
After late 2017, with the introduction of Keras 2.0 API very last line should be changed to:
embedded_doc_input = Embedding(output_dim=text_encode_dim, input_dim=len(df)+1,embeddings_initializer=Constant(embedding_matrix), input_length=1, trainable=False)(doc_input)
来源:https://stackoverflow.com/questions/48999199/export-gensim-doc2vec-embeddings-into-separate-file-to-use-with-keras-embedding