I am trying to calculate the similarity between a query and a set of documents. First, I used the RAKE library to extract keywords from the crawled jobs. Then I put the keywords of each job into a separate array and combined all those arrays into documentArray.
documentArray = [
    ['Anger command,Assertiveness,Approachability,Adaptability,Authenticity,Aggressiveness,Analytical thinking,Molecular Biology,Molecular Biology,Molecular Biology,molecular biology,molecular biology,Master,English,Molecular Biology,,Islamabad,Islamabad District,Islamabad Capital Territory,Pakistan,,Rawalpindi,Rawalpindi,Punjab,Pakistan'],
    ['competitive compensation,assay design,positive attitude,regular basis,motivate others,meetings related,improve state,travel on,phd degree,meeting abstracts,benefits package,daily basis,scientific papers,application notes'],
]
queryStr = "In Vitro,Biochemistry,PCR,Western Blotting,Neuroscience,Molecular Biology,Cell biology,Immunohistochemistry,Microscopy,Animal Models,Presentations,Immunoprecipitation,Cell biology,Master's Degree,Bachelor's Degree,,,,,"
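For reference, here is a minimal sketch of how comma-separated keyword strings like the ones above could be split into token lists, which is the input shape gensim's `corpora.Dictionary` works with (one list of tokens per document). The helper name `split_keywords` and the shortened sample strings are illustrative assumptions, not part of my actual code. It also lists which query keywords appear in the document vocabulary at all, which may be a useful sanity check before building any model.

```python
def split_keywords(keyword_string):
    """Split a comma-separated keyword string into lowercase tokens, dropping blanks."""
    return [kw.strip().lower() for kw in keyword_string.split(",") if kw.strip()]


# Shortened versions of the data above (illustrative only).
documents = [
    "Assertiveness,Analytical thinking,Molecular Biology,molecular biology,,Islamabad",
    "competitive compensation,assay design,phd degree,scientific papers",
]
query = "In Vitro,Biochemistry,PCR,Molecular Biology,Cell biology,,,"

# One token list per document, and one for the query.
texts = [split_keywords(doc) for doc in documents]
query_tokens = split_keywords(query)

# Query keywords that also occur somewhere in the documents.
vocabulary = {token for text in texts for token in text}
shared = [token for token in query_tokens if token in vocabulary]

print(texts[0])
print(shared)
```

If `shared` comes back empty, the query has no keywords in common with the documents, so any bag-of-words comparison built on the same vocabulary would have nothing to match on.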
Then I wrote the following Gensim code:
from gensim import corpora, models, similarities


class Gensim:
    def __init__(self):
        print("Init")

    def calculateGensimSimilarity(self, texts, query):
        # Map each token to an integer id and convert every document to bag-of-words.
        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(text) for text in texts]

        # Train two-topic LSI and LDA models on the same corpus.
        lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

        # Index the documents in each topic space for similarity queries.
        index_lsi = similarities.MatrixSimilarity(lsi[corpus])
        index_lda = similarities.MatrixSimilarity(lda[corpus])

        # Project the query into both topic spaces.
        vec_bow = dictionary.doc2bow(query.lower().split())
        vec_lsi = lsi[vec_bow]
        vec_lda = lda[vec_bow]

        print("LSI Model")
        sims_lsi = index_lsi[vec_lsi]
        print(sims_lsi)
        print("LDA Model")
        sims_lda = index_lda[vec_lda]
        print(sims_lda)
It prints an LSA (LSI) score of 0 for both documents, while LDA reports a match above 90%. Where am I going wrong, and how should I modify the code to calculate the correct cosine similarity?
LSA Score [ 0.  0.]
LDA Score [ 0.94234258  0.9477495 ]
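As a baseline to compare the scores above against, here is a rough sketch of plain bag-of-words cosine similarity over keyword counts, written without gensim. The function names `split_keywords` and `cosine_similarity` and the shortened sample strings are hypothetical and only for illustration. This is not a replacement for the LSI or LDA models, just a sanity check on what keyword overlap alone would give.

```python
import math
from collections import Counter


def split_keywords(keyword_string):
    """Split a comma-separated keyword string into lowercase tokens, dropping blanks."""
    return [kw.strip().lower() for kw in keyword_string.split(",") if kw.strip()]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts as vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty vector has no direction, so report no similarity.
    return dot / (norm_a * norm_b)


# Shortened, illustrative samples in the same format as the data above.
doc_a = split_keywords("Molecular Biology,molecular biology,English,Islamabad")
doc_b = split_keywords("assay design,phd degree,scientific papers")
query = split_keywords("PCR,Molecular Biology,Cell biology")

score_a = cosine_similarity(query, doc_a)
score_b = cosine_similarity(query, doc_b)
print(score_a, score_b)
```

With these samples only the first document shares a keyword with the query, so it gets a non-zero score and the second gets exactly zero.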
Source: https://stackoverflow.com/questions/40436110/rake-with-gensim