cosine-similarity

How to efficiently compute similarity between documents in a stream of documents

Submitted by 十年热恋 on 2019-12-03 14:12:40
I gather text documents (in Node.js) where each document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, given that new documents keep arriving as a stream? I currently use cosine similarity on the normalized term frequency of the words within each document. I don't use TF-IDF (term frequency, inverse document frequency) because of the scalability issue as I get more and more documents. My first version was to start with the currently available documents and compute a big term-document matrix
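The question itself is about Node.js, but the approach is language-agnostic; below is a minimal Python sketch (an illustration, not the asker's code; names like on_new_document are made up) that keeps one normalized term-frequency vector per stored document and scores each newly arriving document against the ones already seen.

from collections import Counter
import math

def tf_vector(words):
    # Normalized term frequency: word count divided by document length.
    counts = Counter(words)
    total = float(len(words))
    return {w: c / total for w, c in counts.items()}

def cosine(a, b):
    # Cosine similarity between two sparse dict vectors.
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

stored = []  # list of (doc_id, tf vector) for documents seen so far

def on_new_document(doc_id, words):
    # Compare the incoming document against every stored one, then store it.
    vec = tf_vector(words)
    sims = [(other_id, cosine(vec, other)) for other_id, other in stored]
    stored.append((doc_id, vec))
    return sims

Each new document costs one pass over the documents already stored; a common next step is an inverted index so a new document is only compared against documents that share at least one word with it.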

Cosine distance as vector distance function for k-means

Submitted by 我只是一个虾纸丫 on 2019-12-03 11:52:16
I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each with N coefficients, where a coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited. E.g., for the graph, the vector v1 = {100, 50, 0, 30, 0} would mean that we spent 100 s at vertex 1, 50 s at vertex 2 and 30 s at vertex 4 (vertices 3 and 5 were not visited, hence the 0s). I want to run a k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the metric for the distances, where the formula for cosine
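A small sketch of the distance function as defined in the question, with v1 taken from the example above and v2 made up for illustration:

import numpy as np

def cosine_distance(u, v):
    # cosine_distance = 1 - cosine_similarity, as defined in the question.
    den = np.linalg.norm(u) * np.linalg.norm(v)
    return 1.0 - (np.dot(u, v) / den if den else 0.0)

v1 = np.array([100.0, 50.0, 0.0, 30.0, 0.0])   # durations from the example
v2 = np.array([0.0, 120.0, 0.0, 10.0, 40.0])   # hypothetical second user
print(cosine_distance(v1, v2))

One caveat worth keeping in mind: plain k-means forms centroids by averaging points, which is only consistent with cosine distance if the vectors are first normalized to unit length (spherical k-means). For unit vectors, squared Euclidean distance equals 2 * (1 - cosine_similarity), so normalizing and running standard k-means preserves the same distance ordering.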

Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

Submitted by 梦想与她 on 2019-12-03 05:04:12
I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search engine/ranking machine learning algorithm. I'm doing this in an IPython notebook and am unfortunately running into MemoryErrors, and after a few hours of digging I'm not sure why. My setup: Lenovo E560 laptop, Core i7-6500U @ 2.50 GHz, 16 GB RAM, Windows 10, using the Anaconda 3.5 kernel with a fresh update of all libraries. I've tested my code/goal on a small
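A common source of MemoryError in this setup is building the full query-by-title pairwise matrix when only the row-wise value (query i vs. title i) is needed. A hedged sketch with made-up example data: fit one TF-IDF vocabulary over both columns, then multiply matching rows only.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "search_term": ["angle bracket", "metal plate"],                # hypothetical data
    "product_title": ["angle bracket galvanized", "steel plate 12 gauge"],
})

# Fit a single vocabulary over both columns so the vectors share one space.
vec = TfidfVectorizer()
vec.fit(pd.concat([df["search_term"], df["product_title"]]))

q = vec.transform(df["search_term"])
t = vec.transform(df["product_title"])

# TfidfVectorizer L2-normalizes each row by default, so the row-wise dot
# product is already the cosine similarity; no (n x n) pairwise matrix needed.
df["cosine"] = np.asarray(q.multiply(t).sum(axis=1)).ravel()
print(df)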

Why is the result 1 when the two vectors are not similar?

Submitted by 只谈情不闲聊 on 2019-12-02 13:29:46
I'm using the cosine similarity formula to calculate the similarity between two vectors. I tried two different vectors like this: Vector1(-1237373741, 27, 1, 1, 331289590, 1818540802) Vector2(-1237373741, 49, 1, 1, 331289590, 1818540802). The two vectors are slightly different, but the result is 1. I don't know why. Can anyone explain this to me? Thanks so much. For the most part, those two vectors are pointing in the same direction (the larger coordinates dominate the small difference in the other coordinate). A cosine similarity of ~1 is therefore expected (remember that cos(0) = 1).
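Plugging the two vectors from the question into the formula shows what happens numerically:

import numpy as np

v1 = np.array([-1237373741, 27, 1, 1, 331289590, 1818540802], dtype=float)
v2 = np.array([-1237373741, 49, 1, 1, 331289590, 1818540802], dtype=float)

cos = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)  # extremely close to 1.0: the 27-vs-49 difference is negligible
            # next to coordinates on the order of 10^9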

Problems with pySpark columnSimilarities

Submitted by 久未见 on 2019-12-02 08:44:42
tl;dr How do I use pySpark to compare the similarity of rows? I have a numpy array where I would like to compare the similarities of each row to one another: print(pdArray) #[[ 0. 1. 0. ..., 0. 0. 0.] # [ 0. 0. 3. ..., 0. 0. 0.] # [ 0. 0. 0. ..., 0. 0. 7.] # ..., # [ 5. 0. 0. ..., 0. 1. 0.] # [ 0. 6. 0. ..., 0. 0. 3.] # [ 0. 0. 0. ..., 2. 0. 0.]] Using scikit-learn I can compute cosine similarities as follows... pyspark.__version__ # '2.2.0' from sklearn.metrics.pairwise import cosine_similarity similarities = cosine_similarity(pdArray) similarities.shape # (475, 475) print(similarities) array([[ 1
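A sketch of the usual workaround in pyspark.mllib (assuming Spark 2.x, as in the question): columnSimilarities() compares columns, so transpose the matrix first to get row-vs-row cosine similarities. The random array below is a stand-in for the real data.

import numpy as np
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pdArray = np.random.rand(475, 30)   # stand-in for the real numpy array

rows = sc.parallelize(
    [IndexedRow(i, row.tolist()) for i, row in enumerate(pdArray)]
)
mat = IndexedRowMatrix(rows)

# Transpose so the original rows become columns, then compare columns.
sims = (mat.toCoordinateMatrix()
           .transpose()
           .toRowMatrix()
           .columnSimilarities())

# sims is a CoordinateMatrix of upper-triangular (i, j, cosine) entries.
print(sims.entries.take(5))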

RAKE with GENSIM

Submitted by [亡魂溺海] on 2019-12-02 03:57:35
I am trying to calculate similarity. First of all I used the RAKE library to extract the keywords from the crawled jobs. Then I put the keywords of every job into a separate array and then combined all those arrays into documentArray. documentArray = ['Anger command,Assertiveness,Approachability,Adaptability,Authenticity,Aggressiveness,Analytical thinking,Molecular Biology,Molecular Biology,Molecular Biology,molecular biology,molecular biology,Master,English,Molecular Biology,,Islamabad,Islamabad District,Islamabad Capital Territory,Pakistan,,Rawalpindi,Rawalpindi,Punjab,Pakistan'"], ['competitive
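A hedged sketch of the gensim side, using made-up keyword lists in place of the real documentArray: build a dictionary and bag-of-words corpus from the per-job keyword lists, apply a TF-IDF model, and query a cosine similarity index.

from gensim import corpora, models, similarities

# Hypothetical keyword lists (one per job) standing in for documentArray.
documents = [
    ["molecular biology", "assertiveness", "analytical thinking", "english"],
    ["competitive analysis", "molecular biology", "marketing"],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

tfidf = models.TfidfModel(corpus)
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

query = dictionary.doc2bow(["molecular biology", "english"])
print(index[tfidf[query]])   # cosine similarity of the query against every job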

Mahout rowSimilarity

Submitted by 半腔热情 on 2019-12-02 02:43:34
I am trying to compute row similarity between Wikipedia documents. I have the tf-idf vectors in the format Key class: org.apache.hadoop.io.Text, Value class: org.apache.mahout.math.VectorWritable. I am following the quick tour of text analysis from here: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line I created a Mahout matrix as follows: mahout rowid \ -i wikipedia-vectors/tfidf-vectors/part-r-00000 -o wikipedia-matrix I got the number of generated rows and columns: vectors.RowIdJob: Wrote out matrix with 4587604 rows and
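The job itself runs on Hadoop via the Mahout CLI, but as a small local illustration of what row similarity produces (the top-N most cosine-similar rows per row, self-similarity excluded), here is a scikit-learn sketch on a toy matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

tfidf = np.random.rand(6, 10)        # toy stand-in for the tf-idf vectors

sims = cosine_similarity(tfidf)
np.fill_diagonal(sims, 0.0)          # exclude self-similarity

top_n = 3
for row_id, row in enumerate(sims):
    best = np.argsort(row)[::-1][:top_n]
    print(row_id, [(int(j), round(float(row[j]), 3)) for j in best])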

DBSCAN error with cosine metric in python

Submitted by 泪湿孤枕 on 2019-12-01 17:16:27
I was trying to use the DBSCAN algorithm from the scikit-learn library with the cosine metric but got stuck with an error. The line of code is db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X) where X is a csr_matrix. The error is the following: Metric 'cosine' not valid for algorithm 'auto', though the documentation says that it is possible to use this metric. I tried the options algorithm='kd_tree' and 'ball_tree' but got the same error. However, there is no error if I use the euclidean or, say, the l1 metric. The matrix X is large, so I can't use a precomputed matrix of pairwise distances. I use python 2
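The kd_tree and ball_tree neighbour searches only support true metrics, which cosine is not, so the usual fix is to force brute-force search, which does accept the cosine metric and sparse input. A sketch with a toy sparse matrix standing in for the real X:

from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

X = csr_matrix([[1.0, 0.0, 2.0],
                [0.9, 0.1, 2.1],
                [0.0, 5.0, 0.0]])    # toy stand-in for the real csr_matrix

# Brute-force neighbour search works with metric='cosine' on sparse data.
db = DBSCAN(eps=0.1, min_samples=2, metric='cosine', algorithm='brute').fit(X)
print(db.labels_)                    # first two rows cluster together, third is noise (-1)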

Spark cosine distance between rows using Dataframe

Submitted by 喜你入骨 on 2019-11-30 15:33:56
I have to compute a cosine distance between each pair of rows, but I have no idea how to do it elegantly using the Spark DataFrame API. The idea is to compute similarities for each row (item) and take the top 10 by comparing the rows' similarities with one another. --> This is needed for an item-item recommender system. All that I've read about it refers to computing similarity over columns: Apache Spark Python Cosine Similarity over DataFrames. Can someone say whether it is possible to compute a cosine distance elegantly between rows using PySpark's DataFrame API or RDDs, or do I have to do it manually? That's just
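One straightforward (if not especially elegant) DataFrame route is a self cross join plus a cosine UDF; the transpose-then-columnSimilarities trick from the previous entry is the other common option. A sketch with made-up item vectors:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical item vectors: (item id, feature vector).
df = spark.createDataFrame(
    [(0, Vectors.dense([0.0, 1.0, 0.0])),
     (1, Vectors.dense([0.0, 0.5, 3.0])),
     (2, Vectors.dense([5.0, 0.0, 1.0]))],
    ["id", "features"],
)

@F.udf(DoubleType())
def cos_sim(a, b):
    # Cosine similarity between two ml Vectors.
    den = float(a.norm(2) * b.norm(2))
    return float(a.dot(b)) / den if den != 0.0 else 0.0

pairs = (df.alias("a")
           .crossJoin(df.alias("b"))
           .where(F.col("a.id") < F.col("b.id"))
           .select(F.col("a.id").alias("i"),
                   F.col("b.id").alias("j"),
                   cos_sim(F.col("a.features"), F.col("b.features")).alias("cosine")))

# Top 10 most similar item pairs.
pairs.orderBy(F.desc("cosine")).show(10)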