cosine-similarity

Calculating tf-idf among documents using Python 2.7

非 Y 不嫁゛ submitted on 2019-12-29 08:08:27
Question: I have a scenario where I have retrieved information/raw data from the internet and placed it into respective JSON or .txt files. From there I would like to calculate the frequency of each term in each document and their cosine similarity using tf-idf. For example: there are 50 different documents/text files of 5,000 words/strings each. I would like to take the first word from the first document, compare it against all 250,000 words in total, find its frequencies, then do…
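A minimal sketch of the workflow this question describes, assuming scikit-learn's TfidfVectorizer and cosine_similarity are the tools of choice (the sample documents below are hypothetical stand-ins for the 50 files):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the 50 documents/text files.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix

# Pairwise cosine similarity between every pair of documents.
sims = cosine_similarity(tfidf)
print(sims.shape)  # (3, 3); sims[i, j] is the similarity of doc i and doc j
```

Term frequencies per document are available via `tfidf` itself (or `vectorizer.get_feature_names_out()` for the vocabulary), so there is no need to compare words one at a time.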

Using a csr_matrix of item similarities to get the items most similar to item X without transforming the csr_matrix to a dense matrix

∥☆過路亽.° submitted on 2019-12-24 19:07:03
Question: I have purchase data ( df_temp ). I switched from a Pandas DataFrame to a sparse csr_matrix because I have a lot of products (89,000) for which I have to get user-item information (purchased or not purchased) and then calculate the similarities between products. First, I converted the Pandas DataFrame to a NumPy array: df_user_product = df_temp[['user_id','product_id']].copy() ar1 = np.array(df_user_product.to_records(index=False)) Second, I created a coo_matrix because it's known…
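One way to get the items most similar to item X without ever densifying the similarity matrix is to keep scikit-learn's output sparse with `dense_output=False`. A sketch on a tiny hypothetical user-item matrix (the real one would be users x 89,000 products):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical purchase data: rows = users, cols = products, 1 = purchased.
rows = np.array([0, 0, 1, 1, 2, 2, 2])
cols = np.array([0, 1, 1, 2, 0, 2, 3])
data = np.ones(len(rows))
user_item = csr_matrix((data, (rows, cols)), shape=(3, 4))

# dense_output=False keeps the item-item similarity matrix sparse.
item_sims = cosine_similarity(user_item.T, dense_output=False)

# Most similar items to item X, densifying only one row at a time.
X = 0
row = item_sims.getrow(X).toarray().ravel()
row[X] = 0                       # exclude the item itself
top = np.argsort(row)[::-1][:2]  # indices of the 2 most similar items
print(top)
```

Only a single 1-by-n_items row is ever converted to dense, which stays cheap even at 89,000 products.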

Python: Cosine similarity between two large numpy arrays

依然范特西╮ submitted on 2019-12-24 10:30:48
Question: I have two numpy arrays: Array 1 is 500,000 rows x 100 cols; Array 2 is 160,000 rows x 100 cols. I would like to find the largest cosine similarity between each row in Array 1 and Array 2. In other words, I compute the cosine similarities between the first row of Array 1 and all the rows of Array 2 and find the maximum; then I do the same for the second row of Array 1; and do this…
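The full 500,000 x 160,000 similarity matrix will not fit in memory, but only the per-row maximum is needed, so the computation can be chunked. A sketch under that assumption, with small hypothetical arrays standing in for the real ones:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_cosine_per_row(a, b, chunk_size=1000):
    """For each row of `a`, the largest cosine similarity to any row of `b`.

    Chunking keeps peak memory at O(chunk_size * len(b)) instead of
    O(len(a) * len(b)).
    """
    maxes = np.empty(a.shape[0])
    for start in range(0, a.shape[0], chunk_size):
        chunk = a[start:start + chunk_size]
        sims = cosine_similarity(chunk, b)  # (chunk_size, len(b)) block
        maxes[start:start + chunk_size] = sims.max(axis=1)
    return maxes

# Small stand-ins for the 500,000 x 100 and 160,000 x 100 inputs.
a = np.random.rand(50, 10)
b = np.random.rand(40, 10)
print(max_cosine_per_row(a, b, chunk_size=7).shape)  # (50,)
```

Tuning `chunk_size` trades memory for speed; each block is still computed as one vectorized matrix product.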

Pairwise comparisons within a dataset

点点圈 submitted on 2019-12-24 06:35:06
Question: My data is 18 vectors, each with up to 200 numbers but some with 5 or some other count, organised as: [2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824] [2, 752, 753, 808, 843] [2, 752, 753, 843] [2, 752, 753, 808, 843] [3, 36, 37, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, ...] I would like to find the pair that is the most similar in this group of lists. The numbers…
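Since these are variable-length lists of identifiers rather than fixed-length vectors, one natural choice (an assumption here, not stated in the question) is to treat each list as a set and score every pair with Jaccard similarity:

```python
from itertools import combinations

# Hypothetical stand-ins for the 18 integer lists in the question.
vectors = [
    [2, 3, 35, 63, 64, 298, 523],
    [2, 752, 753, 808, 843],
    [2, 752, 753, 843],
]

def jaccard(a, b):
    """Similarity of two lists treated as sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Score every pair and keep the most similar one.
best = max(combinations(range(len(vectors)), 2),
           key=lambda ij: jaccard(vectors[ij[0]], vectors[ij[1]]))
print(best)  # (1, 2): these two lists share 4 of their 5 distinct numbers
```

With 18 lists there are only 153 pairs, so the brute-force `combinations` scan is entirely adequate.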

spaCy similarity method doesn't work correctly

落爺英雄遲暮 submitted on 2019-12-24 04:52:11
Question: I always get a lot of help from Stack Overflow. Thank you, as always. I am doing simple natural language processing using spaCy. I'm working on filtering out words by measuring the similarity between words. I wrote and used the following simple code, shown in the spaCy documentation, but the result does not match the documentation: import spacy nlp = spacy.load('en_core_web_lg') tokens = nlp('dog cat banana') for token1 in tokens: for token2 in tokens: sim = token1.similarity(token2)
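Under the hood, `token1.similarity(token2)` is just the cosine similarity of the two tokens' word vectors, so the exact numbers depend on the vectors shipped with the model version installed, which is why they can differ from the documentation's example output. A minimal NumPy sketch of the same computation, using hypothetical 4-d vectors in place of spaCy's 300-d ones:

```python
import numpy as np

def cosine(u, v):
    """What token1.similarity(token2) computes: the cosine of the
    angle between the two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "word vectors" standing in for spaCy's real ones.
dog = np.array([0.9, 0.1, 0.0, 0.3])
cat = np.array([0.8, 0.2, 0.1, 0.3])
banana = np.array([0.0, 0.9, 0.8, 0.1])

print(round(cosine(dog, cat), 3))     # high: related words
print(round(cosine(dog, banana), 3))  # low: unrelated words
```

The ranking (dog-cat above dog-banana) is what should be stable across model versions, even when the absolute values shift.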

Calculate cosine similarity of two matrices - Python

≡放荡痞女 submitted on 2019-12-23 11:14:08
Question: I have defined two matrices as follows: from scipy import linalg, mat, dot a = mat([-0.711,0.730]) b = mat([-1.099,0.124]) Now, I want to calculate the cosine similarity of these two matrices. What is wrong with the following code? It gives me an "objects are not aligned" error: c = dot(a,b)/np.linalg.norm(a)/np.linalg.norm(b) Answer 1: You cannot multiply a 1x2 matrix by a 1x2 matrix. In order to calculate the dot product between their rows, the second one has to be transposed. from scipy…
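The answer's fix, sketched with plain NumPy arrays (the deprecated `scipy.mat`/`scipy.dot` aliases from the question are avoided here):

```python
import numpy as np

a = np.array([[-0.711, 0.730]])
b = np.array([[-1.099, 0.124]])

# a and b are both 1x2, so a @ b is undefined; transposing b makes the
# shapes line up as (1x2) @ (2x1), which yields the scalar dot product.
c = (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))
print(float(c))  # ~0.774
```

Equivalently, `scipy.spatial.distance.cosine(a.ravel(), b.ravel())` returns the cosine *distance*, i.e. one minus this value.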

Why doesn't scikit-learn's Nearest Neighbors seem to return proper cosine similarity distances?

匆匆过客 submitted on 2019-12-21 16:48:36
Question: I am trying to use scikit-learn's Nearest Neighbors implementation to find the column vectors closest to a given column vector, out of a matrix of random values. This code is supposed to find the nearest neighbors of column 21, then check the actual cosine similarity of those neighbors against column 21. from sklearn.neighbors import NearestNeighbors import sklearn.metrics.pairwise as smp import numpy as np test=np.random.randint(0,5,(50,50)) nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto',…
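A likely cause (an assumption based on the visible snippet, since the question is truncated) is that `NearestNeighbors` defaults to the Minkowski/Euclidean metric; cosine distances have to be requested explicitly, which in turn requires the brute-force search. A sketch with the metric set correctly, plus a check that the returned distances really are 1 minus cosine similarity:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp

rng = np.random.RandomState(0)
test = rng.randint(0, 5, (50, 50)).astype(float)

# metric='cosine' makes kneighbors return cosine *distances*;
# it needs algorithm='brute' (trees don't support cosine).
nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine')
nbrs.fit(test.T)  # fit on the transpose so each column is a sample

distances, indices = nbrs.kneighbors(test.T[21].reshape(1, -1))

# Cosine distance is 1 - cosine similarity, so the two should agree.
sims = smp.cosine_similarity(test.T[indices[0]], test.T[21].reshape(1, -1))
print(np.allclose(1 - distances[0], sims.ravel()))  # True
```

Note that column 21 itself comes back as its own nearest neighbor (distance ~0), since it is part of the fitted data.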

Cosine distance as vector distance function for k-means

醉酒当歌 submitted on 2019-12-21 03:41:32
Question: I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each of N coefficients, where each coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited. E.g. for the graph, the vector v1 = {100, 50, 0, 30, 0} would mean that we spent: 100 secs at vertex 1, 50 secs at vertex 2, and 30 secs at vertex 4 (vertices 3 & 5 were not visited, thus the 0s). I want to run a k-means clustering and I've…
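A common workaround (not stated in the question, but standard practice) is that k-means with cosine distance is equivalent to ordinary k-means on L2-normalized vectors, because on unit vectors ||u - v||^2 = 2 - 2*cos(u, v). A sketch with hypothetical per-user duration vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Hypothetical per-user durations (seconds spent at each of 5 places).
X = np.array([
    [100.0, 50.0, 0.0, 30.0, 0.0],
    [200.0, 90.0, 0.0, 70.0, 0.0],   # same visiting pattern, scaled up
    [0.0, 0.0, 80.0, 0.0, 40.0],
    [0.0, 0.0, 160.0, 0.0, 70.0],    # same pattern as the row above
])

# Normalizing to unit length makes Euclidean k-means act like k-means
# with cosine distance ("spherical" k-means): only the visiting
# pattern matters, not the total time spent.
X_unit = normalize(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)
print(labels)  # users 0-1 share one cluster, users 2-3 the other
```

This matters here because two users with identical habits but different total time online would otherwise land in different Euclidean clusters.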

Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

瘦欲@ submitted on 2019-12-20 15:30:11
Question: I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search-engine/ranking machine learning algorithm. I'm doing this in an IPython notebook and am unfortunately running into MemoryErrors, and after a few hours of digging I'm not sure why. My setup: Lenovo E560 laptop, Core i7-6500U @ 2.50 GHz, 16 GB RAM, Windows 10…
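A frequent cause of this MemoryError (a plausible guess, since the question is truncated) is computing the full n x n pairwise similarity matrix when only row-i-vs-row-i similarities are needed. A sketch that stays sparse and produces an O(n) vector instead, with hypothetical query/title pairs standing in for the two DataFrame columns:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Hypothetical stand-ins for the query and product-title columns.
queries = ["red running shoes", "laptop bag", "coffee maker"]
titles = ["mens red shoes for running", "15 inch laptop sleeve", "espresso machine"]

# Fit one vocabulary over both columns so the vectors are comparable.
vec = TfidfVectorizer().fit(queries + titles)
Q = normalize(vec.transform(queries))
T = normalize(vec.transform(titles))

# Cosine of row i of Q with row i of T only: an O(n) result, never the
# O(n^2) pairwise matrix that exhausts memory.
row_sims = np.asarray(Q.multiply(T).sum(axis=1)).ravel()
print(row_sims.shape)  # (3,) -- one feature value per query/title pair
```

Because the rows are unit-normalized, the elementwise product summed per row is exactly the cosine similarity of each pair.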

Problems with pySpark columnSimilarities

↘锁芯ラ submitted on 2019-12-20 07:23:33
Question: tl;dr How do I use pySpark to compare the similarity of rows? I have a numpy array where I would like to compare the similarities of each row to one another print (pdArray) #[[ 0. 1. 0. ..., 0. 0. 0.] # [ 0. 0. 3. ..., 0. 0. 0.] # [ 0. 0. 0. ..., 0. 0. 7.] # ..., # [ 5. 0. 0. ..., 0. 1. 0.] # [ 0. 6. 0. ..., 0. 0. 3.] # [ 0. 0. 0. ..., 2. 0. 0.]] Using scikit-learn I can compute cosine similarities as follows... pyspark.__version__ # '2.2.0' from sklearn.metrics.pairwise import cosine_similarity
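The usual stumbling block here is that Spark's `RowMatrix.columnSimilarities()` computes cosine similarities between *columns*, so to compare rows the matrix must be transposed before it is handed to Spark. A NumPy/scikit-learn sketch of that distinction (a stand-in for the Spark call, since a cluster is assumed to be unavailable here):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for pdArray in the question.
pdArray = np.array([
    [0., 1., 0., 2.],
    [0., 0., 3., 0.],
    [5., 0., 0., 1.],
])

# Row-vs-row similarities: what the question wants, and what the
# scikit-learn call computes directly.
row_sims = cosine_similarity(pdArray)

# Column-vs-column similarities: what RowMatrix.columnSimilarities()
# returns -- so in Spark, build the RowMatrix from the transpose.
col_sims = cosine_similarity(pdArray.T)

print(row_sims.shape, col_sims.shape)  # (3, 3) (4, 4)
```

In pySpark terms, the fix is along the lines of `RowMatrix(sc.parallelize(pdArray.T.tolist())).columnSimilarities()`, which yields row-vs-row similarities of the original array (upper triangle only, as a sparse CoordinateMatrix).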