cosine-similarity

Calculating tf-idf among documents using Python 2.7

非 Y 不嫁゛ submitted on 2019-12-29 08:08:27
Question: I have a scenario where I have retrieved information/raw data from the internet and placed it into respective JSON or .txt files. From there I would like to calculate the frequency of each term in each document and their cosine similarity using tf-idf. For example: there are 50 different documents/text files of 5,000 words/strings each. I would like to take the first word from the first document, compare it against all 250,000 words in total, find its frequencies, then do…
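A minimal sketch of the workflow this question describes, assuming scikit-learn's TfidfVectorizer and cosine_similarity are the tools of choice (the sample documents below are hypothetical stand-ins for the 50 files):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the 50 documents/text files.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix

# Pairwise cosine similarity between every pair of documents.
sims = cosine_similarity(tfidf)
print(sims.shape)  # (3, 3); sims[i, j] is the similarity of doc i and doc j
```

Term frequencies per document are available via `tfidf` itself (or `vectorizer.get_feature_names_out()` for the vocabulary), so there is no need to compare words one at a time.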

Using a csr_matrix of item similarities to get the items most similar to item X without transforming the csr_matrix to a dense matrix

∥☆過路亽.° submitted on 2019-12-24 19:07:03
Question: I have purchase data ( df_temp ). I switched from a Pandas DataFrame to a sparse csr_matrix because I have a lot of products (89,000) for which I have to get user-item information (purchased or not purchased) and then calculate the similarities between products. First, I converted the Pandas DataFrame to a NumPy array: df_user_product = df_temp[['user_id','product_id']].copy() ar1 = np.array(df_user_product.to_records(index=False)) Second, I created a coo_matrix because it's known…
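One way to get the items most similar to item X without ever densifying the similarity matrix is to keep scikit-learn's output sparse with `dense_output=False`. A sketch on a tiny hypothetical user-item matrix (the real one would be users x 89,000 products):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical purchase data: rows = users, cols = products, 1 = purchased.
rows = np.array([0, 0, 1, 1, 2, 2, 2])
cols = np.array([0, 1, 1, 2, 0, 2, 3])
data = np.ones(len(rows))
user_item = csr_matrix((data, (rows, cols)), shape=(3, 4))

# dense_output=False keeps the item-item similarity matrix sparse.
item_sims = cosine_similarity(user_item.T, dense_output=False)

# Most similar items to item X, densifying only one row at a time.
X = 0
row = item_sims.getrow(X).toarray().ravel()
row[X] = 0                       # exclude the item itself
top = np.argsort(row)[::-1][:2]  # indices of the 2 most similar items
print(top)
```

Only a single 1-by-n_items row is ever converted to dense, which stays cheap even at 89,000 products.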

Python: Cosine similarity between two large numpy arrays

依然范特西╮ submitted on 2019-12-24 10:30:48
Question: I have two numpy arrays: Array 1 is 500,000 rows x 100 cols; Array 2 is 160,000 rows x 100 cols. I would like to find the largest cosine similarity between each row in Array 1 and Array 2. In other words, I compute the cosine similarities between the first row of Array 1 and all the rows of Array 2 and find the maximum; then I do the same for the second row of Array 1; and do this…
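The full 500,000 x 160,000 similarity matrix will not fit in memory, but only the per-row maximum is needed, so the computation can be chunked. A sketch under that assumption, with small hypothetical arrays standing in for the real ones:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_cosine_per_row(a, b, chunk_size=1000):
    """For each row of `a`, the largest cosine similarity to any row of `b`.

    Chunking keeps peak memory at O(chunk_size * len(b)) instead of
    O(len(a) * len(b)).
    """
    maxes = np.empty(a.shape[0])
    for start in range(0, a.shape[0], chunk_size):
        chunk = a[start:start + chunk_size]
        sims = cosine_similarity(chunk, b)  # (chunk_size, len(b)) block
        maxes[start:start + chunk_size] = sims.max(axis=1)
    return maxes

# Small stand-ins for the 500,000 x 100 and 160,000 x 100 inputs.
a = np.random.rand(50, 10)
b = np.random.rand(40, 10)
print(max_cosine_per_row(a, b, chunk_size=7).shape)  # (50,)
```

Tuning `chunk_size` trades memory for speed; each block is still computed as one vectorized matrix product.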

Pairwise comparisons within a dataset

点点圈 submitted on 2019-12-24 06:35:06
Question: My data is 18 vectors, each with up to 200 numbers but some with 5 or some other count, organised as: [2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824] [2, 752, 753, 808, 843] [2, 752, 753, 843] [2, 752, 753, 808, 843] [3, 36, 37, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, ...] I would like to find the pair that is the most similar in this group of lists. The numbers…
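Since these are variable-length lists of identifiers rather than fixed-length vectors, one natural choice (an assumption here, not stated in the question) is to treat each list as a set and score every pair with Jaccard similarity:

```python
from itertools import combinations

# Hypothetical stand-ins for the 18 integer lists in the question.
vectors = [
    [2, 3, 35, 63, 64, 298, 523],
    [2, 752, 753, 808, 843],
    [2, 752, 753, 843],
]

def jaccard(a, b):
    """Similarity of two lists treated as sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Score every pair and keep the most similar one.
best = max(combinations(range(len(vectors)), 2),
           key=lambda ij: jaccard(vectors[ij[0]], vectors[ij[1]]))
print(best)  # (1, 2): these two lists share 4 of their 5 distinct numbers
```

With 18 lists there are only 153 pairs, so the brute-force `combinations` scan is entirely adequate.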

spaCy similarity method doesn't work correctly

落爺英雄遲暮 submitted on 2019-12-24 04:52:11
Question: I always get a lot of help from Stack Overflow. Thank you, as always. I am doing simple natural language processing using spaCy. I'm working on filtering out words by measuring the similarity between words. I wrote and used the following simple code, shown in the spaCy documentation, but the result does not match the documentation: import spacy nlp = spacy.load('en_core_web_lg') tokens = nlp('dog cat banana') for token1 in tokens: for token2 in tokens: sim = token1.similarity(token2)
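Under the hood, `token1.similarity(token2)` is just the cosine similarity of the two tokens' word vectors, so the exact numbers depend on the vectors shipped with the model version installed, which is why they can differ from the documentation's example output. A minimal NumPy sketch of the same computation, using hypothetical 4-d vectors in place of spaCy's 300-d ones:

```python
import numpy as np

def cosine(u, v):
    """What token1.similarity(token2) computes: the cosine of the
    angle between the two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "word vectors" standing in for spaCy's real ones.
dog = np.array([0.9, 0.1, 0.0, 0.3])
cat = np.array([0.8, 0.2, 0.1, 0.3])
banana = np.array([0.0, 0.9, 0.8, 0.1])

print(round(cosine(dog, cat), 3))     # high: related words
print(round(cosine(dog, banana), 3))  # low: unrelated words
```

The ranking (dog-cat above dog-banana) is what should be stable across model versions, even when the absolute values shift.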

Calculate cosine similarity of two matrices - Python

≡放荡痞女 submitted on 2019-12-23 11:14:08
Question: I have defined two matrices as follows: from scipy import linalg, mat, dot a = mat([-0.711,0.730]) b = mat([-1.099,0.124]) Now, I want to calculate the cosine similarity of these two matrices. What is wrong with the following code? It gives me an "objects are not aligned" error: c = dot(a,b)/np.linalg.norm(a)/np.linalg.norm(b) Answer 1: You cannot multiply a 1x2 matrix by a 1x2 matrix. In order to calculate the dot product between their rows, the second one has to be transposed. from scipy…
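The answer's fix, sketched with plain NumPy arrays (the deprecated `scipy.mat`/`scipy.dot` aliases from the question are avoided here):

```python
import numpy as np

a = np.array([[-0.711, 0.730]])
b = np.array([[-1.099, 0.124]])

# a and b are both 1x2, so a @ b is undefined; transposing b makes the
# shapes line up as (1x2) @ (2x1), which yields the scalar dot product.
c = (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))
print(float(c))  # ~0.774
```

Equivalently, `scipy.spatial.distance.cosine(a.ravel(), b.ravel())` returns the cosine *distance*, i.e. one minus this value.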

Why doesn't scikit-learn's Nearest Neighbors seem to return proper cosine similarity distances?

匆匆过客 submitted on 2019-12-21 16:48:36
Question: I am trying to use scikit-learn's Nearest Neighbors implementation to find the column vectors closest to a given column vector, out of a matrix of random values. This code is supposed to find the nearest neighbors of column 21, then check the actual cosine similarity of those neighbors against column 21. from sklearn.neighbors import NearestNeighbors import sklearn.metrics.pairwise as smp import numpy as np test=np.random.randint(0,5,(50,50)) nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto',…
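A likely cause (an assumption based on the visible snippet, since the question is truncated) is that `NearestNeighbors` defaults to the Minkowski/Euclidean metric; cosine distances have to be requested explicitly, which in turn requires the brute-force search. A sketch with the metric set correctly, plus a check that the returned distances really are 1 minus cosine similarity:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp

rng = np.random.RandomState(0)
test = rng.randint(0, 5, (50, 50)).astype(float)

# metric='cosine' makes kneighbors return cosine *distances*;
# it needs algorithm='brute' (trees don't support cosine).
nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine')
nbrs.fit(test.T)  # fit on the transpose so each column is a sample

distances, indices = nbrs.kneighbors(test.T[21].reshape(1, -1))

# Cosine distance is 1 - cosine similarity, so the two should agree.
sims = smp.cosine_similarity(test.T[indices[0]], test.T[21].reshape(1, -1))
print(np.allclose(1 - distances[0], sims.ravel()))  # True
```

Note that column 21 itself comes back as its own nearest neighbor (distance ~0), since it is part of the fitted data.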

Cosine distance as vector distance function for k-means

醉酒当歌 submitted on 2019-12-21 03:41:32
Question: I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each of N coefficients, where each coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited. E.g. for the graph, the vector v1 = {100, 50, 0, 30, 0} would mean that we spent: 100 secs at vertex 1, 50 secs at vertex 2, and 30 secs at vertex 4 (vertices 3 & 5 were not visited, thus the 0s). I want to run a k-means clustering and I've…
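A common workaround (not stated in the question, but standard practice) is that k-means with cosine distance is equivalent to ordinary k-means on L2-normalized vectors, because on unit vectors ||u - v||^2 = 2 - 2*cos(u, v). A sketch with hypothetical per-user duration vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Hypothetical per-user durations (seconds spent at each of 5 places).
X = np.array([
    [100.0, 50.0, 0.0, 30.0, 0.0],
    [200.0, 90.0, 0.0, 70.0, 0.0],   # same visiting pattern, scaled up
    [0.0, 0.0, 80.0, 0.0, 40.0],
    [0.0, 0.0, 160.0, 0.0, 70.0],    # same pattern as the row above
])

# Normalizing to unit length makes Euclidean k-means act like k-means
# with cosine distance ("spherical" k-means): only the visiting
# pattern matters, not the total time spent.
X_unit = normalize(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)
print(labels)  # users 0-1 share one cluster, users 2-3 the other
```

This matters here because two users with identical habits but different total time online would otherwise land in different Euclidean clusters.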

Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

瘦欲@ submitted on 2019-12-20 15:30:11
Question: I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search-engine/ranking machine learning algorithm. I'm doing this in an IPython notebook and am unfortunately running into MemoryErrors, and after a few hours of digging I'm not sure why. My setup: Lenovo E560 laptop, Core i7-6500U @ 2.50 GHz, 16 GB RAM, Windows 10…
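A frequent cause of this MemoryError (a plausible guess, since the question is truncated) is computing the full n x n pairwise similarity matrix when only row-i-vs-row-i similarities are needed. A sketch that stays sparse and produces an O(n) vector instead, with hypothetical query/title pairs standing in for the two DataFrame columns:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Hypothetical stand-ins for the query and product-title columns.
queries = ["red running shoes", "laptop bag", "coffee maker"]
titles = ["mens red shoes for running", "15 inch laptop sleeve", "espresso machine"]

# Fit one vocabulary over both columns so the vectors are comparable.
vec = TfidfVectorizer().fit(queries + titles)
Q = normalize(vec.transform(queries))
T = normalize(vec.transform(titles))

# Cosine of row i of Q with row i of T only: an O(n) result, never the
# O(n^2) pairwise matrix that exhausts memory.
row_sims = np.asarray(Q.multiply(T).sum(axis=1)).ravel()
print(row_sims.shape)  # (3,) -- one feature value per query/title pair
```

Because the rows are unit-normalized, the elementwise product summed per row is exactly the cosine similarity of each pair.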

Problems with pySpark columnSimilarities

↘锁芯ラ submitted on 2019-12-20 07:23:33
Question: tl;dr How do I use pySpark to compare the similarity of rows? I have a numpy array where I would like to compare the similarities of each row to one another print (pdArray) #[[ 0. 1. 0. ..., 0. 0. 0.] # [ 0. 0. 3. ..., 0. 0. 0.] # [ 0. 0. 0. ..., 0. 0. 7.] # ..., # [ 5. 0. 0. ..., 0. 1. 0.] # [ 0. 6. 0. ..., 0. 0. 3.] # [ 0. 0. 0. ..., 2. 0. 0.]] Using scikit-learn I can compute cosine similarities as follows... pyspark.__version__ # '2.2.0' from sklearn.metrics.pairwise import cosine_similarity
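The usual stumbling block here is that Spark's `RowMatrix.columnSimilarities()` computes cosine similarities between *columns*, so to compare rows the matrix must be transposed before it is handed to Spark. A NumPy/scikit-learn sketch of that distinction (a stand-in for the Spark call, since a cluster is assumed to be unavailable here):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for pdArray in the question.
pdArray = np.array([
    [0., 1., 0., 2.],
    [0., 0., 3., 0.],
    [5., 0., 0., 1.],
])

# Row-vs-row similarities: what the question wants, and what the
# scikit-learn call computes directly.
row_sims = cosine_similarity(pdArray)

# Column-vs-column similarities: what RowMatrix.columnSimilarities()
# returns -- so in Spark, build the RowMatrix from the transpose.
col_sims = cosine_similarity(pdArray.T)

print(row_sims.shape, col_sims.shape)  # (3, 3) (4, 4)
```

In pySpark terms, the fix is along the lines of `RowMatrix(sc.parallelize(pdArray.T.tolist())).columnSimilarities()`, which yields row-vs-row similarities of the original array (upper triangle only, as a sparse CoordinateMatrix).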