cosine-similarity

cosine similarity on large sparse matrix with numpy

Submitted by 自闭症网瘾萝莉.ら on 2019-12-05 11:31:17
The code below causes my system to run out of memory before it completes. Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as the one below? I would like to have the cosine similarity computed for each of the 65000 rows in my original matrix (mat) relative to all of the others, so that the result is a 65000 x 65000 matrix where each element is the cosine similarity between two rows in the original matrix.

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

mat = np.random.rand(65000, 10)
sparse_mat
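A common workaround, not from the original post, is to compute the result in blocks of rows so the full 65000 x 65000 matrix never has to sit in memory at once; each block can be processed or written out as soon as it is produced. A minimal sketch assuming the scikit-learn API shown above:

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

mat = sparse.csr_matrix(np.random.rand(65000, 10))

block = 1000  # rows per chunk; tune this against available memory
for start in range(0, mat.shape[0], block):
    stop = min(start + block, mat.shape[0])
    # similarities of this block of rows against all rows: shape (block, 65000)
    sims = cosine_similarity(mat[start:stop], mat)
    # process or persist the block here instead of keeping everything in RAM
    np.save(f"sims_{start}_{stop}.npy", sims)

Each block here is roughly 1000 x 65000 float64 values (about 0.5 GB), which is the knob to trade memory for the number of passes.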

Mahout: adjusted cosine similarity for item based recommender

Submitted by 有些话、适合烂在心里 on 2019-12-05 07:48:43
Question: For an assignment I'm supposed to test different types of recommenders, which I have to implement first. I've been looking around for a good library to do that (I had thought about Weka at first) and stumbled upon Mahout. I must therefore put forward that: a) I'm completely new to Mahout, b) I do not have a strong background in recommenders or their algorithms (otherwise I wouldn't be doing this class...), and c) sorry, but I'm far from being the best developer in the world ==> I'd appreciate
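For reference, and not part of the original question: the "adjusted" cosine used in item-based recommenders subtracts each user's mean rating before computing the cosine between item columns. A small NumPy sketch of that formula, independent of Mahout:

import numpy as np

# toy user-item rating matrix; 0 means "not rated"
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])

rated = R > 0
user_means = np.where(rated.any(axis=1),
                      R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1),
                      0.0)
# subtract each user's mean, but only where a rating actually exists
centered = np.where(rated, R - user_means[:, None], 0.0)

def adjusted_cosine(i, j):
    # restrict to users who rated both item i and item j
    both = rated[:, i] & rated[:, j]
    if not both.any():
        return 0.0
    u, v = centered[both, i], centered[both, j]
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(adjusted_cosine(0, 3))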

Calculating cosine similarity by featurizing the text into vectors using tf-idf

Submitted by 限于喜欢 on 2019-12-04 15:02:27
I'm new to Apache Spark and want to find similar text in a collection of texts. I have tried the following: I have two RDDs. The first RDD contains incomplete addresses:

[0,541 Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans,LA, 70116]
[2,#69, Los Angeles, Los Angeles, CA, 90034]
[3,98 Connecticut Ave Nw, Chagrin Falls]
[4,56 E Morehead Webb, TX, 78045]

The second RDD contains the correct addresses:

[0,541 Jefferson Avenue, Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans, Orleans, LA, 70116]
[2,25 E 75th St #69, Los Angeles, Los Angeles, CA, 90034]
[3,98
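As an illustration of the approach in the title (not the Spark code from the question), the usual pattern is to fit one TF-IDF vocabulary over both sets of strings and take the most similar "correct" address for each incomplete one. A sketch in plain scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

incomplete = ["541 Suite 204, Redwood City, CA 94063",
              "6649 N Blue Gum St, New Orleans,LA, 70116"]
correct = ["541 Jefferson Avenue, Suite 204, Redwood City, CA 94063",
           "6649 N Blue Gum St, New Orleans, Orleans, LA, 70116"]

# fit one vocabulary over both sets so the vectors are comparable
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tfidf = vec.fit_transform(incomplete + correct)
inc_vecs, cor_vecs = tfidf[:len(incomplete)], tfidf[len(incomplete):]

sims = cosine_similarity(inc_vecs, cor_vecs)   # shape: (len(incomplete), len(correct))
best = sims.argmax(axis=1)                     # index of the closest correct address
for i, j in enumerate(best):
    print(incomplete[i], "->", correct[j], round(float(sims[i, j]), 3))

Character n-grams are used here only because they tolerate the missing tokens in the incomplete addresses; word tokens work the same way.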

How can I calculate cosine similarity between two string vectors?

Submitted by 我的未来我决定 on 2019-12-04 13:08:20
I have two vectors of dimension 6 and I would like to get a number between 0 and 1.

a = c("HDa","2Pb","2","BxU","BuQ","Bve")
b = c("HCK","2Pb","2","09","F","G")

Can anyone explain what I should do? Using the lsa package and the manual for this package:

# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))
# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)

EDIT: this is what the myMatrix object looks like:

myMatrix
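Setting the R specifics aside, the underlying computation is just a cosine between term-occurrence vectors built over the union of the tokens. A small Python sketch of that idea (an illustration, not the lsa package):

import numpy as np

a = ["HDa", "2Pb", "2", "BxU", "BuQ", "Bve"]
b = ["HCK", "2Pb", "2", "09", "F", "G"]

vocab = sorted(set(a) | set(b))                     # union of all tokens
va = np.array([a.count(t) for t in vocab], float)   # term counts for a
vb = np.array([b.count(t) for t in vocab], float)   # term counts for b

cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
print(cos)   # 2 shared tokens out of 6 each -> 2/6 ≈ 0.333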

TfIdfVectorizer: How does the vectorizer with fixed vocab deal with new words?

Submitted by 痞子三分冷 on 2019-12-04 11:50:39
I'm working on a corpus of ~100k research papers and am considering three fields: plaintext, title, and abstract. I used TfidfVectorizer to get a TF-IDF representation of the plaintext field and feed the resulting vocabulary back into the vectorizers for title and abstract, to ensure that all three representations work on the same vocabulary. My idea was that since the plaintext field is much bigger than the other two, its vocabulary will most probably cover all the words in the other fields. But how would TfidfVectorizer deal with new words/tokens if that weren't the case? Here's an example
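For what it's worth (not from the original post): scikit-learn's TfidfVectorizer simply ignores tokens that are absent from a fixed vocabulary passed via the vocabulary= parameter, so unseen words contribute nothing to the resulting vector. A quick sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

plaintext = ["graph neural networks for citation analysis"]
titles = ["novel transformer networks"]   # "novel" and "transformer" are not in the plaintext vocab

# build the vocabulary from the large field ...
plain_vec = TfidfVectorizer()
plain_vec.fit(plaintext)

# ... and reuse it for the smaller field
title_vec = TfidfVectorizer(vocabulary=plain_vec.vocabulary_)
title_tfidf = title_vec.fit_transform(titles)

print(sorted(title_vec.vocabulary_.keys()))
print(title_tfidf.toarray())   # only "networks" gets a non-zero weight; unknown tokens are dropped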

SQL Computation of Cosine Similarity

Submitted by 断了今生、忘了曾经 on 2019-12-04 09:01:15
Suppose you have a table in a database constructed as follows:

create table data (v int, base int, w_td float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (1234,2,5);
insert into data values (1234,3,2);
insert into data values (1234,4,3);

To be clear, select * from data should output:

v   |base|w_td
--------------
99  |1   |4.0
99  |2   |3.0
99  |3   |4.0
1234|2   |5.0
1234|3   |2.0
1234|4   |3.0

Note that since the vectors are stored in a database, we need only store the non-zero entries. In this example, we only have two
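As an aside (not part of the question), the computation the SQL has to express is a join on the shared base dimension for the dot product, divided by each vector's norm. A Python sketch of that same logic over the rows above:

import math
from collections import defaultdict

# (v, base, w_td) rows as they would come back from the table
rows = [(99, 1, 4.0), (99, 2, 3.0), (99, 3, 4.0),
        (1234, 2, 5.0), (1234, 3, 2.0), (1234, 4, 3.0)]

vectors = defaultdict(dict)
for v, base, w in rows:
    vectors[v][base] = w               # sparse representation: only non-zero entries

def cosine(x, y):
    shared = set(x) & set(y)           # the "join" on base
    dot = sum(x[b] * y[b] for b in shared)
    return dot / (math.sqrt(sum(w * w for w in x.values())) *
                  math.sqrt(sum(w * w for w in y.values())))

print(cosine(vectors[99], vectors[1234]))   # ≈ 0.583 for these two vectors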

Why doesn't scikit-learn's Nearest Neighbors seem to return proper cosine similarity distances?

Submitted by ε祈祈猫儿з on 2019-12-04 07:00:44
I am trying to use scikit's Nearest Neighbor implementation to find the closest column vectors to a given column vector, out of a matrix of random values. This code is supposed to find the nearest neighbors of column 21, then check the actual cosine similarity of those neighbors against column 21.

from sklearn.neighbors import NearestNeighbors
import sklearn.metrics.pairwise as smp
import numpy as np

test = np.random.randint(0,5,(50,50))
nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric=smp.cosine_similarity).fit(test)
distances, indices = nbrs.kneighbors(test)
x = 21
for idx,d in
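Not from the question itself, but the usual source of confusion here is that NearestNeighbors expects a distance, not a similarity, so passing cosine_similarity as the metric inverts the ordering. A sketch of the conventional usage, where cosine distance is 1 minus cosine similarity:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

test = np.random.randint(0, 5, (50, 50))

# metric='cosine' means cosine *distance* = 1 - cosine similarity
nbrs = NearestNeighbors(n_neighbors=5, metric='cosine').fit(test)
distances, indices = nbrs.kneighbors(test)

x = 21
for idx, dist in zip(indices[x], distances[x]):
    sim = cosine_similarity(test[x].reshape(1, -1), test[idx].reshape(1, -1))[0, 0]
    print(idx, round(float(dist), 4), round(1.0 - sim, 4))   # the two values should match up to floating point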

Mahout rowSimilarity

Submitted by 岁酱吖の on 2019-12-04 04:47:52
Question: I am trying to compute row similarity between Wikipedia documents. I have the tf-idf vectors in the format Key class: class org.apache.hadoop.io.Text, Value class: class org.apache.mahout.math.VectorWritable. I am following the quick tour of text analysis from here: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

I created a Mahout matrix as follows:

mahout rowid \
  -i wikipedia-vectors/tfidf-vectors/part-r-00000 -o wikipedia-matrix

I
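Outside of the Mahout pipeline itself, the quantity being computed here for the cosine measure, pairwise row similarity of a TF-IDF matrix, can be expressed as a sparse matrix product of L2-normalized rows. A small Python sketch of that idea (an illustration only, not the Mahout implementation):

import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

# toy stand-in for a TF-IDF document-term matrix (rows = documents)
tfidf = sparse.random(6, 20, density=0.3, format='csr', random_state=0)

rows = normalize(tfidf, norm='l2', axis=1)   # unit-length rows
row_sims = rows @ rows.T                     # (6, 6) cosine similarities, still sparse
print(np.round(row_sims.toarray(), 3))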

Cosine Similarity

Submitted by 走远了吗. on 2019-12-04 01:54:41
Question: I was reading and came across the formula for cosine similarity, cos(u, v) = (u · v) / (||u|| ||v||). I thought this looked interesting, and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix:

M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]]

Here the entries inside the matrix are the ratings that person u has given to item i, based on row u and column i. I want to calculate this cosine similarity for this matrix between items (rows). This should yield a 5 x 5
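One concrete way to get the 5 x 5 item-item matrix (a sketch using scikit-learn, not necessarily what the original answer used): since the items are the columns of M, compute cosine similarity over the rows of M transposed.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

M = np.array([[2, 3, 4, 1, 0],
              [0, 0, 0, 0, 5],
              [5, 4, 3, 0, 0],
              [1, 1, 1, 1, 1]])

# items are the columns of M, so compare the rows of M.T
item_sims = cosine_similarity(M.T)
print(item_sims.shape)           # (5, 5)
print(np.round(item_sims, 3))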
