similarity

Cosine similarity yields 'nan' values

回眸只為那壹抹淺笑 posted on 2020-01-04 07:27:43
Question: I was calculating a cosine similarity matrix for sparse vectors, and elements that should have been floats came out as 'nan'. 'visits' is a sparse matrix showing how many times each user has visited each website. The matrix originally had shape 1 500 000 x 1500, but I converted it into a sparse matrix using coo_matrix().tocsc(). The task is to find out how similar the websites are, so I decided to calculate the cosine metric between each pair of sites. Here is my code: cosine_distance
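One common source of 'nan' in a cosine computation is a vector whose norm is zero (for example, a website that no user has visited), because the formula then divides 0 by 0. The excerpt is cut off before the code, so whether that is the cause here is an assumption. A minimal pure-Python sketch follows; the `{index: value}` dict representation and the choice to return 0.0 for zero vectors are illustrative conventions, not something taken from the question.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the shorter dict; only shared indices contribute to the dot product.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Guard against 0/0, which is where a 'nan' would otherwise come from.
        return 0.0
    return dot / (norm_u * norm_v)
```

With a real sparse matrix, the equivalent check would be to look for all-zero columns before dividing by the column norms.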

How to calculate the similarity of two line drawing images in Swift

房东的猫 posted on 2020-01-03 08:24:09
Question: We need to compare two hand-drawn images. The images are drawn in SpriteKit. We need to see whether the two pictures roughly match. For example, if someone draws a smile, we need to check whether a redrawn smile looks like the first one, and to calculate a percentage for how similar they are. Please suggest some solutions. Thanks in advance. Answer 1: You could try drawing each of the
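The answer above is cut off, so the following is not necessarily what it goes on to suggest. As one simple way to get a similarity percentage, two drawings can be rasterized onto the same grid and compared by the overlap of their inked cells (intersection over union). The sketch is in Python rather than Swift, and the text-grid input with `#` marking ink is a hypothetical stand-in for a rendered bitmap.

```python
def ink_cells(grid):
    """Return the set of (row, col) cells that contain ink ('#') in a text grid."""
    return {(r, c) for r, row in enumerate(grid) for c, ch in enumerate(row) if ch == "#"}

def overlap_percentage(grid_a, grid_b):
    """Intersection-over-union of the inked cells, expressed as a percentage."""
    a, b = ink_cells(grid_a), ink_cells(grid_b)
    union = a | b
    if not union:
        return 100.0  # two blank drawings are treated as identical
    return 100.0 * len(a & b) / len(union)

first = ["#..", ".#.", "..#"]
second = ["#..", ".#.", "..."]
score = overlap_percentage(first, second)  # 2 shared cells out of 3 inked cells overall
```

This measure is sensitive to position and scale, so real drawings would likely need to be centered and resized to a common bounding box first.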

How to calculate the similarity for all the rows in a table in R?

删除回忆录丶 posted on 2020-01-03 02:22:14
Question: I would like to calculate the similarity (a numerical measure of how alike two data objects are; in this case, how alike two rows are) between the rows of a table. The table looks like this:
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
vhigh,vhigh,2,2,big,low,unacc
vhigh,vhigh,2,2,big,med,unacc
vhigh,vhigh,2,2,big,high,unacc
I tried many different ways on the
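For categorical rows like these, one straightforward measure is the simple matching coefficient: the fraction of columns in which two rows hold the same value. The question does not name a measure, so this choice is an assumption, and the sketch is in Python rather than R. It uses three of the rows shown above.

```python
from itertools import combinations

def simple_matching(row_a, row_b):
    """Fraction of positions at which two equal-length rows hold the same value."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of columns")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y)
    return matches / len(row_a)

rows = [line.split(",") for line in (
    "vhigh,vhigh,2,2,small,low,unacc",
    "vhigh,vhigh,2,2,small,med,unacc",
    "vhigh,vhigh,2,2,med,med,unacc",
)]

# Similarity for every unordered pair of rows, keyed by (row index, row index).
pairwise = {
    (i, j): simple_matching(rows[i], rows[j])
    for i, j in combinations(range(len(rows)), 2)
}
```

Rows 0 and 1 differ in one of seven columns, so their similarity is 6/7; rows 0 and 2 differ in two columns, giving 5/7.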

Hash function that hashes similar strings in the same bucket

拜拜、爱过 posted on 2020-01-02 14:12:02
Question: I'm searching for a "bad" hash function: I'd like to hash strings and put similar strings in one bucket. Can you give me a hint where to start my research? Some methods or algorithm names... Answer 1: Your problem is not an easy one. Two ideas: This solution might be overly complicated, but you could try a Fourier transform. Treat your input text as a series of samples of a function and then run a Fourier transform to convert your input to the frequency domain. The low frequency part is the general
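As a concrete example of a hash that deliberately collides on similar strings, here is a sketch of Soundex, a classic phonetic code for English names. It is a different technique from the Fourier idea in the answer and is named here only as one starting point; it groups words that sound alike, not strings that are similar in general.

```python
from collections import defaultdict

# Map each consonant to its Soundex digit; vowels, h, w and y have no code.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate two equal codes
            prev = digit
    return (code + "000")[:4]

def bucket(words):
    """Group words by Soundex code, so similar-sounding words share a bucket."""
    buckets = defaultdict(list)
    for word in words:
        buckets[soundex(word)].append(word)
    return dict(buckets)
```

For instance, `bucket(["Robert", "Rupert", "Smith", "Smyth"])` places the first two words under `R163` and the last two under `S530`.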

Similarity matrix -> feature vectors algorithm?

余生长醉 posted on 2020-01-02 03:56:05
Question: If we have a set of M words and know the similarity in meaning of each pair of words in advance (we have an M x M matrix of similarities), which algorithm can we use to make one k-dimensional bit vector for each word, so that each pair of words can be compared just by comparing their vectors (e.g. by taking the absolute difference of the vectors)? I don't know what this particular problem is called. If I knew, it would be much easier to find among a bunch of algorithms with similar descriptions,

Matching two series of MFCC coefficients

时光毁灭记忆、已成空白 posted on 2020-01-01 19:04:18
Question: I have extracted two series of MFCC coefficients from two audio files of about 30 seconds each that contain the same speech content. The audio files were recorded at the same location from different sources. I need to estimate whether the audio contains the same conversation or a different one. So far I have tested a correlation calculation on the two MFCC series, but the result is not very reasonable. Are there best practices for this scenario? Answer 1: I had the same problem and the
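The answer above is truncated, so the following is not necessarily what it recommends. A technique often used to compare two feature sequences that may be shifted or stretched in time is dynamic time warping (DTW). A minimal sketch, assuming each series is a list of equal-length coefficient vectors (one vector per frame):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two feature vectors of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

A sequence and a time-stretched copy of it align at zero cost, which plain frame-by-frame correlation does not capture. This version is quadratic in the number of frames, so long recordings may need a windowed variant.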

PySpark: calculate a custom distance between all vectors in an RDD

限于喜欢 posted on 2020-01-01 16:45:32
Question: I have an RDD consisting of dense vectors that contain probability distributions, like below: [DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]), DenseVector([0.2252, 0.0422, 0.0864, 0.0441, 0.0592, 0.0439, 0.0433, 0.071, 0.1644, 0.0405, 0.0581, 0.0528, 0.0691]), DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]), DenseVector([0.0924, 0.0699, 0.083, 0.0706, 0.0766,
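The excerpt does not say which custom distance is wanted. As an illustration only, the sketch below uses the Jensen-Shannon divergence as a stand-in, since the vectors are probability distributions, and computes it for every pair of vectors in plain Python (no Spark), using the first three vectors from the question.

```python
import math
from itertools import combinations

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric, and zero for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

vectors = [
    [0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773],
    [0.2252, 0.0422, 0.0864, 0.0441, 0.0592, 0.0439, 0.0433, 0.071, 0.1644, 0.0405, 0.0581, 0.0528, 0.0691],
    [0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773],
]

# Distance for every unordered pair, keyed by the two vector indices.
distances = {
    (i, j): jensen_shannon(vectors[i], vectors[j])
    for i, j in combinations(range(len(vectors)), 2)
}
```

On an actual RDD, one possible route is to index the vectors with `zipWithIndex()`, pair them with `cartesian`, and keep only the pairs with i < j before applying the distance function; that part is not exercised here.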

Strategies for finding duplicate mailing addresses

人盡茶涼 posted on 2020-01-01 03:28:06
Question: I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:
addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'
addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'
I'm planning to apply some string transformations: abbreviate long words (like NORTH -> N) and remove all spaces, commas, dashes and pound symbols. Now, having this output, how can I compare addr_3 with the rest of the addresses and detect
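The normalization described in the question can be sketched as follows, with `difflib.SequenceMatcher` from the Python standard library supplying a similarity score between the normalized strings. The abbreviation table is a small hypothetical sample, and using `SequenceMatcher` is an assumption rather than something the question proposes.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample of long-form words and their abbreviations.
ABBREVIATIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
                 "AVENUE": "AVE", "STREET": "ST"}

def normalize(address):
    """Uppercase, drop '#', ',' and '-', abbreviate known words, remove spaces."""
    cleaned = re.sub(r"[#,\-]", " ", address.upper())
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    return "".join(tokens)

def similarity(addr_a, addr_b):
    """Similarity in [0, 1] between two addresses after normalization."""
    return SequenceMatcher(None, normalize(addr_a), normalize(addr_b)).ratio()

addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'
addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'
```

Here `addr_1` and `addr_2` normalize to the same string, while `addr_3` and `addr_4` still differ by the ordinal suffix `TH` and so score high but below 1. A duplicate would then be flagged when the score exceeds a threshold chosen for the data.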

How do I determine the longest similar portion of several strings?

别说谁变了你拦得住时间么 posted on 2020-01-01 02:47:17
Question: As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings. Example:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/ because that's the longest portion that's common to all 3 strings. Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would
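Judging by the expected result, the question is asking for the longest common prefix rather than the longest common substring; that reading is an assumption. A minimal sketch in Python (not Perl), comparing characters position by position:

```python
def longest_common_prefix(strings):
    """Return the longest prefix shared by every string in the list."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for i, ch in enumerate(shortest):
        # Stop at the first position where any string disagrees.
        if any(s[i] != ch for s in strings):
            return shortest[:i]
    return shortest

paths = [
    "file:///home/gms8994/Music/t.A.T.u./",
    "file:///home/gms8994/Music/nina%20sky/",
    "file:///home/gms8994/Music/A%20Perfect%20Circle/",
]
prefix = longest_common_prefix(paths)  # "file:///home/gms8994/Music/"
```

Python's standard library offers the same character-wise behavior in `os.path.commonprefix`. Because the comparison is per character, the result can end in the middle of a directory name when two names start alike.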