Pairwise comparisons within a dataset

Question

My data is 18 vectors, each with up to 200 numbers, though some have only 5 or some other count. They are organised as:

[2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824]
[2, 752, 753, 808, 843]
[2, 752, 753, 843]
[2, 752, 753, 808, 843]
[3, 36, 37, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, ...]

I would like to find the pair that is the most similar in this group of lists. The numbers themselves are not important; they may as well be strings. A 2 in one list and a 3 in another list are not comparable.

I am only checking whether the values are the same. For example, the second list is exactly the same as the 4th list, but differs from list 3 by only one value.

Additionally, it would be nice to find the most similar triplet, or the n lists that are most similar, but pairwise is the first and most important task.

I hope I have laid out this problem clearly enough, but I am very happy to supply any more information that anyone might need!

I have a feeling it involves numpy or scipy norm/cosine calculations, but I can't quite work out how to do it, or whether this is the best method.

Any help would be greatly appreciated!

Answer 1:

You can use itertools to generate your pairwise comparisons. If you just want the items which are shared between two lists you can use a set intersection. Using your example:

import itertools

a = [2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824]
b = [2, 752, 753, 808, 843]
c = [2, 752, 753, 843]
d = [2, 752, 753, 808, 843]
e = [3, 36, 37, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112]

data = [a, b, c, d, e]

def number_same(a, b):
    # Find the items which are the same
    return set(a).intersection(set(b))

# combinations() yields each unordered pair of indexes once;
# range(len(data)) covers every list, including the last one
for i, j in itertools.combinations(range(len(data)), 2):
    print("Indexes: %s %d" % ((i, j), len(number_same(data[i], data[j]))))

>>>Indexes: (0, 1) 1
Indexes: (0, 2) 1
Indexes: (0, 3) 1
Indexes: (0, 4) 1
Indexes: (1, 2) 4
Indexes: (1, 3) 5  ... etc

This gives the number of items shared between each pair of lists. You could use this information to decide which two lists form the best pair.
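One way to turn the shared-item counts into a ranking is sketched below. It is not part of the original answer: it assumes that Jaccard similarity (shared items divided by all distinct items) is an acceptable measure, so that two short identical lists score higher than two long lists that merely overlap. The name most_similar_group is hypothetical, and the same function is reused for triplets or larger groups by changing n.

```python
import itertools

def most_similar_group(lists, n=2):
    """Return (score, indexes) for the n lists with the highest Jaccard score.

    Assumption: similarity = items common to all n lists / distinct items in any of them.
    """
    def score(indexes):
        sets = [set(lists[i]) for i in indexes]
        union = set.union(*sets)
        # Treat a group of empty lists as identical
        return len(set.intersection(*sets)) / float(len(union)) if union else 1.0

    # Brute force over every group of n indexes; ties go to the highest index tuple
    return max(
        (score(indexes), indexes)
        for indexes in itertools.combinations(range(len(lists)), n)
    )

data = [
    [2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824],
    [2, 752, 753, 808, 843],
    [2, 752, 753, 843],
    [2, 752, 753, 808, 843],
]

print("Best pair:    %.2f %s" % most_similar_group(data, 2))
print("Best triplet: %.2f %s" % most_similar_group(data, 3))
```

With 18 lists there are only 153 pairs and 816 triplets, so an exhaustive search like this should be fast enough; larger n or many more lists would need a smarter approach.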

Source: https://stackoverflow.com/questions/40128515/pairwise-comparisons-within-a-dataset

Tags

python

python-2.7

numpy

scipy

cosine-similarity