Calculate weighted pairwise distance matrix in Python

白昼怎懂夜的黑 submitted on 2021-02-06 20:01:48

Question


I am trying to find the fastest way to perform the following pairwise distance calculation in Python. I want to use the distances to rank a list_of_objects by their similarity.

Each item in the list_of_objects is characterised by four measurements a, b, c, d, which are made on very different scales e.g.:

object_1 = [0.2, 4.5, 198, 0.003]
object_2 = [0.3, 2.0, 999, 0.001]
object_3 = [0.1, 9.2, 321, 0.023]
list_of_objects = [object_1, object_2, object_3]

The aim is to get a pairwise distance matrix of the objects in list_of_objects. However, I want to be able to specify the 'relative importance' of each measurement in my distance calculation via a weights vector with one weight per measurement, e.g.:

weights = [1, 1, 1, 1]

would indicate that all measurements are equally weighted. In this case I want each measurement to contribute equally to the distance between objects, regardless of the measurement scale. Alternatively:

weights = [1, 1, 1, 10]

would indicate that I want measurement d to contribute 10x more than the other measurements to the distance between objects.

My current algorithm looks like this:

  1. Calculate a pairwise distance matrix for each measurement
  2. Normalise each distance matrix so that the maximum is 1
  3. Multiply each distance matrix by the appropriate weight from weights
  4. Sum the distance matrices to generate a single pairwise matrix
  5. Use the matrix from 4 to provide a ranked list of pairs of objects from list_of_objects
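
The five steps above can be sketched as follows. This is a minimal illustration rather than the original implementation: the per-measurement loop, the use of pdist on single columns, and the names total and ranked_pairs are assumptions, and it assumes no measurement is constant across all objects (otherwise the maximum distance is zero).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

list_of_objects = [
    [0.2, 4.5, 198, 0.003],
    [0.3, 2.0, 999, 0.001],
    [0.1, 9.2, 321, 0.023],
]
weights = [1, 1, 1, 1]

X = np.asarray(list_of_objects, dtype=float)
n = X.shape[0]

# Steps 1-4: per-measurement distances, normalised to a maximum of 1,
# multiplied by the weight, and summed (kept in condensed pair form).
total = np.zeros(n * (n - 1) // 2)
for k, w in enumerate(weights):
    d_k = pdist(X[:, [k]], 'cityblock')  # |x_ik - x_jk| for every pair i < j
    total += w * d_k / d_k.max()         # assumes column k is not constant

distance_matrix = squareform(total)

# Step 5: rank the pairs (i < j) from most similar to least similar.
i_idx, j_idx = np.triu_indices(n, k=1)   # same pair order as pdist
order = np.argsort(total)
ranked_pairs = [(int(i_idx[o]), int(j_idx[o]), float(total[o])) for o in order]

print(distance_matrix)
print(ranked_pairs)
```

Keeping the distances in condensed form until the end avoids storing each pair twice.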

This works fine, and gives me a weighted version of the city-block distance between objects.

I have two questions:

  1. Without changing the algorithm, what's the fastest implementation in SciPy, NumPy or scikit-learn to perform the initial distance matrix calculations?

  2. Is there an existing multi-dimensional distance approach that does all of this for me?

For question 2, I have looked, but couldn't find anything with a built-in step that handles the 'relative importance' in the way that I want.

Other suggestions welcome. Happy to clarify if I've missed details.


Answer 1:


scipy.spatial.distance is the module you'll want to have a look at. It provides many distance metrics that are easy to apply.

I'd recommend using the weighted Minkowski metric.

You can calculate pairwise distances with the pdist function from this module.

E.g.

import numpy as np
from scipy.spatial.distance import pdist, squareform

object_1 = [0.2, 4.5, 198, 0.003]
object_2 = [0.3, 2.0, 999, 0.001]
object_3 = [0.1, 9.2, 321, 0.023]
list_of_objects = [object_1, object_2, object_3]

# make a 3x4 array from the list of objects
X = np.array(list_of_objects)

# calculate pairwise distances, using the weighted Minkowski norm with p=2.
# wminkowski was removed in SciPy 1.8; 'minkowski' with w=weights**p
# gives the same result as the old wminkowski(..., p, weights).
weights = np.array([1.0, 1.0, 1.0, 10.0])
distances = pdist(X, 'minkowski', p=2, w=weights**2)

# make a square matrix from the result
distances_as_2d_matrix = squareform(distances)

print(distances)
print(distances_as_2d_matrix)

This will print

[ 801.00390786  123.0899671   678.0382942 ]
[[   0.          801.00390786  123.0899671 ]
 [ 801.00390786    0.          678.0382942 ]
 [ 123.0899671   678.0382942     0.        ]]
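
The normalisation step from the question can also be folded into a single pdist call. The sketch below is an assumption about how one might do that, not part of the answer above: the largest pairwise difference within a measurement equals that measurement's range (max minus min), so scaling each column by weight / range and taking the city-block distance reproduces the question's algorithm. The name scale is illustrative, and the code assumes no column is constant.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([
    [0.2, 4.5, 198, 0.003],
    [0.3, 2.0, 999, 0.001],
    [0.1, 9.2, 321, 0.023],
])
weights = np.array([1.0, 1.0, 1.0, 1.0])

# The largest pairwise |difference| in a column is its range (max - min).
# Assumes no column is constant; a zero range would divide by zero.
column_range = X.max(axis=0) - X.min(axis=0)
scale = weights / column_range

# City-block distance on the rescaled columns: sum_k w_k * |x_ik - x_jk| / range_k
distances = squareform(pdist(X * scale, 'cityblock'))
print(distances)
```

Changing weights to [1, 1, 1, 10] makes the fourth measurement contribute ten times as much as each of the others.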



Answer 2:


The normalization step, where you divide pairwise distances by the max value, seems non-standard, and may make it hard to find a ready-made function that will do exactly what you are after. It is pretty easy though to do it yourself. A starting point is to turn your list_of_objects into an array:

>>> import numpy as np
>>> obj_arr = np.array(list_of_objects)
>>> obj_arr.shape
(3, 4)

You can then get the pairwise distances using broadcasting. This is a little inefficient, because it does not take advantage of the symmetry of your metric and calculates every distance twice:

>>> dists = np.abs(obj_arr - obj_arr[:, None])
>>> dists.shape
(3, 3, 4)
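
If the duplicated work matters, one possible way to compute each pair only once is to index the rows with np.triu_indices. This is a sketch under that assumption; the names i_idx, j_idx and pair_dists are illustrative.

```python
import numpy as np

obj_arr = np.array([
    [0.2, 4.5, 198, 0.003],
    [0.3, 2.0, 999, 0.001],
    [0.1, 9.2, 321, 0.023],
])

# Index pairs (i, j) with i < j, so each unordered pair appears exactly once.
i_idx, j_idx = np.triu_indices(len(obj_arr), k=1)

# One row per pair and one column per measurement:
# shape (n_pairs, n_measurements) instead of (n, n, n_measurements).
pair_dists = np.abs(obj_arr[i_idx] - obj_arr[j_idx])
print(pair_dists.shape)
```

The later steps carry over with axis=0 in place of axis=(0, 1), and scipy.spatial.distance.squareform can expand a per-pair vector back into a full symmetric matrix.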

Normalizing is very easy to do:

>>> dists /= dists.max(axis=(0, 1))

Your final weighting can be done in a variety of ways. You may want to benchmark which is fastest:

>>> dists.dot([1, 1, 1, 1])
array([[ 0.        ,  1.93813131,  2.21542674],
       [ 1.93813131,  0.        ,  3.84644195],
       [ 2.21542674,  3.84644195,  0.        ]])
>>> np.einsum('ijk,k->ij', dists, [1, 1, 1, 1])
array([[ 0.        ,  1.93813131,  2.21542674],
       [ 1.93813131,  0.        ,  3.84644195],
       [ 2.21542674,  3.84644195,  0.        ]])
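
A rough way to run that benchmark is sketched below with timeit. The synthetic data, the array size of 200 objects, and the third candidate (multiply then sum) are assumptions added for illustration; timings depend on the machine and on the array size, so no numbers are quoted here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
obj_arr = rng.random((200, 4))            # synthetic data, for timing only
weights = np.array([1.0, 1.0, 1.0, 10.0])

# Same preparation as above: pairwise differences, normalised per measurement.
dists = np.abs(obj_arr - obj_arr[:, None])
dists /= dists.max(axis=(0, 1))

candidates = {
    'dot': lambda: dists.dot(weights),
    'einsum': lambda: np.einsum('ijk,k->ij', dists, weights),
    'multiply and sum': lambda: (dists * weights).sum(axis=-1),
}

repeats = 100
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=repeats)
    print(f'{name}: {seconds / repeats * 1e6:.1f} microseconds per call')
```

All three candidates return the same matrix, so the choice only affects speed and memory use.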


Source: https://stackoverflow.com/questions/20089007/calculate-weighted-pairwise-distance-matrix-in-python
