How to group sets by similarity in contained elements

一曲冷凌霜 提交于 2020-01-06 19:25:32

问题


I am using Python 2.7. I have routes which are composed of arrays of nodes that connect to each other. The nodes are identified by a string key, but for ease I will use numbers:

sample_route = [1,2,3,4,7] 
#obviously over-simplified; real things would be about 20-40 elements long

I will create a set made up of tuple pairs of point connections using zip, which will end up like:

set([(1,2),(2,3),(3,4),(4,7)])

I will need a way to filter out some routes that are very similar (like one or two added nodes) and use the minimal route out of those near-duplicates. My plan right now is to:

Start with the first (probably optimal) route. Iterate through the rest of the routes, and use the following formula to calculate its similarity to the sequence in step 1:

matching = len(value1.difference(value2)) + len(value2.difference(value1))
#value1, value2 = two compared sets

The lower the number, the more similar. But what is a good way to group the routes based on their similarity to other routes? They will all be different lengths. I have never taken a statistics course.

Example:

sets = [
  set([(1,2),(2,3),(3,4),(4,5),(5,10)]),
  set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
  set([(1,7),(7,3),(3,8),(8,7),(7,6),(6,5),(5,10)]),
  set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
  set([(1,9),(9,4),(4,5),(5,10)]),
  set([(1,9),(9,4),(4,6),(6,5),(5,10)])
]

In this example, the groupings may be something like [[1,2,4],[3],[5,6]], where 1, 2, and 4 are very similar, 5 and 6 are similar, and 3 is nowhere near any of the others. 1 to 2 would have a score of 2, and 3 to 6 would have a score of 8, as examples. This is the sort of data I am using (though these are easy to read simplifications).

There is a time benefit in this. If I can remove redundant routes, I will trim off a considerable amount of time.


回答1:


I would recommend looking into the networkx package. It allows you to create directed graphs such as you are describing. For measuring the similarity of 2 routes, I would recommend the Jaccard similarity index. Here is code showing the example you illustrated.

First, import a few libraries: graphs, plotting, and numeric python. Then build the directed graph by adding nodes numbered 1 to 8. Build the connections from node to node to build your path. The networkx package has built-in capability to find all paths in the graph from one node to another: nx.all_simple_paths(g, start_node, end_node).

Once you have all of the paths, you can compute the a matrix, J, of the Jaccard similarity between the paths. How you actually want to cluster the paths by their similarity is up to you.

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

g = nx.DiGraph()
g.add_nodes_from(range(1,8))
g.add_edges_from([(1,2),(2,3),(3,4),(4,7)]) #path 1,2,3,4,7
g.add_edges_from([(4,5),(5,7)])             #path 1,2,3,4,5,7
g.add_edges_from([(4,6),(6,7)])             #path 1,2,3,4,6,7
paths_iter = nx.all_simple_paths(g,1,7)
paths = [p for p in paths]

np.random.seed(100000)
nx.draw_spring(g, with_labels=True)
plt.show()

def jaccard(v1, v2):
    return (len(np.intersect1d(v1,v2))+0.0)/len(np.union1d(v1,v2))

J = np.zeros([len(paths),len(paths)])
for i in range(J.shape[0]):
    for j in range(i, J.shape[1]):
        J[i,j] = J[j,i] = jaccard(paths[i],paths[j])

print J

> [[ 1.          0.71428571  0.83333333]
>  [ 0.71428571  1.          0.83333333]
>  [ 0.83333333  0.83333333  1.        ]]

EDIT Since you have a metric to compare the similarity of the paths to each other, look into k-means clustering to cluster the paths together.

from scipy.cluster.vq import kmeans2

I don't have enough of your code or your data to help anymore from this point.



来源:https://stackoverflow.com/questions/30813923/how-to-group-sets-by-similarity-in-contained-elements

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!