How do I cluster a list of geographic points by distance?

天命终不由人 2021-01-02 21:29

I have a list of points P=[p1,...pN] where pi=(latitudeI,longitudeI).

Using Python 3, I would like to find a smallest set of clusters (disjoint subsets of P) such th

3 answers
  •  野趣味 (OP)
    2021-01-02 22:13

    This might be a start. The algorithm runs k-means on the points for each k from 2 up to the number of points, validating each solution along the way. You should pick the lowest k that validates.

    It works by clustering the points and then checking that each cluster obeys the constraint. If any cluster is not compliant, the solution is labeled False and we move on to the next number of clusters.

    Because the k-means implementation in sklearn can get stuck in local minima, it is still to be established whether the solution found this way is the best one, but it could be.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.spatial.distance import cdist
    import math
    
    points = np.array([[33.    , 41.    ],
           [33.9693, 41.3923],
           [33.6074, 41.277 ],
           [34.4823, 41.919 ],
           [34.3702, 41.1424],
           [34.3931, 41.078 ],
           [34.2377, 41.0576],
           [34.2395, 41.0211],
           [34.4443, 41.3499],
           [34.3812, 40.9793]])
    
    
    def distance(origin, destination): #found here https://gist.github.com/rochacbruno/2883505
        lat1, lon1 = origin[0],origin[1]
        lat2, lon2 = destination[0],destination[1]
        radius = 6371 # km
        dlat = math.radians(lat2-lat1)
        dlon = math.radians(lon2-lon1)
        a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
            * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
        d = radius * c
    
        return d
    
    def create_clusters(number_of_clusters,points):
        kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(points)
        l_array = np.array([[label] for label in kmeans.labels_])
        clusters = np.append(points,l_array,axis=1)
        return clusters
    
    def validate_solution(max_dist,clusters):
        # Column 2 holds the k-means label; labels run from 0 to n_clust inclusive,
        # so n_clust + 1 clusters have to be checked.
        _, __, n_clust = clusters.max(axis=0)
        n_clust = int(n_clust)
        for i in range(n_clust + 1):
            # Keep only the (lat, lon) columns of the points labeled i.
            two_d_cluster=clusters[clusters[:,2] == i][:,:2]
            if not validate_cluster(max_dist,two_d_cluster):
                return False
        return True
    
    def validate_cluster(max_dist,cluster):
        # Pairwise great-circle distances in km. Compare the unrounded values,
        # so that e.g. 20.4 km is not accepted against a 20 km limit.
        distances = cdist(cluster,cluster, lambda ori,des: distance(ori,des))
        print(np.round(distances))
        print(30*'-')
        return bool((distances <= max_dist).all())
    
    if __name__ == '__main__':
        for i in range(2,len(points)):
            print(i)
            print(validate_solution(20,create_clusters(i,points)))
    
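Since the loop above prints a result for every k, picking the lowest one by hand is left to the reader. A minimal sketch of stopping at the first k that validates could look like this. The name `smallest_valid_k` and its parameters `cluster_fn` and `is_valid` are hypothetical; they are meant as stand-ins for `create_clusters` and `validate_solution`, and the search starts at 2 only because the loop above does.

```python
def smallest_valid_k(points, max_dist, cluster_fn, is_valid, k_min=2):
    """Return the first k (ascending) whose clustering passes is_valid, else None.

    cluster_fn(k, points) is assumed to return a clustering for k clusters;
    is_valid(max_dist, clustering) is assumed to return True or False.
    """
    for k in range(k_min, len(points) + 1):
        if is_valid(max_dist, cluster_fn(k, points)):
            return k
    return None
```

With the functions from the answer this would presumably be called as `smallest_valid_k(points, 20, create_clusters, validate_solution)`. Because k-means can land in a local minimum, the k returned is an upper bound on the true minimum rather than a proof of it.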

    Once a benchmark is established, one would have to look more closely at each cluster to establish whether its points could be distributed to the others without violating the distance constraint.
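The redistribution idea could be sketched roughly as follows. This is only a greedy illustration, not part of the code above: `haversine_km`, `fits` and `try_dissolve` are hypothetical names, and the clusters are assumed to be held in a dict mapping a label to a list of (lat, lon) tuples in degrees.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def fits(point, members, max_dist):
    """True if point is within max_dist km of every point in members."""
    return all(haversine_km(point, m) <= max_dist for m in members)

def try_dissolve(clusters, label, max_dist):
    """Try to move every point of clusters[label] into some other cluster.

    Returns a new dict without `label` on success, or None if some point
    cannot be placed. The search is greedy (first cluster that fits wins),
    so None does not prove that no valid redistribution exists.
    """
    others = {k: list(v) for k, v in clusters.items() if k != label}
    for point in clusters[label]:
        target = next((k for k, v in others.items()
                       if fits(point, v, max_dist)), None)
        if target is None:
            return None
        others[target].append(point)
    return others
```

Calling `try_dissolve` once per label and keeping any non-None result lowers the cluster count by one each time it succeeds.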

    You can replace the lambda function in cdist with whatever distance metric you choose; I found the great-circle distance in the gist I mentioned.
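As an illustration of swapping the metric, here is a sketch of an equirectangular approximation. It is not taken from the gist; `equirectangular_km` is a hypothetical name, and the approximation is only reasonable for short distances away from the poles, which is an assumption about the data.

```python
import math

EARTH_RADIUS_KM = 6371

def equirectangular_km(origin, destination):
    """Approximate distance in km between two (lat, lon) points in degrees.

    Treats the patch of the sphere between the two points as flat, scaling
    the longitude difference by the cosine of the mean latitude.
    """
    lat1, lon1 = math.radians(origin[0]), math.radians(origin[1])
    lat2, lon2 = math.radians(destination[0]), math.radians(destination[1])
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return EARTH_RADIUS_KM * math.hypot(x, y)
```

Since it takes two points and returns a number, it can be passed to cdist in place of the lambda, e.g. `cdist(cluster, cluster, equirectangular_km)`.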
