Python calculate lots of distances quickly

佛祖请我去吃肉 2021-02-06 08:02

I have an input of 36,742 points, which means that to calculate the lower triangle of a distance matrix (using the Vincenty approximation) I would need to generate 36,742*

4 Answers
  • 2021-02-06 08:42

    "Use some kind of hashing to get a quick rough-cut off (all stores within 100km) and then only calculate accurate distances between those stores." I think this might be better called gridding. First make a dict keyed by grid coordinates and put each shop into the 50km bucket that contains it. Then, when you are calculating distances, you only look in nearby buckets rather than iterating over every shop in the whole universe.

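    The gridding idea above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the original post: the shops are already projected to planar (x, y) coordinates in km, and the names build_grid and close_pairs, as well as the sample shops, are made up for the example.

```python
import math
from collections import defaultdict
from itertools import product

CELL_KM = 50.0  # bucket size, following the 50km suggestion in the answer


def build_grid(shops, cell=CELL_KM):
    """Map each (cell_x, cell_y) bucket to the indices of the shops inside it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(shops):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(idx)
    return grid


def close_pairs(shops, radius=CELL_KM, cell=CELL_KM):
    """Return sorted (i, j, distance) with i < j and distance <= radius."""
    grid = build_grid(shops, cell)
    reach = math.ceil(radius / cell)  # how many buckets away a match can be
    found = []
    for (cx, cy), members in grid.items():
        for dx, dy in product(range(-reach, reach + 1), repeat=2):
            # .get() avoids creating empty buckets while we iterate
            for j in grid.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j:  # report each unordered pair exactly once
                        xi, yi = shops[i]
                        xj, yj = shops[j]
                        d = math.hypot(xi - xj, yi - yj)
                        if d <= radius:
                            found.append((i, j, d))
    return sorted(found)


# Hypothetical shops, planar coordinates in km
shops = [(0.0, 0.0), (30.0, 39.0), (120.0, 0.0), (49.0, 0.0), (51.0, 0.0)]
for i, j, d in close_pairs(shops):
    print(i, j, round(d, 1))
```

    Only shops in the same or adjacent buckets are ever compared, so the cost depends on how many shops are close together rather than on the full number of pairs.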
  • 2021-02-06 08:45

    Have you tried mapping entire arrays and functions instead of iterating through them? An example would be as follows:

    from numpy.random import rand
    
    my_array = rand(int(5e7), 1)  # An array of 50,000,000 random numbers in double.
    

    Now what is normally done is:

    squared_list_iter = [value**2 for value in my_array]
    

    This works, of course, but it is far from optimal.

    The alternative would be to map the array with a function. This is done as follows:

    func = lambda x: x**2  # Here is what I want to do on my array.
    
    squared_list_map = map(func, my_array)  # Here I am doing it!
    

    Now, one might ask how this is any different, or even better, given that we have now added a function call too. Here is your answer:

    For the former solution (via iteration):

    1 loop: 1.11 minutes.
    

    Compared to the latter solution (mapping):

    500 loops, on average 560 ns.
    

    Converting the map() result to a list at the same time, with list(map(func, my_array)), increases the time to approximately 500 ms. In Python 3 map() is lazy, so this conversion is where the squaring is actually carried out.

    You choose!
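
    When reading the timings above, it may help to see when the mapped function actually runs. The following is a minimal sketch, assuming Python 3; the counter dict and the square helper are made up for the illustration and are not part of the answer.

```python
counter = {"calls": 0}  # hypothetical call counter, for illustration only


def square(x):
    """Square x and record that the function was invoked."""
    counter["calls"] += 1
    return x ** 2


data = list(range(1000))

lazy = map(square, data)            # builds an iterator; square() has not run yet
calls_after_map = counter["calls"]

squared = list(lazy)                # consuming the iterator does the real work
calls_after_list = counter["calls"]

print(calls_after_map, calls_after_list)
```

    For NumPy arrays such as my_array, the vectorized expression my_array ** 2 avoids the Python-level loop altogether and is usually the faster route.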

  • 2021-02-06 08:47

    Thanks for everyone's help. I think I have solved this by incorporating all the suggestions.

    I use numpy to import the geographic co-ordinates and then project them using "France Lambert - 93". This lets me fill scipy.spatial.cKDTree with the points and then calculate a sparse_distance_matrix by specifying a cut-off of 50km (my projected points are in metres). I then extract the lower triangle to a CSV.

    import numpy as np
    import csv
    import time
    from pyproj import Proj, transform
    
    #http://epsg.io/2154 (accuracy: 1.0m)
    fr = '+proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 \
    +x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 \
    +units=m +no_defs'
    
    #http://epsg.io/27700-5339 (accuracy: 1.0m)
    uk = '+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 \
    +x_0=400000 +y_0=-100000 +ellps=airy \
    +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs'
    
    path_to_csv = '.../raw_in.csv'
    out_csv = '.../out.csv'
    
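    def proj_arr(points):
        # Proj(init=...) and transform() are deprecated since pyproj 2.x;
        # newer code would use pyproj.Transformer instead.
        inproj = Proj(init='epsg:4326')
        outproj = Proj(uk)  # swap in `fr` (Lambert-93) for points in France
        # points columns are ID|lat|lon; transform() expects (lon, lat)
        func = lambda x: transform(inproj,outproj,x[2],x[1])
        return np.array(list(map(func, points)))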
    
    tstart = time.time()
    
    # Import points as geographic coordinates
    # ID|lat|lon
    #Sample to try and replicate
    #points = np.array([
    #        [39007,46.585012,5.5857829],
    #        [88086,48.192370,6.7296289],
    #        [62627,50.309155,3.0218611],
    #        [14020,49.133972,-0.15851507],
    #        [1091, 42.981765,2.0104902]])
    #
    points = np.genfromtxt(path_to_csv,
                           delimiter=',',
                           skip_header=1)
    
    print("Total points: %d" % len(points))
    print("Triangular matrix contains: %d" % (len(points)*((len(points))-1)*0.5))
    # Get projected co-ordinates
    proj_pnts = proj_arr(points)
    
    # Fill k-d tree
    from scipy.spatial import cKDTree
    tree = cKDTree(proj_pnts)
    cut_off_metres = 1600  # metres; use 50000 for the 50km cut-off described above
    tree_dist = tree.sparse_distance_matrix(tree,
                                            max_distance=cut_off_metres,
                                            p=2) 
    
    # Extract the strictly lower triangle (drops the zero main diagonal)
    from scipy import sparse
    udist = sparse.tril(tree_dist, k=-1)
    print("Distances after k-d tree cut-off: %d " % len(udist.data))
    
    # Export CSV (csv is already imported at the top)
    f = open(out_csv, 'w', newline='') 
    w = csv.writer(f, delimiter=",", )
    w.writerow(['id_a','lat_a','lon_a','id_b','lat_b','lon_b','metres'])
    w.writerows(np.column_stack((points[udist.row ],
                                 points[udist.col],
                                 udist.data)))
    f.close()
    
    """
    Get ID labels
    """
    id_to_csv = '...id.csv'
    id_labels = np.genfromtxt(id_to_csv,
                           delimiter=',',
                           skip_header=1,
                           dtype='U')
    
    """
    Try vincenty on the un-projected co-ordinates
    """
    # vincenty was removed in geopy 2.0; geopy.distance.geodesic replaces it there
    from geopy.distance import vincenty
    vout_csv = '.../out_vin.csv'
    test_vin = np.column_stack((points[udist.row].T[1:3].T,
                                points[udist.col].T[1:3].T))
    
    func = lambda x: vincenty(x[0:2],x[2:4]).m
    output = list(map(func,test_vin))
    
    # Export CSV
    f = open(vout_csv, 'w', newline='')
    w = csv.writer(f, delimiter=",", )
    w.writerow(['id_a','id_a2', 'lat_a','lon_a',
                'id_b','id_b2', 'lat_b','lon_b',
                'proj_metres','vincenty_metres'])
    w.writerows(np.column_stack((list(id_labels[udist.row]),
                                 points[udist.row ],
                                 list(id_labels[udist.col]),
                                 points[udist.col],
                                 udist.data,
                                 output,
                                 )))
    
    f.close()    
    print("Finished in %.0f seconds" % (time.time()-tstart))
    

    This approach took 164 seconds to generate (for 5,306,434 distances) - compared to 9 - and also around 90 seconds to save to disk.

    I then compared the Vincenty distance with the hypotenuse (Euclidean) distance on the projected co-ordinates.

    The mean difference was 2.7 metres and the mean relative difference (difference divided by distance) was 0.0073%, which looks great.
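
    A comparison like the one above can be computed along these lines. This is a sketch under stated assumptions, not the exact calculation used for the figures quoted: it takes absolute differences and divides by the reference (Vincenty) distance, and the function name difference_stats and the sample values are made up.

```python
def difference_stats(projected_m, reference_m):
    """Return (mean absolute difference in metres, mean relative difference in %)."""
    if not projected_m or len(projected_m) != len(reference_m):
        raise ValueError("need two non-empty sequences of equal length")
    abs_diffs = [abs(p - r) for p, r in zip(projected_m, reference_m)]
    # Assumption: the relative difference is taken against the reference distance
    rel_diffs = [d / r for d, r in zip(abs_diffs, reference_m)]
    n = len(abs_diffs)
    return sum(abs_diffs) / n, 100.0 * sum(rel_diffs) / n


# Made-up distances in metres, purely for illustration
projected = [100.0, 200.0]
reference = [101.0, 198.0]

mean_abs, mean_rel_pct = difference_stats(projected, reference)
print(mean_abs, mean_rel_pct)
```

    With the real output, projected would be the udist.data column and reference the list of Vincenty distances.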

  • 2021-02-06 08:59

    This sounds like a classic use case for k-D trees.

    If you first transform your points into Euclidean space then you can use the query_pairs method of scipy.spatial.cKDTree:

    from scipy.spatial import cKDTree
    
    tree = cKDTree(data)
    # where data is (nshops, ndim) containing the Euclidean coordinates of each shop
    # in units of km
    
    pairs = tree.query_pairs(50, p=2)   # 50km radius, L2 (Euclidean) norm
    

    pairs will be a set of (i, j) tuples, with i < j, corresponding to the row indices of pairs of shops that are ≤50km from each other.


    The output of tree.sparse_distance_matrix is a scipy.sparse.dok_matrix. Since the matrix will be symmetric and you're only interested in unique row/column pairs, you could use scipy.sparse.tril to zero out the upper triangle, giving you a scipy.sparse.coo_matrix. From there you can access the nonzero row and column indices and their corresponding distance values via the .row, .col and .data attributes:

    from scipy import sparse
    
    tree_dist = tree.sparse_distance_matrix(tree, max_distance=10000, p=2)
    udist = sparse.tril(tree_dist, k=-1)    # keep only entries below the main diagonal
    ridx = udist.row    # row indices
    cidx = udist.col    # column indices
    dist = udist.data   # distance values
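
    To see how the two approaches relate, here is a small self-contained sketch on synthetic data. The random points, the 10km radius and the variable names are assumptions made for the illustration; it checks that the lower triangle of the sparse distance matrix describes the same pairs that query_pairs returns.

```python
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree

# Hypothetical shops: 200 random points in a 100km x 100km square (units: km)
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 100.0, size=(200, 2))
radius = 10.0

tree = cKDTree(data)

# Approach 1: index pairs only, as a set of (i, j) tuples with i < j
pairs = tree.query_pairs(radius, p=2)

# Approach 2: sparse matrix of distances, reduced to its strictly lower triangle
tree_dist = tree.sparse_distance_matrix(tree, max_distance=radius, p=2)
udist = sparse.tril(tree_dist, k=-1)

# In the lower triangle row > col, so flip to (col, row) to match query_pairs
pairs_from_matrix = {(int(c), int(r)) for r, c in zip(udist.row, udist.col)}

# Recompute the stored distances directly as a sanity check
expected = np.linalg.norm(data[udist.row] - data[udist.col], axis=1)

print(len(pairs), len(pairs_from_matrix))
```

    query_pairs is the lighter option when only the indices are needed; the sparse matrix is convenient when the distances themselves have to be exported.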
    