Question
My dataset has 500,000+ rows. I need the Hausdorff distance between every id and every other id, repeated over the whole dataset.
I have a huge dataset; here is a small part of it:
df =
id_easy ordinal latitude longitude epoch day_of_week
0 aaa 1.0 22.0701 2.6685 01-01-11 07:45 Friday
1 aaa 2.0 22.0716 2.6695 01-01-11 07:45 Friday
2 aaa 3.0 22.0722 2.6696 01-01-11 07:46 Friday
3 bbb 1.0 22.1166 2.6898 01-01-11 07:58 Friday
4 bbb 2.0 22.1162 2.6951 01-01-11 07:59 Friday
5 ccc 1.0 22.1166 2.6898 01-01-11 07:58 Friday
6 ccc 2.0 22.1162 2.6951 01-01-11 07:59 Friday
I want to calculate the Hausdorff distance:
import pandas as pd
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import pdist, squareform
u = np.array([(2.6685,22.0701),(2.6695,22.0716),(2.6696,22.0722)]) # coordinates of `id_easy` of taxi `aaa`
v = np.array([(2.6898,22.1166),(2.6951,22.1162)]) # coordinates of `id_easy` of taxi `bbb`
directed_hausdorff(u, v)[0]
Output is 0.05114626086039758
Now I want to calculate this distance for the whole dataset, for all id_easys. The desired output is a matrix with 0 on the diagonal (because the distance between aaa and aaa is 0):
     aaa      bbb      ccc
aaa  0        0.05114  ...
bbb  ...      0
ccc                    0
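For clarity, here is a naive sketch of what I mean, using the imports above (this brute-force version is far too slow for 500,000+ rows, which is exactly the problem):
tracks = {k: g[['longitude', 'latitude']].to_numpy()
          for k, g in df.groupby('id_easy')}   # one (n, 2) array per id
ids = list(tracks)
mat = pd.DataFrame(index=ids, columns=ids, dtype=float)
for a in ids:
    for b in ids:
        # diagonal entries are 0 because directed_hausdorff(u, u) is 0
        mat.loc[a, b] = directed_hausdorff(tracks[a], tracks[b])[0]
print(mat)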
Answer 1:
First I define a method that provides some sample data. It would have been a lot easier if something like that had been provided in the question; in most performance-related problems the size of the real problem is needed to find an optimal solution.
In the following answer I will assume that the average size of an id_easy group is 17 and that there are 30,000 different ids, which results in a dataset of 510,000 rows.
Create sample data
import numpy as np
import numba as nb
N_ids=30_000
av_id_size=17
#create_data (pre sorting according to id assumed)
lat_lon=np.random.rand(N_ids*av_id_size,2)
#create_ids (sorted array with ids)
ids=np.empty(N_ids*av_id_size,dtype=np.int64)
ind=0
for i in range(N_ids):
    for j in range(av_id_size):
        ids[i*av_id_size+j]=ind
    ind+=1
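As a side note, the same sorted id array can be built without the explicit loops; a small equivalent construction, assuming plain NumPy:
#loop-free construction of the sorted id array
ids_alt = np.repeat(np.arange(N_ids, dtype=np.int64), av_id_size)
assert np.array_equal(ids, ids_alt)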
Hausdorff function
The following function is a slightly modified version of the SciPy source. The following modifications were made:
- For very small input arrays the shuffling part is commented out (enable shuffling for larger arrays and try out on your real data what works best)
- At least on Windows the Anaconda SciPy function appears to have some performance issues (much slower than on Linux); the LLVM-based Numba version looks consistent
- Indices of the Hausdorff pair removed
- Distance loop unrolled for the (N,2) case
#Modified Code from Scipy-source
#https://github.com/scipy/scipy/blob/master/scipy/spatial/_hausdorff.pyx
#Copyright (C) Tyler Reddy, Richard Gowers, and Max Linke, 2016
#Copyright © 2001, 2002 Enthought, Inc.
#All rights reserved.
#Copyright © 2003-2013 SciPy Developers.
#All rights reserved.
#Redistribution and use in source and binary forms, with or without modification, are permitted
#provided that the following conditions are met:
#Redistributions of source code must retain the above copyright notice, this list of conditions
#and the following disclaimer.
#Redistributions in binary form must reproduce the above copyright notice, this list of conditions
#and the following disclaimer in the documentation and/or other materials provided with the distribution.
#Neither the name of Enthought nor the names of the SciPy Developers may be used to endorse or
#promote products derived from this software without specific prior written permission.
#THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
#IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
#FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE
#LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
#(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,
#OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
#CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
#THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

@nb.njit()
def directed_hausdorff_nb(ar1, ar2):
    N1 = ar1.shape[0]
    N2 = ar2.shape[0]
    data_dims = ar1.shape[1]

    # Shuffling for very small arrays disabled
    # Enable it for larger arrays
    #resort1 = np.arange(N1)
    #resort2 = np.arange(N2)
    #np.random.shuffle(resort1)
    #np.random.shuffle(resort2)
    #ar1 = ar1[resort1]
    #ar2 = ar2[resort2]

    cmax = 0
    for i in range(N1):
        no_break_occurred = True
        cmin = np.inf
        for j in range(N2):
            # faster performance with square of distance
            # avoid sqrt until very end
            # Simplification (loop unrolling) for (N,2) arrays
            d = (ar1[i, 0] - ar2[j, 0])**2 + (ar1[i, 1] - ar2[j, 1])**2
            if d < cmax:  # break out of `for j` loop
                no_break_occurred = False
                break
            if d < cmin:  # always true on first iteration of for-j loop
                cmin = d
        # always true on first iteration of for-j loop, after that only
        # if d >= cmax
        if cmin != np.inf and cmin > cmax and no_break_occurred == True:
            cmax = cmin
    return np.sqrt(cmax)
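A quick sanity check against SciPy on small random arrays (the shuffling only affects the early-break heuristic, not the result, so both implementations should agree):
from scipy.spatial.distance import directed_hausdorff
a = np.random.rand(5, 2)
b = np.random.rand(8, 2)
#the Numba version and the SciPy reference should return the same value
assert np.isclose(directed_hausdorff_nb(a, b), directed_hausdorff(a, b)[0])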
Calculating Hausdorff distance on subsets
@nb.njit(parallel=True)
def get_distance_mat(def_slice,lat_lon):
    Num_ids=def_slice.shape[0]-1
    out=np.empty((Num_ids,Num_ids),dtype=np.float64)
    for i in nb.prange(Num_ids):
        ar1=lat_lon[def_slice[i]:def_slice[i+1],:]
        for j in range(i,Num_ids):
            ar2=lat_lon[def_slice[j]:def_slice[j+1],:]
            dist=directed_hausdorff_nb(ar1, ar2)
            out[i,j]=dist
            out[j,i]=dist
    return out
Example and Timings
#def_slice defines the start and end of the slices
_,def_slice=np.unique(ids,return_index=True)
def_slice=np.append(def_slice,ids.shape[0])
%timeit res_1=get_distance_mat(def_slice,lat_lon)
#1min 2s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
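If you need the id labels on the result, as in the desired output of the question, the matrix can be wrapped in a DataFrame; a small sketch, assuming pandas is available and that ids holds (or maps back to) the real id_easy values:
import pandas as pd
res_1 = get_distance_mat(def_slice, lat_lon)
unique_ids = np.unique(ids)   #same order as the rows/columns of res_1
res_df = pd.DataFrame(res_1, index=unique_ids, columns=unique_ids)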
Answer 2:
You're talking about calculating 500000^2+ distances. If you calculate 1000 of these distances every second, it will take you 7.93 years to complete your matrix. I'm not sure whether the Hausdorff distance is symmetric, but even if it is, that only saves you a factor of two (3.96 years).
The matrix will also take about a terabyte of memory.
I recommend calculating this only when needed, or if you really need the whole matrix, you'll need to parallelize the calculations. On the bright side, this problem can easily be broken up. For example, with four cores, you can split the problem thusly (in pseudocode):
n = len(u) // 2
m = len(v) // 2
A = hausdorff_distance_matrix(u[:n], v[:m])
B = hausdorff_distance_matrix(u[:n], v[m:])
C = hausdorff_distance_matrix(u[n:], v[:m])
D = hausdorff_distance_matrix(u[n:], v[m:])
results = [[A, B],
           [C, D]]
Where hausdorff_distance_matrix(u, v) returns all distance combinations between u and v. You'll probably need to split it into a lot more than four segments though.
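A minimal sketch of that block decomposition using the standard library, purely to illustrate the idea; hausdorff_distance_matrix is still the hypothetical function from the pseudocode above, and the block size is an arbitrary choice:
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def compute_block(task):
    i, j, u_block, v_block = task
    #hausdorff_distance_matrix is the hypothetical block routine from above
    return i, j, hausdorff_distance_matrix(u_block, v_block)

def blocked_distances(u, v, block_size=1000, workers=4):
    u_blocks = [(i, u[k:k + block_size]) for i, k in enumerate(range(0, len(u), block_size))]
    v_blocks = [(j, v[k:k + block_size]) for j, k in enumerate(range(0, len(v), block_size))]
    tasks = [(i, j, ub, vb) for (i, ub), (j, vb) in product(u_blocks, v_blocks)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        #each worker computes one (row-block, column-block) tile of the matrix;
        #the caller stitches the tiles back together
        return list(pool.map(compute_block, tasks))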
What is the application? Can you get away with only calculating these piece-wise as needed?
Answer 3:
Try computing the distances with cdist from scipy:
from scipy.spatial.distance import cdist
hausdorff_distance = cdist(df[['latitude', 'longitude']], df[['latitude', 'longitude']], lambda u, v: directed_hausdorff(u, v)[0])
hausdorff_distance_df = pd.DataFrame(hausdorff_distance)
As a note though, whatever method you end up using will take a long time to compute, simply due to the sheer volume of the data. Ask yourself whether you genuinely need every single pair of distances.
Practically, these kinds of problems are solved by restricting the number of pairings to a manageable number. For example, slice your data frame into smaller sets, each restricted to a geographical area, and then find the pairwise distances within that area.
The above approach is used by supermarkets to identify spots for their new stores. They do not calculate pairwise distances between every single store they own and every store their competitors own; first they restrict the area to one containing only 5-10 stores in total, and only then do they calculate the distances.
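A small sketch of that restriction, assuming the df from the question; the 0.1-degree cell size is an arbitrary choice, and tracks crossing cell borders would need extra handling:
from scipy.spatial.distance import directed_hausdorff
#assign every id to a cell based on its first point, so whole tracks stay together
df['cell'] = (df.groupby('id_easy')['latitude'].transform('first').round(1).astype(str) + '_' +
              df.groupby('id_easy')['longitude'].transform('first').round(1).astype(str))
pairs = {}
for _, cell_df in df.groupby('cell'):
    tracks = {k: g[['longitude', 'latitude']].to_numpy()
              for k, g in cell_df.groupby('id_easy')}
    for a in tracks:
        for b in tracks:
            pairs[(a, b)] = directed_hausdorff(tracks[a], tracks[b])[0]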
Answer 4:
It comes down to your exact use case:
- If what you want is to pre-compute values because you need ballpark answers quickly, you can 'bin' your answers. In the roughest case, imagine you have 1.2343 * 10^7 and 9.2342 * 10^18. You drop the resolution markedly and do a straight lookup, i.e. you look up '1E7 and 1E19' in your precomputed lookup table. You get the correct answer accurate to a power of 10.
- If you want to characterize your distribution of answers globally, pick a statistically sufficient number of random pairs and use that instead.
- If you want to graph your answers, run successive passes of more precise answers. For example, make your graph from 50x50 points equally spaced. After they are computed, if the user is still there, start filling in 250x250 resolution, and so on. There are cute tricks to deciding where to focus more computational power, just follow the user's mouse instead.
- If you may want the answers for some pieces but just don't know which ones until a later computation, make the evaluation lazy: attempting to __getitem__() a value actually computes it (see the sketch after this list).
- If you think you actually need all the answers now, think harder. There is no problem with that many moving parts. You should find some exploitable constraint in the data that already exists, otherwise the data would already be overwhelming.
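A minimal sketch of such a lazy, cached lookup; tracks here (an id_easy -> coordinate-array mapping) is an assumed helper, not something defined in the question:
from scipy.spatial.distance import directed_hausdorff

class LazyDistanceMatrix:
    def __init__(self, tracks):
        self.tracks = tracks   #dict: id_easy -> (n, 2) coordinate array
        self.cache = {}
    def __getitem__(self, key):
        if key not in self.cache:   #only pay for the pairs that are actually requested
            a, b = key
            self.cache[key] = directed_hausdorff(self.tracks[a], self.tracks[b])[0]
        return self.cache[key]

#dist = LazyDistanceMatrix(tracks)
#dist['aaa', 'bbb']   #computed on first access; later lookups of the same pair hit the cache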
Keep thinking! Keep hacking! Keep notes.
Source: https://stackoverflow.com/questions/58856749/hausdorff-distance-for-large-dataset-in-a-fastest-way