Question
I have seen:
- How do I find the closest value to a given number in an array?
- How do I find the closest array element to an arbitrary (non-member) number?
These relate to vanilla python and not pandas.
If I have the series:
ix num
0 1
1 6
2 4
3 5
4 2
And I input 3, how can I (efficiently) find:
- The index of 3 if it is found in the series
- The index of the value below and above 3 if it is not found in the series.
I.e., with the above series {1,6,4,5,2} and input 3, I should get values (4,2) with indexes (2,4).
Answer 1:
You could use argsort(), like this.
Say, input = 3:
In [198]: input = 3
In [199]: df.ix[(df['num']-input).abs().argsort()[:2]]
Out[199]:
num
2 4
4 2
df_sort is the dataframe with the 2 closest values:
In [200]: df_sort = df.ix[(df['num']-input).abs().argsort()[:2]]
For index,
In [201]: df_sort.index.tolist()
Out[201]: [2, 4]
For values,
In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]
For reference, the df used in the solution above was:
In [197]: df
Out[197]:
num
0 1
1 6
2 4
3 5
4 2
Answer 2:
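As an addition to John Galt's answer, I recommend using iloc instead of ix. argsort() returns positions and iloc is purely positional, so it works even with an unsorted integer index, whereas .ix looks at the index labels first. Note also that .ix was deprecated in pandas 0.20 and removed in pandas 1.0.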
df.iloc[(df['num']-input).abs().argsort()[:2]]
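To illustrate the difference, here is a minimal sketch that assumes the question's values but a reversed integer index (the index labels are made up for the example). The variable is called target rather than input only to avoid shadowing the built-in. Since argsort() yields positions, feeding them to the label-based loc picks different rows than iloc does.

```python
import pandas as pd

# Same values as in the question, but with a hypothetical non-default integer index.
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]}, index=[4, 3, 2, 1, 0])
target = 3

# Positions (0..n-1) of the two rows closest to the target.
# A stable sort keeps ties in their original order.
order = (df["num"] - target).abs().to_numpy().argsort(kind="stable")[:2]

closest = df.iloc[order]   # positional lookup: the rows that argsort meant
by_label = df.loc[order]   # label lookup: treats the positions as labels

print(closest["num"].tolist(), closest.index.tolist())
print(by_label["num"].tolist())
```

With this index, iloc returns the values 4 and 2, while loc returns 4 and 1, because label 4 belongs to the row holding 1.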
Answer 3:
Apart from not completely answering the question, an extra disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of ~N log(N).
However, it is possible to achieve the same results in ~N. This approach separates the dataframe into two subsets, one with values smaller and one with values larger than the desired value. The lower neighbour is then the largest value in the smaller subset, and vice versa for the upper neighbour.
This gives the following code snippet:
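def find_neighbours(df, value):
    # Rows whose value matches exactly
    exactmatch = df[df.num == value]
    if not exactmatch.empty:
        return exactmatch.index[0]
    else:
        # Largest value below and smallest value above; idxmax()/idxmin()
        # raise if the corresponding subset is empty
        lowerneighbour_ind = df[df.num < value].num.idxmax()
        upperneighbour_ind = df[df.num > value].num.idxmin()
        return lowerneighbour_ind, upperneighbour_ind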
This approach is similar to using partition in pandas, which can be really useful when dealing with large datasets and complexity becomes an issue.
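As a rough sketch of the partitioning idea, NumPy's argpartition can select the k nearest values without sorting everything. This is not the code from the answer above but a swapped-in variant, and the helper name k_nearest is made up for illustration. It returns the nearest values by absolute distance, as in the argsort answers, rather than one neighbour on each side.

```python
import numpy as np
import pandas as pd

def k_nearest(series, value, k=2):
    """Index labels of the k entries closest to `value` (hypothetical helper)."""
    dist = (series - value).abs().to_numpy()
    k = min(k, len(dist))
    if k == 0:
        return []
    # argpartition moves the k smallest distances into the first k slots
    # (in arbitrary order) using selection instead of a full sort.
    pos = np.argpartition(dist, k - 1)[:k]
    # Sort only those k positions so the result is ordered by distance.
    pos = pos[np.argsort(dist[pos], kind="stable")]
    return series.index[pos].tolist()

s = pd.Series([1, 6, 4, 5, 2], name="num")
nearest = k_nearest(s, 3)
print(sorted(nearest))
```

For the series in the question and the value 3, the two labels returned are 2 and 4; their relative order is not guaranteed because both are at distance 1.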
Comparing both strategies shows that for large N, the partitioning strategy is indeed faster. For small N, the sorting strategy will be more efficient, as it is implemented at a much lower level. It is also a one-liner, which might increase code readability.
The code to replicate the timing plot can be seen below:
from matplotlib import pyplot as plt
import pandas
import numpy
import timeit
value=3
sizes=numpy.logspace(2, 5, num=50, dtype=int)
sort_results, partition_results=[],[]
for size in sizes:
    df=pandas.DataFrame({"num":100*numpy.random.random(size)})
    sort_results.append(timeit.Timer("df.iloc[(df['num']-value).abs().argsort()[:2]].index",
                                     globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
    partition_results.append(timeit.Timer('find_neighbours(df,value)',
                                          globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
# autorange() returns (number of loops, total time), so divide to get time per call
sort_time=[time/amount for amount,time in sort_results]
partition_time=[time/amount for amount,time in partition_results]
plt.plot(sizes, sort_time)
plt.plot(sizes, partition_time)
plt.legend(['Sorting','Partitioning'])
plt.title('Comparison of strategies')
plt.xlabel('Size of Dataframe')
plt.ylabel('Time in s')
plt.savefig('speed_comparison.png')
Answer 4:
If your series is already sorted, you could use something like this.
import pandas as pd

def closest(df, col, val, direction):
    # Number of rows <= val, i.e. the position of the first larger value
    n = len(df[df[col] <= val])
    if direction < 0:
        n -= 1
    if n < 0 or n >= len(df):
        print('err - value outside range')
        return None
    # Positional lookup (.ix was removed in pandas 1.0)
    return df[col].iloc[n]

df = pd.DataFrame(pd.Series(range(0,10,2)), columns=['num'])
for find in range(-1, 2):
    lc = closest(df, 'num', find, -1)
    hc = closest(df, 'num', find, 1)
    print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))
df: num
0 0
1 2
2 4
3 6
4 8
err - value outside range
Closest to -1 is None, lower and 0, higher.
Closest to 0 is 0, lower and 2, higher.
Closest to 1 is 0, lower and 2, higher.
Answer 5:
If the series is already sorted, an efficient method of finding the indexes is by using bisect functions. An example:
idx = bisect_left(df['num'].values, 3)
Let's consider that the column col of the dataframe df is sorted.
- In the case where the value val is in the column, bisect_left will return the precise index of the value in the list and bisect_right will return the index of the next position.
- In the case where the value is not in the list, both bisect_left and bisect_right will return the same index: the one where to insert the value to keep the list sorted.
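The two cases can be checked on a small sorted list. This is only an illustration with made-up values, using the standard-library bisect module:

```python
from bisect import bisect_left, bisect_right

values = [1, 2, 4, 4, 5, 6]  # must already be sorted

# 4 is present (twice): the two functions bracket the run of equal values.
left_hit = bisect_left(values, 4)    # first position holding 4
right_hit = bisect_right(values, 4)  # position just past the last 4

# 3 is absent: both functions return the same insertion point.
left_miss = bisect_left(values, 3)
right_miss = bisect_right(values, 3)

print(left_hit, right_hit)    # 2 4
print(left_miss, right_miss)  # 2 2
```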
Hence, to answer the question, the following code gives the index of val in col if it is found, and the indexes of the closest values otherwise. This solution works even when the values in the list are not unique.
from bisect import bisect_left, bisect_right

def get_closests(df, col, val):
    lower_idx = bisect_left(df[col].values, val)
    higher_idx = bisect_right(df[col].values, val)
    if higher_idx == lower_idx:  # val is not in the list
        return lower_idx - 1, lower_idx
    else:  # val is in the list
        return lower_idx
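Bisect algorithms are very efficient at finding the index of a specific value val in the dataframe column col, or its closest neighbours, but they require the column to be sorted. Note that the indexes returned are positions in the column, which coincide with the index labels only for a default 0..n-1 index.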
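Since the series in the question is not sorted, one option is to sort a copy first and translate the bisect positions back to index labels. The following is only a sketch under that assumption; the function name closest_labels is hypothetical, and the sort makes each call ~N log(N) again unless the sorted copy is reused across lookups.

```python
from bisect import bisect_left, bisect_right
import pandas as pd

def closest_labels(series, val):
    """Index labels of `val` if present, else of its lower/upper neighbours."""
    # Sorted copy; the index labels travel with their values.
    ordered = series.sort_values(kind="mergesort")
    values = ordered.to_numpy()
    lo = bisect_left(values, val)
    hi = bisect_right(values, val)
    if lo != hi:
        # val is present: return every label that holds it
        return ordered.index[lo:hi].tolist()
    # val is absent: the neighbours sit on either side of the insertion point.
    # None marks a side that does not exist (val outside the value range).
    lower = ordered.index[lo - 1] if lo > 0 else None
    upper = ordered.index[lo] if lo < len(values) else None
    return lower, upper

s = pd.Series([1, 6, 4, 5, 2], name="num")
print(closest_labels(s, 3))  # labels of the lower neighbour (2) and upper neighbour (4)
print(closest_labels(s, 5))  # 5 is present
print(closest_labels(s, 0))  # nothing below 0
```

For the question's input 3 this yields the labels 4 and 2, i.e. the values 2 and 4.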
Source: https://stackoverflow.com/questions/30112202/how-do-i-find-the-closest-values-in-a-pandas-series-to-an-input-number