So, let's say I have 100,000 float arrays with 100 elements each. I need the X highest values, but only if they are greater than Y. Any element not matching this
Using numpy:
# assign zero to all elements less than or equal to `lowValY`
a[a<=lowValY] = 0
# find n-th largest element in the array (where n=highCountX)
x = partial_sort(a, highCountX, reverse=True)[:highCountX][-1]
# zero out everything smaller than that n-th largest value
a[a<x] = 0 #NOTE: it might leave more than highCountX non-zero elements
           #      if there are duplicates
Where partial_sort could be:
def partial_sort(a, n, reverse=False):
    #NOTE: in general it should return full list but in your case this will do
    return sorted(a, reverse=reverse)[:n]
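If all 100,000 arrays are stacked into one 2-D array (one row per array), the same idea can be applied to every row at once. This is only a rough sketch: the function name `keep_top_rows` is made up for illustration, `np.partition` is swapped in for the full sort, and it assumes `low_val_y >= 0` (so the zeros written in the first step never outrank a kept value) and `high_count_x <= a.shape[1]`.

```python
import numpy as np

def keep_top_rows(a, high_count_x, low_val_y):
    """Hypothetical helper: apply the thresholding to each row of a 2-D array."""
    # Copy of `a` with everything <= low_val_y replaced by zero
    out = np.where(a > low_val_y, a, 0.0)
    # After partitioning, column -high_count_x holds each row's n-th largest value
    kth = np.partition(out, -high_count_x, axis=1)[:, -high_count_x]
    # Zero everything below the row's n-th largest value (ties survive, as noted above)
    out[out < kth[:, None]] = 0.0
    return out
```

Called as `keep_top_rows(data, highCountX, lowValY)`, it returns a new array and leaves `data` untouched.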
The expression a[a<value] = 0 can be written without numpy as follows:
for i, x in enumerate(a):
    if x < value:
        a[i] = 0
You can use map and lambda; it should be fast enough.
new_array = list(map(lambda x: x if x>y else 0, array))
The simplest way would be:
topX = sorted([x for x in array if x > lowValY], reverse=True)[highCountX-1]
print([x if x >= topX else 0 for x in array])
In pieces, this selects all the elements greater than lowValY:
[x for x in array if x > lowValY]
This list contains only the elements greater than the threshold. It is then sorted so the largest values are at the start:
sorted(..., reverse=True)
Then a list index takes the threshold for the top highCountX elements:
sorted(...)[highCountX-1]
Finally, the original array is filled out using another list comprehension:
[x if x >= topX else 0 for x in array]
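To see the pieces working together, here is a small worked run. The sample values are invented purely for illustration; only the variable names come from the code above.

```python
array = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8]  # made-up sample data
lowValY = 0.3
highCountX = 3

# Elements above the threshold, largest first: [0.9, 0.8, 0.7, 0.4]
candidates = sorted([x for x in array if x > lowValY], reverse=True)

# Value of the highCountX-th largest candidate: 0.7
topX = candidates[highCountX - 1]

# Keep values at or above topX, zero the rest
result = [x if x >= topX else 0 for x in array]
print(result)  # [0, 0.9, 0, 0.7, 0, 0.8]
```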
There is a boundary condition when two or more equal elements tie for the highCountX-th highest value (the 3rd highest in your example). The resulting array will then contain that value more than once, so more than highCountX elements stay non-zero.
There are other boundary conditions as well, such as if len(array) < highCountX. Handling such conditions is left to the implementor.
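One possible way to cover both boundary conditions is to select by index instead of by value. This is a sketch, not the only option: `keep_top` is a hypothetical name, `heapq.nlargest` replaces the full sort, and ties are broken in favour of the later index, which is an arbitrary choice made here.

```python
import heapq

def keep_top(array, high_count_x, low_val_y):
    """Hypothetical helper: keep at most high_count_x values above low_val_y."""
    # Pair each candidate with its index so tied values stay distinguishable
    candidates = [(x, i) for i, x in enumerate(array) if x > low_val_y]
    # nlargest simply returns fewer items when there are too few candidates
    keep = {i for _, i in heapq.nlargest(high_count_x, candidates)}
    return [x if i in keep else 0 for i, x in enumerate(array)]
```

With this version `keep_top([5, 5, 5, 1], 2, 0)` leaves exactly two non-zero elements, and a list shorter than `high_count_x` no longer raises an IndexError.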