I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats, dimensions = (868940742, 3), which takes up about 20 GB of memory on my machine.
At the moment each call to np.argsort is generating a (868940742,) array of int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to sort the columns of full_arr you are generating another (868940742, 3) array of floats, since fancy indexing always returns a copy rather than a view.
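As a rough sanity check on those figures (simple arithmetic, assuming 8-byte int64 indices and 8-byte float64 values):

n = 868940742
n * 8 / 1e9      # ~6.95 GB for the 1D int64 index array
n * 3 * 8 / 1e9  # ~20.9 GB for the copied (868940742, 3) float64 array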
One fairly obvious improvement would be to sort full_arr in place using its .sort() method. Unfortunately, .sort() does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an in-place sort over one of the three columns by getting a view onto your array as a structured array with three float fields, then sorting by one of these fields:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
In this case I'm sorting full_arr in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns ('f8') - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. full_arr.flags.C_CONTIGUOUS == True.
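If you want to convince yourself that the view trick behaves as expected before touching the full 20 GB array, something like this quick sketch should do (assuming the same layout of three contiguous float64 columns):

import numpy as np

# toy array with the same layout as full_arr: three contiguous float64 columns
arr = np.random.randn(10, 3)
assert arr.flags.C_CONTIGUOUS

# sort in place by the first column via a structured view - no large temporaries
arr.view('f8, f8, f8').sort(order=['f0'], axis=0)

# the rows of arr are now ordered by column 0
assert np.all(np.diff(arr[:, 0]) >= 0)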
Credit for this method should go to Joe Kington for his answer here.
Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using np.argsort to generate an index array, as you mentioned in the comments below (see this previous question). If you use np.argsort to obtain a set of indices to sort by, you might see a modest performance gain by using np.take rather than direct indexing to get the sorted array:
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop
However I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.
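You can verify that both approaches hand you back a freshly allocated array rather than a view, e.g. with np.shares_memory:

import numpy as np

x = np.random.randn(5, 2)
idx = x[:, 0].argsort()
np.shares_memory(x, x[idx])                    # False - fancy indexing allocates a copy
np.shares_memory(x, np.take(x, idx, axis=0))   # False - so does np.take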
Regarding your question about why sorting the second array is faster - yes, you should expect any reasonable sorting algorithm to be faster when there are fewer unique values in the array because on average there's less work for it to do. Suppose I have a random sequence of digits between 1 and 10:
5 1 4 8 10 2 6 9 7 3
There are 10! = 3628800 possible ways to arrange these digits, but only one in which they are in ascending order. Now suppose there are just 5 unique digits:
4 4 3 2 3 1 2 5 1 5
Now there are 2⁵ = 32 ways to arrange these digits in ascending order, since I could swap any pair of identical digits in the sorted vector without breaking the ordering.
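If you want to see that counting argument play out on a smaller case, a quick illustrative check (not part of the original benchmarks) is to enumerate the permutations directly:

from itertools import permutations

# 4 distinct digits: 4! = 24 orderings, only one of which is ascending
print(sum(p == tuple(sorted(p)) for p in permutations([3, 1, 4, 2])))   # 1

# 2 distinct digits, each repeated twice: the two copies of each digit can be
# swapped without breaking the order, so 2**2 = 4 of the 24 orderings are ascending
print(sum(p == tuple(sorted(p)) for p in permutations([2, 1, 2, 1])))   # 4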
By default, np.ndarray.sort() uses Quicksort. The qsort variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.
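One way to picture why the pivot-equal values are 'free' is a three-way (Dutch national flag) partition, which gathers them in a single pass; this is only a schematic Python sketch, not NumPy's actual C implementation:

def three_way_partition(a, lo, hi):
    # Partition a[lo:hi+1] around a pivot into <, ==, > regions in one sweep.
    pivot = a[(lo + hi) // 2]
    lt, i, gt = lo, lo, hi
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    # a[lt:gt+1] now holds every pivot-equal value in its final position;
    # only a[lo:lt] and a[gt+1:hi+1] need further work.
    return lt, gt

The more duplicate values there are, the larger that middle block and the smaller the two sub-arrays that are left to recurse into.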
For example:
%%timeit -n 1 -r 100 x = np.random.randint(0, 11, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop
%%timeit -n 1 -r 100 x = np.random.randint(0, 1001, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop
In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size than the larger array, then the cost of copying it due to the fancy indexing will also be smaller.
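For instance, comparing nbytes for the same number of elements at two different itemsizes gives a rough idea of how much less data there is to copy:

import numpy as np

np.zeros(100000, dtype=np.float64).nbytes   # 800000 bytes
np.zeros(100000, dtype=np.float32).nbytes   # 400000 bytes - half as much data to copy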