I have about 15 million pairs that consist of a single int, paired with a batch of (2 to 100) other ints.
If it makes a difference, the ints themselve range from 0 to 1
I would do the following:
# create example data
A = np.random.randint(0,15000000,100)
B = [np.random.randint(0,15000000,k) for k in np.random.randint(2,101,100)]
int32
is sufficient
A32 = A.astype(np.int32)
We want to glue all the batches together. First, write down the batch sizes so we can separate them later.
from itertools import chain
sizes = np.fromiter(chain((0,),map(len,B)),np.int32,len(B)+1)
boundaries = sizes.cumsum()
# force int32
B_all = np.empty(boundaries[-1],np.int32)
np.concatenate(B,out=B_all)
After glueing resplit.
B32 = np.split(B_all, boundaries[1:-1])
Finally, make an array of pairs for convenience:
pairs = np.rec.fromarrays([A32,B32],names=["first","second"])
What was the point of glueing and then splitting again?
First, note that the resplit arrays are all views into B_all
, so we do not waste much memory by having both. Also, if we modify either B_all_
or B32
(or rather some of its elements) in place the other one will be automatically updated as well.
The advantage of having B_all
around is efficiency via numpy's reduceat
ufunc method. If we wanted for example the means of all batches we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes
which is faster than looping through pairs['second']
Use numpy. It us the most efficient and you can use it easily with a machine learning model.
If you must store all values in memory, numpy will probably be the most efficient way. Pandas is built on top of numpy so it includes some overhead which you can avoid if you do not need any of the functionality that comes with pandas.
Numpy should have no memory issues when handling data of this size but another thing to consider, and this depends on how you will be using this data, is to use a generator to read from a file that has each pair on a new line. This would reduce memory usage significantly but would be slower than numpy for processing aggregate functions like sum() or max() and is more suitable if each value pair would be processed independently.
with open(file, 'r') as f:
data = (l for l in f) # generator
for line in data:
# process each record here