Best data type (in terms of speed/RAM) for millions of pairs of a single int paired with a batch (2 to 100) of ints

逝去的感伤

I have about 15 million pairs that consist of a single int, paired with a batch of (2 to 100) other ints.

If it makes a difference, the ints themselves range from 0 to 1

3 Answers

花落未央

2021-01-29 04:08
I would do the following:
```
import numpy as np

# create example data: 100 keys, each paired with a batch of 2 to 100 ints
A = np.random.randint(0, 15000000, 100)
B = [np.random.randint(0, 15000000, k) for k in np.random.randint(2, 101, 100)]
```
int32 is sufficient, since every value here is far below 2**31:
```
A32 = A.astype(np.int32)
```
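To get a feel for what the downcast saves, here is a rough sketch comparing the footprint of the same keys stored as int64 and as int32. The count n is an assumption taken from the question (15 million keys), and int64 is requested explicitly because numpy's default integer type depends on the platform.

```python
import numpy as np

n = 15_000_000  # assumed number of keys, taken from the question
keys64 = np.arange(n, dtype=np.int64)

# Guard against silent overflow before downcasting real data
assert keys64.max() <= np.iinfo(np.int32).max
keys32 = keys64.astype(np.int32)

print(keys64.nbytes / 1e6, "MB as int64")
print(keys32.nbytes / 1e6, "MB as int32")
```

The int32 copy takes half the bytes of the int64 one; the same ratio applies to the much larger array of batch values.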
We want to glue all the batches together. First, write down the batch sizes so we can separate them later.
```
from itertools import chain

sizes = np.fromiter(chain((0,),map(len,B)),np.int32,len(B)+1)
boundaries = sizes.cumsum()

# force int32
B_all = np.empty(boundaries[-1],np.int32)
np.concatenate(B,out=B_all)
```
After gluing, split again:
```
B32 = np.split(B_all, boundaries[1:-1])
```
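If holding one array object per batch turns out to be too heavy at the scale of millions of batches (each view is still a separate Python object), a possible alternative is to keep only the flat array and the boundaries and slice on demand. The sketch below uses small made-up data, and get_batch is a hypothetical helper name, not part of numpy.

```python
import numpy as np

# Small illustrative data set (made-up values)
B = [np.array([5, 7]), np.array([1, 2, 3]), np.array([9, 9, 9, 9])]

sizes = np.array([len(b) for b in B], dtype=np.int64)
boundaries = np.concatenate(([0], sizes.cumsum()))  # offsets 0, 2, 5, 9
B_all = np.concatenate(B).astype(np.int32)

def get_batch(i):
    """Return batch i as a view into B_all (hypothetical helper)."""
    return B_all[boundaries[i]:boundaries[i + 1]]

print(get_batch(1))  # the second batch: 1, 2, 3
```

Because basic slicing returns a view, nothing is copied until a batch is actually modified or converted.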
Finally, make an array of pairs for convenience:
```
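B_obj = np.fromiter(B32, dtype=object, count=len(B32))  # ragged batches need an object array (NumPy >= 1.23)
pairs = np.rec.fromarrays([A32, B_obj], names=["first", "second"])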
```
What was the point of gluing and then splitting again?

First, note that the resplit arrays are all views into B_all, so we do not waste much memory by having both. Also, if we modify either B_all or B32 (or rather some of its elements) in place, the other one will be automatically updated as well.
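The view behaviour can be checked directly. This is a minimal sketch on a tiny illustrative array (the values and boundaries are made up), using np.shares_memory to confirm that no copy was made.

```python
import numpy as np

B_all = np.arange(9, dtype=np.int32)     # flat storage holding 0..8
boundaries = np.array([0, 2, 5, 9])      # illustrative batch boundaries
B32 = np.split(B_all, boundaries[1:-1])  # three pieces of length 2, 3 and 4

print(np.shares_memory(B_all, B32[1]))   # True: the piece is a view, not a copy

B32[1][0] = -1                           # write through the view
print(B_all[2])                          # -1: the flat array sees the change
```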

The advantage of having B_all around is efficiency via numpy's reduceat ufunc method. If we wanted, for example, the means of all batches, we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:] which is faster than looping through pairs['second']. Note the sizes[1:]: sizes was built with a leading 0, so it has one more element than there are batches.
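As a small check of the reduceat idea, the sketch below computes batch means both ways on made-up data. The names starts, means_fast and means_loop are illustrative; here sizes has no leading zero, so it lines up with starts directly.

```python
import numpy as np

# Made-up batches for illustration
B = [np.array([1, 3]), np.array([2, 4, 6]), np.array([10, 20, 30, 40])]

sizes = np.array([len(b) for b in B])
starts = np.concatenate(([0], sizes.cumsum()[:-1]))  # start offset of each batch
B_all = np.concatenate(B)

# One vectorized pass: sum each segment, then divide by its length
means_fast = np.add.reduceat(B_all, starts) / sizes

# Reference result from a plain Python loop
means_loop = np.array([b.mean() for b in B])

print(means_fast)  # batch means: 2.0, 4.0, 25.0
```

One caveat: reduceat does not return 0 for an empty segment, so this relies on every batch having at least one element, which holds here since batches have 2 to 100 ints.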
臣服心动

2021-01-29 04:11

Use numpy. It is the most efficient option, and you can easily use it with a machine learning model.

醉酒成梦

2021-01-29 04:19
If you must store all values in memory, numpy will probably be the most efficient way. Pandas is built on top of numpy so it includes some overhead which you can avoid if you do not need any of the functionality that comes with pandas.

Numpy should have no memory issues when handling data of this size. Another option, depending on how you will be using the data, is a generator that reads from a file with each pair on its own line. This would reduce memory usage significantly, but it would be slower than numpy for aggregate functions like sum() or max(), so it is more suitable when each pair is processed independently.
```
# `file` is a placeholder for the path to your data file
with open(file, 'r') as f:
    data = (l for l in f)  # generator: yields one line at a time
    for line in data:
        ...  # process each record here
```
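A fuller version of the same idea, as a sketch only: it assumes each line holds whitespace-separated ints with the key first and the batch after it, which the question does not specify. iter_pairs is a hypothetical helper, and the sample file is written to a temporary path just to make the example runnable.

```python
import os
import tempfile

def iter_pairs(path):
    """Yield (key, batch) tuples one line at a time (hypothetical helper)."""
    with open(path, "r") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            key, *batch = map(int, line.split())
            yield key, batch

# Write a tiny sample file so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("7 1 2 3\n42 10 20\n")
    sample_path = tmp.name

# Process each record independently; only one line is in memory at a time
results = [(key, max(batch)) for key, batch in iter_pairs(sample_path)]
os.remove(sample_path)

print(results)  # [(7, 3), (42, 20)]
```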