Best data type (in terms of speed/RAM) for millions of pairs of a single int paired with a batch (2 to 100) of ints

逝去的感伤 2021-01-29 03:37

I have about 15 million pairs that consist of a single int, paired with a batch of (2 to 100) other ints.

If it makes a difference, the ints themselves range from 0 to 1

3 Answers
  • 2021-01-29 04:08

    I would do the following:

    import numpy as np

    # create example data: 100 single ints (A) and 100 batches of 2 to 100 ints each (B)
    A = np.random.randint(0,15000000,100)
    B = [np.random.randint(0,15000000,k) for k in np.random.randint(2,101,100)]
    

    int32 is sufficient, since every value is well below 2**31 - 1:

    A32 = A.astype(np.int32)
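
    To put a rough number on the saving, here is a back-of-the-envelope sketch. The upper bound of 15 million mirrors the example data above; the average batch size of 51 is purely an assumption (a uniform spread over 2 to 100), not something stated in the question.

```python
import numpy as np

# Assumption: values stay below 15 million, as in the example data.
max_value = 15_000_000
assert max_value <= np.iinfo(np.int32).max  # int32 holds up to 2**31 - 1

# Hypothetical workload: 15 million batches averaging 51 ints each.
n_values = 15_000_000 * 51

bytes_int32 = n_values * np.dtype(np.int32).itemsize
bytes_int64 = n_values * np.dtype(np.int64).itemsize
print(f"int32: {bytes_int32 / 1e9:.2f} GB, int64: {bytes_int64 / 1e9:.2f} GB")
```

    Under these assumptions the batch data takes half the memory in int32 that it would in int64.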
    

    We want to glue all the batches together. First, write down the batch sizes so we can separate them later.

    from itertools import chain
    
    sizes = np.fromiter(chain((0,),map(len,B)),np.int32,len(B)+1)
    boundaries = sizes.cumsum()
    
    # force int32
    B_all = np.empty(boundaries[-1],np.int32)
    np.concatenate(B,out=B_all)
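
    With the flat array and the boundaries in hand, any single batch can already be recovered by slicing. A minimal sketch on three tiny made-up batches (get_batch is a hypothetical helper, not part of NumPy):

```python
import numpy as np

# Three small hypothetical batches standing in for B.
B = [np.array([5, 7]), np.array([1, 2, 3]), np.array([9, 9, 9, 9])]

sizes = np.array([0] + [len(b) for b in B], dtype=np.int64)
boundaries = sizes.cumsum()          # [0, 2, 5, 9]
B_all = np.concatenate(B).astype(np.int32)

def get_batch(i):
    """Return batch i as a view into B_all (hypothetical helper)."""
    return B_all[boundaries[i]:boundaries[i + 1]]

print(get_batch(1))  # [1 2 3]
```

    Batch i occupies the half-open range from boundaries[i] to boundaries[i + 1], which is why the size list starts with a 0.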
    

    After gluing, split again.

    B32 = np.split(B_all, boundaries[1:-1])
    

    Finally, make an array of pairs for convenience:

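    # the ragged batches need an explicit object array (NumPy >= 1.24 no longer builds one implicitly)
    pairs = np.rec.fromarrays([A32, np.array(B32, dtype=object)], names=["first", "second"])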
    

    What was the point of gluing and then splitting again?

    First, note that the resplit arrays are all views into B_all, so we do not waste much memory by having both. Also, if we modify either B_all or B32 (or rather some of its elements) in place, the other one will be updated automatically as well.
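
    A small sketch of that view behaviour, using a made-up nine-element array in place of the real B_all:

```python
import numpy as np

B_all = np.arange(9, dtype=np.int32)      # hypothetical flat storage
B32 = np.split(B_all, [2, 5])             # batches of sizes 2, 3 and 4

# Each piece is a view: it shares its buffer with B_all.
print(all(np.shares_memory(b, B_all) for b in B32))

B_all[2] = -1        # write through the flat array ...
print(B32[1])        # ... and the second batch sees it

B32[0][:] = 100      # write through a batch ...
print(B_all[:2])     # ... and the flat array sees it
```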

    The advantage of having B_all around is efficiency via numpy's reduceat ufunc method. If we wanted, for example, the means of all batches, we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:] (skipping the leading 0 in sizes), which is faster than looping through pairs['second'].
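
    As a self-contained check that the reduceat route agrees with a plain loop, here is a sketch on randomly generated data (the batch count of 1000 and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 1000 batches of 2 to 100 ints each.
B = [rng.integers(0, 15_000_000, k) for k in rng.integers(2, 101, 1000)]

sizes = np.array([len(b) for b in B])
starts = np.concatenate(([0], sizes.cumsum()[:-1]))   # start offset of each batch
B_all = np.concatenate(B)

sums = np.add.reduceat(B_all, starts)   # one sum per batch, no Python loop
means = sums / sizes

loop_means = np.array([b.mean() for b in B])
print(np.allclose(means, loop_means))
```

    Note that reduceat relies on every batch being non-empty, which holds here because each batch has at least 2 elements.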

  • 2021-01-29 04:11

    Use numpy. It is the most efficient and you can use it easily with a machine learning model.

  • 2021-01-29 04:19

    If you must store all values in memory, numpy will probably be the most efficient way. Pandas is built on top of numpy so it includes some overhead which you can avoid if you do not need any of the functionality that comes with pandas.

    Numpy should have no memory issues with data of this size. Another option, depending on how you will use the data, is a generator that reads from a file holding one pair per line. This reduces memory usage significantly, but it is slower than numpy for aggregate functions like sum() or max(), so it suits cases where each pair is processed independently.

    with open(file, 'r') as f:  # `file` is the path to the data file
        data = (l for l in f)  # generator: yields one line at a time
        for line in data:
            pass  # process each record here
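
    A slightly fuller sketch of the same idea follows. The file format is an assumption, since the question does not specify one: each line holds whitespace-separated ints, the first being the single int and the rest its batch. read_pairs is a hypothetical helper name.

```python
import os
import tempfile

def read_pairs(path):
    """Yield (key, batch) tuples one line at a time (hypothetical format)."""
    with open(path, "r") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            key, *batch = map(int, line.split())
            yield key, batch

# Write a tiny sample file so the sketch is runnable.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 10 20\n2 30 40 50\n")
    path = tmp.name

largest = {key: max(batch) for key, batch in read_pairs(path)}
os.remove(path)
print(largest)  # {1: 20, 2: 50}
```

    Only one line is held in memory at a time, so the per-record work (here, max of each batch) never needs the whole dataset loaded.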
    