Recording data in a long-running Python simulation

深忆病人 2021-01-29 04:40

I am running a simulation from which I need to record some small numpy arrays every cycle. My current solution is to load, write, then save, as follows:

existing_d         


        
2 answers
  • 2021-01-29 05:19

    One solution is to use a memory-mapped file through numpy.memmap. The code is below; the numpy.memmap documentation contains information that is important for understanding it.

    import numpy as np
    from os.path import getsize
    
    from time import time
    
    filename = "data.bin"
    
    # Datatype used for memmap
    dtype = np.int32
    
    # Create memmap for the first time (w+). Arbitrary shape. Probably good to try and guess the correct size.
    mm = np.memmap(filename, dtype=dtype, mode='w+', shape=(1, ))
    print("File has {} bytes".format(getsize(filename)))
    
    
    N = 20
    num_data_per_loop = 10**7
    
    # Main loop to append data
    for i in range(N):
    
        # Mapping past the current end of the file in mode='r+' extends the file
        starttime = time()
        mm = np.memmap(filename,
                       dtype=dtype,
                       mode='r+',
                       offset=np.dtype(dtype).itemsize*num_data_per_loop*i,
                       shape=(num_data_per_loop, ))
        mm[:] = np.arange(start=num_data_per_loop*i, stop=num_data_per_loop*(i+1))
        mm.flush()
        endtime = time()
        print("{:3d}/{:3d} ({:6.4f} sec): File has {} bytes".format(i, N, endtime-starttime, getsize(filename)))
    
    A = np.array(np.memmap(filename, dtype=dtype, mode='r'))
    if np.array_equal(A, np.arange(num_data_per_loop*N, dtype=dtype)):
        print("Correct")
    

    The output I get is:

    File has 4 bytes
      0/ 20 (0.2167 sec): File has 40000000 bytes
      1/ 20 (0.2200 sec): File has 80000000 bytes
      2/ 20 (0.2131 sec): File has 120000000 bytes
      3/ 20 (0.2180 sec): File has 160000000 bytes
      4/ 20 (0.2215 sec): File has 200000000 bytes
      5/ 20 (0.2141 sec): File has 240000000 bytes
      6/ 20 (0.2187 sec): File has 280000000 bytes
      7/ 20 (0.2138 sec): File has 320000000 bytes
      8/ 20 (0.2137 sec): File has 360000000 bytes
      9/ 20 (0.2227 sec): File has 400000000 bytes
     10/ 20 (0.2168 sec): File has 440000000 bytes
     11/ 20 (0.2141 sec): File has 480000000 bytes
     12/ 20 (0.2150 sec): File has 520000000 bytes
     13/ 20 (0.2144 sec): File has 560000000 bytes
     14/ 20 (0.2190 sec): File has 600000000 bytes
     15/ 20 (0.2186 sec): File has 640000000 bytes
     16/ 20 (0.2210 sec): File has 680000000 bytes
     17/ 20 (0.2146 sec): File has 720000000 bytes
     18/ 20 (0.2178 sec): File has 760000000 bytes
     19/ 20 (0.2182 sec): File has 800000000 bytes
    Correct
    

    The time per iteration is approximately constant because each memmap, thanks to its offset, maps only the newly appended chunk rather than the whole file. The amount of RAM needed is also constant, apart from loading the whole file for the check at the end.

    I hope this solves your performance issues.

    Kind regards,

    Lukas

    Edit 1: It seems the poster has answered their own question. I am leaving this answer up as an alternative.
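As a rough sketch of how the memmap approach in the answer above might be adapted to the small per-cycle arrays from the question: the helper below (`append_chunk`, a hypothetical name that is not part of the original answer) uses the current file size as the offset, so the caller does not need to track a cycle index. It assumes float64 data and the same array shape on every cycle, and it relies on the behaviour shown in the answer, where mapping past the end of the file in mode 'r+' grows the file.

```python
import os
import tempfile

import numpy as np


def append_chunk(filename, chunk, dtype=np.float64):
    """Append one array to a raw binary file through a memory map (hypothetical helper)."""
    chunk = np.ascontiguousarray(chunk, dtype=dtype)
    # The current file size is the byte offset at which the new chunk starts.
    offset = os.path.getsize(filename) if os.path.exists(filename) else 0
    # 'w+' creates the file; 'r+' maps past the current end, which grows the file
    # (the behaviour the answer above relies on).
    mode = "r+" if offset > 0 else "w+"
    mm = np.memmap(filename, dtype=dtype, mode=mode, offset=offset, shape=chunk.shape)
    mm[...] = chunk
    mm.flush()
    del mm  # drop the reference so only one chunk is mapped at a time


# Usage: record a (3, 2) array on each of 5 cycles, then read everything back.
log_path = os.path.join(tempfile.mkdtemp(), "weights.bin")
for cycle in range(5):
    append_chunk(log_path, np.full((3, 2), cycle, dtype=np.float64))

# The raw file stores no shape or dtype, so both must be supplied when reading.
recorded = np.memmap(log_path, dtype=np.float64, mode="r").reshape(-1, 3, 2)
print(recorded.shape)
```

Because the file is raw binary, the reader has to know the dtype and per-cycle shape in advance; a self-describing format such as HDF5 avoids that at the cost of an extra dependency.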

  • 2021-01-29 05:31

    I have found a good working solution using the h5py library. Performance is far better because no existing data has to be read back, and I have cut down on the number of numpy array append operations. A short example:

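    import h5py

    # cycle_num and weight_matrix (shape (3, 2)) come from the simulation loop
    with h5py.File("logfile_name", "a") as f:
      ds = f["weights"] if "weights" in f else f.create_dataset("weights", shape=(3,2,100000), maxshape=(3, 2, None))
      ds[:,:,cycle_num] = weight_matrix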
    

    I am not sure whether the numpy-style slicing means the matrix gets copied, but there is a write_direct(source, source_sel=None, dest_sel=None) function that avoids this, which could be useful for larger matrices.
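The h5py example above declares an unlimited last axis with maxshape=(3, 2, None) but never grows the dataset. A minimal sketch of how that could look over a whole run is given below. The run length, the initial capacity, the doubling strategy, the float64 dtype, and the stand-in weight_matrix are all assumptions made for illustration, not part of the original answer.

```python
import os
import tempfile

import h5py
import numpy as np

log_path = os.path.join(tempfile.mkdtemp(), "log.h5")
n_cycles = 25          # hypothetical run length
initial_capacity = 10  # deliberately small so the resize path is exercised

with h5py.File(log_path, "a") as f:
    # Reuse the dataset if the file already has one, otherwise create it.
    if "weights" in f:
        ds = f["weights"]
    else:
        ds = f.create_dataset("weights", shape=(3, 2, initial_capacity),
                              maxshape=(3, 2, None), dtype="f8")

    for cycle_num in range(n_cycles):
        # Stand-in for the array produced by the simulation on this cycle.
        weight_matrix = np.full((3, 2), float(cycle_num))
        if cycle_num >= ds.shape[2]:
            # Out of room: double the capacity along the unlimited axis.
            ds.resize(ds.shape[2] * 2, axis=2)
        ds[:, :, cycle_num] = weight_matrix

    # Trim the unused capacity once the run is finished.
    ds.resize(n_cycles, axis=2)

# Read the log back to check what was stored.
with h5py.File(log_path, "r") as f:
    stored = f["weights"][...]
print(stored.shape)
```

Keeping the file open for the whole loop, as here, avoids reopening it on every cycle; whether that is acceptable depends on how important it is that data already written survives a crash.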
