How to cut up my dataframe in chunks, but keeping groups together

-上瘾入骨i 2021-01-29 08:01

I currently have a massive collection of datasets, one for each year in the 2000s. I take a combination of three years and run cleaning code on it. The problem is that d

2 Answers
  • 2021-01-29 08:26

    You can get chunked pandas by using Blaze. The following is taken from http://blaze.readthedocs.org/en/latest/ooc.html:


    Naive use of Blaze triggers out-of-core systems automatically when called on large files.

    # Data was the entry point in the Blaze release these docs describe
    from blaze import Data

    d = Data('my-small-file.csv')
    d.my_column.count()  # Uses Pandas

    d = Data('my-large-file.csv')
    d.my_column.count()  # Uses Chunked Pandas
    

    How does it work? Blaze breaks up the data resource into a sequence of chunks. It pulls one chunk into memory, operates on it, pulls in the next, and so on. After all chunks are processed, it often has to finalize the computation with another operation on the intermediate results.
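
The chunk-then-finalize pattern described above can be sketched with plain pandas. This is not Blaze's implementation, only an illustration of the idea: the helper name `chunked_count`, the sample CSV text, and the chunk size are all made up for the example.

```python
import io

import pandas as pd


def chunked_count(csv_source, column, chunksize=1000):
    """Count non-null values in `column` by streaming the CSV in chunks."""
    partial_counts = []
    # read_csv with chunksize returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        partial_counts.append(chunk[column].count())
    # finalize: combine the per-chunk intermediate results
    return int(sum(partial_counts))


# hypothetical sample data; the third row has a missing value in my_column
csv_text = "my_column,other\n1,a\n2,b\n,c\n4,d\n5,e\n"
total = chunked_count(io.StringIO(csv_text), "my_column", chunksize=2)
print(total)
```

A count is easy to finalize because partial counts simply add up. Other reductions, such as a mean, would need to carry a sum and a count per chunk and combine them at the end.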

  • 2021-01-29 08:32

    One way to achieve this is as follows:

    import numpy as np
    import pandas as pd

    # generate a random DataFrame
    num_rows = 100

    locs = list('abcdefghijklmno')

    df = pd.DataFrame(
            {'id': np.random.randint(1, 100, num_rows),
             'location': np.random.choice(locs, num_rows),
             'year': np.random.randint(2005, 2007, num_rows)})

    df.sort_values('id', inplace=True)

    print('**** sorted DF (first 10 rows) ****')
    print(df.head(10))

    # chop the DataFrame into chunks of chunk_size unique ids each
    chunk_size = 5

    # every chunk_size-th unique id starts a new chunk;
    # the extra upper bound closes the last chunk so no rows are dropped
    chunks = list(df.id.unique()[::chunk_size]) + [df.id.max() + 1]

    chunk_margins = [(chunks[i-1], chunks[i]) for i in range(1, len(chunks))]

    # .loc replaces the removed .ix indexer; all rows of one id stay in the same chunk
    df_chunks = [df.loc[(df.id >= lo) & (df.id < hi)] for lo, hi in chunk_margins]

    print('**** first chunk ****')
    print(df_chunks[0])
    

    Output:

    **** sorted DF (first 10 rows) ****
        id location  year
    31   2        c  2005
    85   2        e  2006
    89   2        l  2006
    70   2        i  2005
    60   4        n  2005
    68   7        g  2005
    22   7        e  2006
    73  10        i  2005
    23  10        j  2006
    47  16        n  2005
    
    **** first chunk ****
        id location  year
    31   2        c  2005
    85   2        e  2006
    89   2        l  2006
    70   2        i  2005
    60   4        n  2005
    68   7        g  2005
    22   7        e  2006
    73  10        i  2005
    23  10        j  2006
    47  16        n  2005
    6   16        k  2006
    82  16        g  2005
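
The same grouping idea can also be written without explicit margins, by numbering the distinct ids and integer-dividing that number by the chunk size. This is a minimal sketch under that assumption, not part of the answer above; the helper name `split_by_groups` and the small sample frame are hypothetical.

```python
import pandas as pd


def split_by_groups(df, key, groups_per_chunk):
    """Split df into chunks holding at most groups_per_chunk distinct key values."""
    # number each distinct key value 0, 1, 2, ... in sorted order
    codes, _ = pd.factorize(df[key], sort=True)
    # consecutive runs of groups_per_chunk keys share one chunk label
    chunk_ids = codes // groups_per_chunk
    return [part for _, part in df.groupby(chunk_ids, sort=True)]


# hypothetical sample data with six distinct ids
df = pd.DataFrame({
    "id": [7, 2, 2, 4, 10, 7, 16, 21],
    "year": [2005, 2005, 2006, 2005, 2006, 2006, 2005, 2006],
})

chunks = split_by_groups(df, "id", groups_per_chunk=2)
for part in chunks:
    print(sorted(part["id"].unique().tolist()))
```

Because the chunk label is derived from the id itself, every row with a given id lands in the same chunk, and the frame does not need to be sorted first.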
    