Trouble with groupby on millions of keys on a chunked file in Python pandas


I have a very big CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. …

1 Answer
  • 2021-01-14 16:53

    Here's a solution for scaling this problem arbitrarily. It is in effect a high-density version of a related question.

    Define a function to hash each group key down to a smaller number of groups. Design it so that it divides your dataset into pieces that are manageable in memory.

    def sub_group_hash(x):
        # x is a DataFrame containing the 'user_id' column described above
        # return the last 2 characters of user_id
        # if these are digit-like, you will be sub-grouping into 100 sub-groups
        return x['user_id'].str[-2:]
    

    Using the data provided above, grouping on the result of this function looks like so:

    In [199]: [(grp, grouped) for grp, grouped in df.groupby(sub_group_hash(df))][0][1]
    Out[199]: 
                                     user_id  time_stamp  category_clicked
    0  20140512081646222000004-927168801    20140722                 7
    3  20140512081646222000004-927168801    20140724                 1
    

    where grp is the name of the group and grouped is the resulting frame.

    import pandas as pd

    # read in the input in a chunked way
    clean_input_reader = pd.read_csv('input.csv', chunksize=500000)

    # pd.HDFStore replaces the older pd.get_store, which has been removed from pandas
    with pd.HDFStore('output.h5') as store:
        for chunk in clean_input_reader:

            # label each row of this chunk with its sub-group via sub_group_hash
            g = chunk.groupby(sub_group_hash(chunk))

            # append each of the sub-groups to a separate key in the resulting HDF file
            # this will be a loop over the sub-groups (100 max in this case)
            for grp, grouped in g:

                store.append('group_%s' % grp, grouped,
                             data_columns=['user_id', 'time_stamp', 'category_clicked'],
                             min_itemsize=15)
    

    Now you have an HDF5 file with 100 sub-groups (potentially fewer, if not all groups were represented), each of which contains all of the data needed for performing your operation.

    with pd.HDFStore('output.h5') as store:

        # all of the sub-groups are now keys of the store
        for grp in store.keys():

            # this is a complete sub-group that will fit in memory
            grouped = store.select(grp)

            # perform the operation on grouped and write the new output
            grouped.groupby(......).apply(your_cool_function)
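
    The "perform the operation and write the new output" step is left open above. Purely as an illustration (your_cool_function's body, the n_categories column, and results.h5 are hypothetical placeholders, not part of the original answer), the processing loop might be fleshed out like this:

    import pandas as pd

    def your_cool_function(categories):
        # hypothetical per-user computation: how many distinct categories
        # this user clicked; replace with your real logic
        return categories.nunique()

    with pd.HDFStore('output.h5', mode='r') as store, \
            pd.HDFStore('results.h5', mode='w') as out:

        for grp in store.keys():

            # pull one complete sub-group into memory
            grouped = store.select(grp)

            # one row per user for this sub-group
            result = (grouped.groupby('user_id')['category_clicked']
                             .apply(your_cool_function)
                             .rename('n_categories')
                             .reset_index())

            # accumulate the per-user results in a single output table
            out.append('results', result,
                       data_columns=['user_id'],
                       min_itemsize={'user_id': 50})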
    

    So this will reduce the problem by a factor of 100 in this case. If that is not sufficient, simply make sub_group_hash produce more groups.
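
    For example, assuming user_id always ends in digits as in the sample above, widening the hash to the last 3 characters gives up to 1000 sub-groups:

    def sub_group_hash(x):
        # last 3 characters of user_id -> up to 1000 sub-groups
        return x['user_id'].str[-3:]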

    Aim for a relatively small number of sub-groups, as HDF5 works better that way (e.g. don't create 10M sub-groups, which defeats the purpose; 100, 1000, even 10k is fine). 100 should probably work for you, unless your group density is very skewed (e.g. massive numbers of rows in one group and very few in the others).

    Note that this approach scales easily: you could store the sub-groups in separate files if you want, and/or work on them separately (in parallel) if necessary, as in the sketch below.
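
    A minimal sketch of the parallel variant (assumptions: the write phase above has already finished, and process_group stands in for your real per-group computation); each worker opens its own read-only handle, since HDF5 tolerates multiple readers but not concurrent writers:

    from concurrent.futures import ProcessPoolExecutor

    import pandas as pd

    def process_group(key):
        # each worker opens the store read-only and pulls one complete sub-group
        with pd.HDFStore('output.h5', mode='r') as store:
            grouped = store.select(key)
        # hypothetical per-group computation; replace with the real one
        return grouped.groupby('user_id')['category_clicked'].nunique()

    if __name__ == '__main__':
        with pd.HDFStore('output.h5', mode='r') as store:
            keys = store.keys()

        with ProcessPoolExecutor() as ex:
            results = list(ex.map(process_group, keys))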

    This should make your solution's running time approximately O(number_of_sub_groups).
