Multiprocessing writing to pandas dataframe

南旧 2021-02-09 11:03

So what I am trying to do with the following code is to read a list of lists, pass each one through a function called checker, and then have log_result deal with the results.

1 Answer
  • 2021-02-09 11:54

    Updating DataFrames like this from multiple processes isn't going to work:

    dataf = dataf.append(new_row,ignore_index=True)
    

    For one thing, this is very inefficient (O(n) for each append, so O(n^2) in total). The preferred way is to collect the pieces and concat them together in one pass.
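As an aside, the concat-once pattern looks like this (a minimal sketch with made-up data, not the question's actual rows):

```python
import pandas as pd

# Hypothetical data: build the frame in one pass rather than appending
# row by row (each append copies the whole frame, giving O(n^2) overall).
rows = [[1, "a"], [2, "b"], [3, "c"]]
df_once = pd.DataFrame(rows, columns=["id", "label"])

# Equivalent pattern when rows arrive as separate DataFrames: concat once.
parts = [pd.DataFrame([r], columns=["id", "label"]) for r in rows]
df_concat = pd.concat(parts, ignore_index=True)
```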

    For another, and more importantly, updates to dataf aren't protected by a lock, so there's no guarantee that two operations won't conflict (I'd guess this is what's crashing python).

    Finally, append doesn't act in place: the local variable dataf is discarded once the callback finishes, so no changes are ever made to the parent dataf.


    We could use a multiprocessing managed list or dict: a list if you don't care about order, or a dict if you do (e.g. keyed by enumerate), since values come back from the async calls in no well-defined order.
    (Alternatively, we could create an object which implements a Lock ourselves; see Eli Bendersky.)
    So the following changes are made:

    df = pd.DataFrame(existing_data, columns=cols)
    # becomes
    import multiprocessing

    df = pd.DataFrame(existing_data, columns=cols)
    d = multiprocessing.Manager().list([df])
    
    dataf = dataf.append(new_row,ignore_index=True)
    # becomes
    d.append(new_row)
    

    Now, once the async calls have finished, you have a managed list of DataFrames. You can concat these (with ignore_index=True) to get the desired result:

    pd.concat(d, ignore_index=True)
    

    Should do the trick.
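Put together, the whole flow might look something like this. This is only a sketch under assumptions: checker and log_result here are hypothetical stand-ins for the question's functions, and the input chunks are made up.

```python
import multiprocessing

import pandas as pd

def checker(chunk):
    # Hypothetical stand-in for the question's checker: turn a chunk of
    # inputs into a list of result rows.
    return [[x, x * 2] for x in chunk]

def main():
    manager = multiprocessing.Manager()
    d = manager.list()  # shared, lock-protected list of row batches

    def log_result(rows):
        # Runs in the parent process as each async task completes; the
        # managed list serialises the appends.
        d.append(pd.DataFrame(rows, columns=["x", "double"]))

    with multiprocessing.Pool(2) as pool:
        for chunk in [[1, 2], [3, 4], [5, 6]]:
            pool.apply_async(checker, (chunk,), callback=log_result)
        pool.close()
        pool.join()

    # Batches arrive in no guaranteed order; concat them in one pass.
    return pd.concat(list(d), ignore_index=True)

if __name__ == "__main__":
    result = main()
    print(len(result))
```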


    Note: creating the new_row DataFrame at each stage is also less efficient than letting pandas parse the list of lists directly into a DataFrame in one go. Hopefully this is a toy example; really you want your chunks to be quite large to see wins with multiprocessing (I've heard 50kb as a rule of thumb...), and a row at a time is never going to be a win here.


    Aside: you should avoid using globals (like df) in your code; it's much cleaner to pass them around in your functions (in this case, as an argument to checker).
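For example, a minimal sketch of that style (the checker signature and data here are invented for illustration, not the question's actual code):

```python
import pandas as pd

def checker(row, df):
    # df is received as a parameter instead of being read from a
    # module-level global, so the dependency is explicit and testable.
    return row[0] in set(df["id"])

existing = pd.DataFrame([[1], [2]], columns=["id"])
print(checker([2, "b"], existing))   # prints True
```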
