Pandas combining rows based on dates

醉梦人生 2020-12-06 14:59

I have a dataframe of customers with records for shipments they received. Unfortunately, these can overlap. I'm trying to reduce rows so that I can see dates of consecuti

2 Answers
  • 2020-12-06 15:49

    If you are open to using an auxiliary data frame to hold the result, you can simply loop through all the rows:

    import pandas as pd

    # Assumes df is sorted by startDate; customers are not distinguished here.
    df['startDate'] = pd.to_datetime(df['startDate'])
    df['endDate'] = pd.to_datetime(df['endDate'])

    results = [df.iloc[0].copy()]

    for _, row in df.iloc[1:].iterrows():
        if row['startDate'] > results[-1]['endDate']:
            # Gap before this row: start a new merged range.
            results.append(row.copy())
        elif row['endDate'] > results[-1]['endDate']:
            # Overlap: extend the current range.
            results[-1]['endDate'] = row['endDate']

    print(pd.DataFrame(results).reset_index(drop=True))
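
    Since the question is about shipments per customer, the row loop can also be run once per customer. The following is only a minimal sketch, not the answerer's code: the sample values and the helper name merge_overlaps are hypothetical, while the column names (Cust, shipNo, startDate, endDate) are the ones used elsewhere in this thread.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the thread.
df = pd.DataFrame({
    "Cust": ["A", "A", "A", "B"],
    "shipNo": [1, 2, 3, 4],
    "startDate": ["2020-01-01", "2020-01-05", "2020-02-01", "2020-01-03"],
    "endDate": ["2020-01-10", "2020-01-20", "2020-02-05", "2020-01-04"],
})
df["startDate"] = pd.to_datetime(df["startDate"])
df["endDate"] = pd.to_datetime(df["endDate"])


def merge_overlaps(group):
    """Merge overlapping date ranges within one customer's rows."""
    merged = []
    for row in group.sort_values("startDate").to_dict("records"):
        if merged and row["startDate"] <= merged[-1]["endDate"]:
            # Overlap: extend the current range if this row ends later.
            merged[-1]["endDate"] = max(merged[-1]["endDate"], row["endDate"])
        else:
            # Gap (or first row): start a new range.
            merged.append(dict(row))
    return merged


rows = []
for _, group in df.groupby("Cust"):
    rows.extend(merge_overlaps(group))

result = pd.DataFrame(rows)
print(result)
```

    Each merged row keeps the shipNo of the earliest shipment in its range, which is a choice made for this sketch rather than something the question specifies.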
    
  • 2020-12-06 15:51

    Fundamentally, I think this is a graph connectivity problem: treat each date range as a node, connect the ranges that overlap, and every connected component is one merged range. pandas doesn't include graph tools, but SciPy does. You can use the compressed sparse graph (csgraph) submodule in SciPy to solve your problem like this:

    import pandas as pd
    from scipy.sparse.csgraph import connected_components
    
    # convert to datetime, so min() and max() work
    df['startDate'] = pd.to_datetime(df['startDate'])
    df['endDate'] = pd.to_datetime(df['endDate'])
    
    def reductionFunction(data):
        # create a 2D graph of connectivity between date ranges
        start = data.startDate.values
        end = data.endDate.values
        graph = (start <= end[:, None]) & (end >= start[:, None])
    
        # find connected components in this graph
        n_components, indices = connected_components(graph)
    
        # group the results by these connected components
        return data.groupby(indices).aggregate({'startDate': 'min',
                                                'endDate': 'max',
                                                'shipNo': 'first'})
    
    df.groupby(['Cust']).apply(reductionFunction).reset_index('Cust')
    

    If you want to do something different with shipNo from here, it should be pretty straightforward.
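
    As one illustration of handling shipNo differently, the sketch below collects every shipment number in a merged range into a list instead of keeping only the first. It is an assumption-laden variant, not part of the original answer: the sample values, the function name reduce_with_all_shipments, and the output column names shipNos and range are all hypothetical.

```python
import pandas as pd
from scipy.sparse.csgraph import connected_components

# Hypothetical sample data; only the column names come from the thread.
df = pd.DataFrame({
    "Cust": ["A", "A", "A", "B"],
    "shipNo": [1, 2, 3, 4],
    "startDate": pd.to_datetime(
        ["2020-01-01", "2020-01-05", "2020-02-01", "2020-01-03"]),
    "endDate": pd.to_datetime(
        ["2020-01-10", "2020-01-20", "2020-02-05", "2020-01-04"]),
})


def reduce_with_all_shipments(data):
    """Merge overlapping ranges and keep all shipment numbers per range."""
    start = data["startDate"].to_numpy()
    end = data["endDate"].to_numpy()
    # graph[i, j] is True when range j overlaps range i.
    graph = (start <= end[:, None]) & (end >= start[:, None])
    _, labels = connected_components(graph.astype(int), directed=False)
    return data.groupby(labels).agg(
        startDate=("startDate", "min"),
        endDate=("endDate", "max"),
        shipNos=("shipNo", list),  # collect instead of taking 'first'
    )


# Run the reduction once per customer and stitch the pieces together.
pieces = {cust: reduce_with_all_shipments(g) for cust, g in df.groupby("Cust")}
result = pd.concat(pieces, names=["Cust", "range"]).reset_index()
print(result)
```

    Any other aggregation (for example 'last' or a count) can be swapped in at the shipNos line in the same way.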

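    Note that the connected_components() function above is not a brute-force search; it uses a fast graph traversal to find the connections. The dense overlap matrix built before it, however, has one entry per pair of rows within a customer, so its memory use grows quadratically with the size of each group.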
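
    To see why connectivity, rather than pairwise overlap alone, is what matters here, consider a chain of ranges in which the first and third do not overlap directly but both overlap the second. The small sketch below uses hypothetical integer day offsets instead of dates and wraps the matrix in csr_matrix as a conservative choice for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical ranges as day offsets: [0, 5], [4, 9], [8, 12], [20, 25].
start = np.array([0, 4, 8, 20])
end = np.array([5, 9, 12, 25])

# graph[i, j] is True when range j overlaps range i.
graph = (start <= end[:, None]) & (end >= start[:, None])

# Label each range with the connected component it belongs to.
n_components, labels = connected_components(
    csr_matrix(graph.astype(int)), directed=False)

print(graph.astype(int))
print(n_components, labels)
```

    Ranges 0 and 2 share no days, yet they receive the same label because range 1 bridges them, so all three collapse into a single merged range while the fourth stays separate.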
