How to group dataframe rows into list in pandas groupby

日久生厌 2020-11-21 04:56

I have a pandas data frame df like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second col

12 answers
  •  囚心锁ツ
    2020-11-21 05:52

    This answer is based on @EdChum's comment on his own answer, which reads:

    groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think 
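
    To make the idea in that comment concrete, here is a minimal sketch on the six-row sample from the question. It assumes the rows can be sorted by `a`, and the names `positions`, `bounds`, `values` and `lists` are only illustrative, not part of the original comment.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"a": ["A", "A", "B", "B", "B", "C"],
                   "b": [1, 2, 5, 5, 4, 6]})

# A stable sort keeps the original order of b within each key
df = df.sort_values("a", kind="stable").reset_index(drop=True)

# After reset_index the index labels equal the row positions,
# so min/max of the index give the first and last row of each key
positions = df.index.to_series().groupby(df["a"]).agg(["min", "max"])
bounds = {key: (lo, hi) for key, lo, hi in positions.itertuples()}

# Slice a plain Python list once per key (hi is inclusive, hence +1)
values = df["b"].tolist()
lists = {key: values[lo:hi + 1] for key, (lo, hi) in bounds.items()}
print(lists)  # {'A': [1, 2], 'B': [5, 5, 4], 'C': [6]}
```

    Each key maps to one contiguous slice of the sorted data, which is why a single sort plus cheap list slicing can stand in for building every group separately.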
    

    Let's first create a dataframe with 500k categories in the first column and 20 million rows in total, as mentioned in the question.

    import numpy as np
    import pandas as pd

    # 20 million rows, with up to 500k distinct string categories in column a
    n_rows = 20000000
    df = pd.DataFrame({
        'a': np.random.randint(low=0, high=500000, size=n_rows).astype(str),
        'b': np.arange(n_rows),
    })
    print(df.shape)
    df.head()
    
    # Sort data by first column (stable sort keeps the order of b within each category)
    df.sort_values(by=['a'], ascending=True, kind='stable', inplace=True)
    df.reset_index(drop=True, inplace=True)
    
    # Create a temp column holding each row's position after sorting
    df['temp_idx'] = np.arange(len(df))
    
    # Take all values of b in a separate list
    all_values_b = df['b'].tolist()
    print(len(all_values_b))
    
    # For each category in column a, find min and max indexes
    gp_df = df.groupby(['a']).agg({'temp_idx': ['min', 'max']})
    gp_df.reset_index(inplace=True)
    gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
    
    # Now create final list_b column by slicing the list of b between the min and max
    # indexes of each category (max is inclusive, hence the +1)
    gp_df['list_b'] = [
        all_values_b[lo:hi + 1]
        for lo, hi in zip(gp_df['temp_idx_min'], gp_df['temp_idx_max'])
    ]
    
    print(gp_df.shape)
    gp_df.head()
    

    The code above takes 2 minutes for 20 million rows and 500k categories in the first column.
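
    As a possible variant (not something I have timed on the 20-million-row data), the same sort-then-slice idea can be written without any groupby call by letting NumPy find the group boundaries. `group_to_lists` is a hypothetical helper name, and the check at the end only compares it with the usual `groupby(...).apply(list)` on a small random frame.

```python
import numpy as np
import pandas as pd

def group_to_lists(df, key="a", value="b"):
    """Hypothetical helper: one list of `value` per `key`, without groupby."""
    # Stable argsort keeps the original order of values within each key
    order = np.argsort(df[key].to_numpy(), kind="stable")
    keys = df[key].to_numpy()[order]
    vals = df[value].to_numpy()[order]
    # return_index gives the first position of each unique key in the sorted keys
    uniq, starts = np.unique(keys, return_index=True)
    # Cut the sorted values at every group start except the first
    chunks = np.split(vals, starts[1:])
    return pd.Series([c.tolist() for c in chunks],
                     index=pd.Index(uniq, name=key), name=value)

# Small random frame for a sanity check (sizes are arbitrary)
rng = np.random.default_rng(0)
small = pd.DataFrame({"a": rng.integers(0, 50, size=1000).astype(str),
                      "b": np.arange(1000)})

result = group_to_lists(small)
expected = small.groupby("a")["b"].apply(list)
print(result.to_dict() == expected.to_dict())
```

    Whether this beats the groupby-based min/max step depends on the data and the pandas/NumPy versions, so it is worth timing both on your own frame.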
