How to group dataframe rows into list in pandas groupby

日久生厌 2020-11-21 04:56

I have a pandas data frame df like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get the second column's values as a list for each group.

12 answers
  • 2020-11-21 05:45

    You can do this using groupby to group on the column of interest and then apply list to every group:

    In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
            df
    
    Out[1]: 
       a  b
    0  A  1
    1  A  2
    2  B  5
    3  B  5
    4  B  4
    5  C  6
    
    In [2]: df.groupby('a')['b'].apply(list)
    Out[2]: 
    a
    A       [1, 2]
    B    [5, 5, 4]
    C          [6]
    Name: b, dtype: object
    
    In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
            df1
    Out[3]: 
       a        new
    0  A     [1, 2]
    1  B  [5, 5, 4]
    2  C        [6]
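
    For readers who prefer a plain script over an IPython session, the same steps can be sketched as below. This is only a minimal restatement of the answer above; the variable name lists_per_group is introduced here for illustration and is not part of the original.

```python
import pandas as pd

# Same sample data as in the session above.
df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# One Python list per group; rows keep their original order within a group.
lists_per_group = df.groupby('a')['b'].apply(list)

# Turn the group labels back into a regular column named 'a'
# and call the column of lists 'new'.
df1 = lists_per_group.reset_index(name='new')

print(df1)
```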
    
  • 2020-11-21 05:52

    It is time to use agg instead of apply.

    Given:

    df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})
    

    If you want multiple columns collected into lists, the result is a pd.DataFrame:

    df.groupby('a')[['b', 'c']].agg(list)
    # or 
    df.groupby('a').agg(list)
    

    If you want a single column collected into lists, the result is a pd.Series:

    df.groupby('a')['b'].agg(list)
    #or
    df.groupby('a')['b'].apply(list)
    

    Note: producing a pd.DataFrame is about 10x slower than producing a pd.Series when you aggregate only a single column, so reserve the DataFrame form for the multi-column case.
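
    The speed difference is easy to measure on your own machine. The sketch below is one way to do it; the data size, the number of groups, and the helper names as_series and as_frame are assumptions made for illustration, and the ratio you observe may differ from the figure quoted above depending on the pandas version and the data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 100,000 rows spread over up to 1,000 groups.
rng = np.random.default_rng(0)
bench = pd.DataFrame({'a': rng.integers(0, 1_000, size=100_000),
                      'b': rng.integers(0, 100, size=100_000)})

def as_series():
    # Select the column first, so the result is a pd.Series of lists.
    return bench.groupby('a')['b'].agg(list)

def as_frame():
    # Keep a one-column DataFrame selection, so the result is a pd.DataFrame.
    return bench.groupby('a')[['b']].agg(list)

# Both paths build the same lists; only the container type differs.
same_lists = as_series().tolist() == as_frame()['b'].tolist()

t_series = timeit.timeit(as_series, number=3)
t_frame = timeit.timeit(as_frame, number=3)
print(f"Series: {t_series:.3f}s  DataFrame: {t_frame:.3f}s  same lists: {same_lists}")
```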

  • 2020-11-21 05:52

    This answer is based on @EdChum's comment on his answer. The comment reads:

    groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think 
    

    Let's first create a dataframe with 500k categories in the first column and 20 million rows in total, as mentioned in the question.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(columns=['a', 'b'])
    df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
    df['b'] = list(range(20000000))
    print(df.shape)
    df.head()
    
    # Sort data by first column 
    df.sort_values(by=['a'], ascending=True, inplace=True)
    df.reset_index(drop=True, inplace=True)
    
    # Create a temp column
    df['temp_idx'] = list(range(df.shape[0]))
    
    # Take all values of b in a separate list
    all_values_b = list(df.b.values)
    print(len(all_values_b))
    
    # For each category in column a, find min and max indexes
    # (string names 'min'/'max' avoid the warning newer pandas emits for np.min/np.max)
    gp_df = df.groupby(['a']).agg({'temp_idx': ['min', 'max']})
    gp_df.reset_index(inplace=True)
    gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
    
    # Now create final list_b column, using min and max indexes for each category of a and filtering list of b. 
    gp_df['list_b'] = [all_values_b[lo:hi + 1]
                       for lo, hi in zip(gp_df['temp_idx_min'], gp_df['temp_idx_max'])]
    
    print(gp_df.shape)
    gp_df.head()
    

    The code above takes about 2 minutes for 20 million rows and 500k categories in the first column.
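
    The same sort-then-slice idea can also be sketched directly in NumPy, without the temporary index column. Note that this is a swapped-in variant, not the code from the answer: the function name lists_by_sorted_slices is hypothetical, and the sketch assumes a non-empty frame with no missing keys. It has not been timed against the version above.

```python
import numpy as np
import pandas as pd

def lists_by_sorted_slices(frame, key, value):
    """Group `value` into lists per `key` by sorting once and slicing."""
    keys = frame[key].to_numpy()
    values = frame[value].to_numpy()

    # A stable sort keeps the original row order inside each group.
    order = np.argsort(keys, kind='stable')
    sorted_keys = keys[order]
    sorted_values = values[order]

    # Positions where the key changes mark the start of each new group.
    starts = np.flatnonzero(sorted_keys[1:] != sorted_keys[:-1]) + 1
    chunks = np.split(sorted_values, starts)
    labels = sorted_keys[np.r_[0, starts]]

    return pd.Series([chunk.tolist() for chunk in chunks],
                     index=labels, name=value)

# Small unsorted example to check the result against groupby.
small = pd.DataFrame({'a': ['B', 'A', 'B', 'C', 'A', 'B'],
                      'b': [5, 1, 5, 6, 2, 4]})
result = lists_by_sorted_slices(small, 'a', 'b')
print(result)
```

    On the small frame this should give the same lists as small.groupby('a')['b'].agg(list), which is a quick way to sanity-check the slicing logic before trying it on large data.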

  • 2020-11-21 05:55

    If you are looking for a list of unique values per group across multiple columns, this may help:

    df.groupby('a').agg(lambda x: list(set(x))).reset_index()
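
    One caveat with list(set(x)) is that a set does not promise any particular element order. If the first-seen order within each group matters, a possible variant is sketched below; it swaps in dict.fromkeys for set, and the sample frame and the name unique_ordered are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [1, 2, 5, 5, 4, 6]})

# dict.fromkeys drops duplicates while keeping first-seen order,
# whereas list(set(x)) makes no promise about element order.
unique_ordered = (df.groupby('a')
                    .agg(lambda x: list(dict.fromkeys(x)))
                    .reset_index())

print(unique_ordered)
```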
    
  • 2020-11-21 05:57

    As you were saying, the groupby method of a pd.DataFrame object can do the job.

    Example

     L = ['A','A','B','B','B','C']
     N = [1,2,5,5,4,6]
    
     import pandas as pd
     df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))  # list(...) also works on older pandas
    
    
     groups = df.groupby(df.L)
    
     groups.groups
          {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
    

    which gives an index-wise description of the groups.

    To get the elements of a single group, you can do, for instance:

     groups.get_group('A')
    
         L  N
      0  A  1
      1  A  2
    
      groups.get_group('B')
    
         L  N
      2  B  5
      3  B  5
      4  B  4
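
    To get from this index-wise description to the lists the question asks for, one option is to look up the N values for each group's index labels. The following is a minimal sketch under that assumption; the name lists_from_labels is hypothetical.

```python
import pandas as pd

L = ['A', 'A', 'B', 'B', 'B', 'C']
N = [1, 2, 5, 5, 4, 6]
df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))

groups = df.groupby('L')

# groups.groups maps each key to the index labels of its rows;
# use those labels to pull the matching N values into a list.
lists_from_labels = {key: df.loc[labels, 'N'].tolist()
                     for key, labels in groups.groups.items()}

print(lists_from_labels)
```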
    
  • 2020-11-21 05:58

    Let us use df.groupby with tolist and the Series constructor:

    pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
    Out[664]: 
    A       [1, 2]
    B    [5, 5, 4]
    C          [6]
    dtype: object
    