Sum values of columns starting with the same string in a pandas DataFrame

小鲜肉 2021-01-12 00:47

I have a dataframe with about 100 columns that looks like this:

   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \\
0  56            


        
2 answers
  • 2021-01-12 01:15

    I'd suggest a different approach: transpose the DataFrame, group the rows (your original columns) by their prefix, sum, and transpose back.

    Consider the following:

    df = pd.DataFrame({
            'a_a': [1, 2, 3, 4],
            'a_b': [2, 3, 4, 5],
            'b_a': [1, 2, 3, 4],
            'b_b': [2, 3, 4, 5],
        })
    

    Now

    [s.split('_')[0] for s in df.T.index.values]
    

    gives the prefix of each column. So

    >>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
        a   b
    0   3   3
    1   5   5
    2   7   7
    3   9   9
    

    does what you want.

    In your case, make sure to split using the '-' character.
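
    For column names shaped like the ones in the question, the same transpose/groupby/transpose steps might look like the sketch below. The sample values are made up for illustration, and it assumes that everything before the first '-' is the subject name and that 'Id' should become the index.

```python
import pandas as pd

# Hypothetical sample data; only the column-name pattern comes from the question.
df = pd.DataFrame({
    'Id': [56, 11],
    'Economics-1': [1, 0],
    'Economics-zz': [0, 1],
    'English-107': [1, 0],
    'English-2': [0, 0],
    'History-3': [1, 0],
}).set_index('Id')

# Everything before the first '-' is used as the group label.
prefixes = [col.split('-')[0] for col in df.columns]

# Transpose so the columns become rows, group those rows by prefix,
# sum within each group, then transpose back.
totals = df.T.groupby(prefixes).sum().T
print(totals)
```

    Setting 'Id' as the index first keeps it out of the sums.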

  • 2021-01-12 01:25

    Using DSM's brilliant idea:

    from __future__ import print_function
    
    import pandas as pd
    
    categories = set(['Economics', 'English', 'Histo', 'Literature'])
    
    def correct_categories(cols):
        return [cat for col in cols for cat in categories if col.startswith(cat)]    
    
    df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
    
    #print(df)
    # groupby(axis=1) is deprecated since pandas 2.1; transposing gives the same result
    print(df.T.groupby(correct_categories(df.columns)).sum().T)
    

    Output:

        Economics  English  Histo  Literature
    Id
    56          1        1      2           1
    11          1        0      0           1
    6           1        1      0           0
    43          2        0      1           1
    14          1        1      1           0
    

    Here is another version that takes care of the "Histo"/"History" naming problem:

    from __future__ import print_function
    
    import pandas as pd
    
    #categories = set(['Economics', 'English', 'Histo', 'Literature'])
    
    #
    # mapping: common starting pattern: desired name
    #
    categories = {
        'Histo': 'History',
        'Economics': 'Economics',
        'English': 'English',
        'Literature': 'Literature'
    }
    
    def correct_categories(cols):
        return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]
    
    df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
    #print(df.columns, len(df.columns))
    #print(correct_categories(df.columns), len(correct_categories(df.columns)))
    #print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())
    
    # groupby(axis=1) is deprecated since pandas 2.1; transposing gives the same result
    rslt = df.T.groupby(correct_categories(df.columns)).sum().T
    print(rslt)
    print('History\n', rslt['History'])
    

    Output:

        Economics  English  History  Literature
    Id
    56          1        1        2           1
    11          1        0        0           1
    6           1        1        0           0
    43          2        0        1           1
    14          1        1        1           0
    History
     Id
    56    2
    11    0
    6     0
    43    1
    14    1
    Name: History, dtype: int64
    

    PS: You may want to add any missing categories to the categories dictionary.
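
    The reason this matters: the list comprehension yields no label for a column that matches none of the patterns, so the label list ends up shorter than the column list and no longer lines up with the columns. A minimal sketch of one way to guard against that, returning exactly one label per column, is shown here. The helper name label_for, the 'Other' fallback, and the sample data are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Mapping: common starting pattern -> desired group name (as in the answer above).
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature',
}

def label_for(col, default='Other'):
    """Return the group name for one column, or `default` if no pattern matches."""
    for pattern, name in categories.items():
        if col.startswith(pattern):
            return name
    return default

# Hypothetical sample data standing in for data.csv.
df = pd.DataFrame({
    'Id': [56, 11],
    'History-3': [1, 0],
    'Histo-9': [1, 0],
    'Economics-1': [1, 1],
    'Physics-4': [2, 3],   # matches no pattern, so it falls into 'Other'
}).set_index('Id')

labels = [label_for(col) for col in df.columns]  # exactly one label per column
rslt = df.T.groupby(labels).sum().T              # transpose instead of axis=1
print(rslt)
```

    Unmatched columns are collected under a single fallback name rather than dropped, so they stay visible in the result.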
