Finding top N columns for each row in data frame

盖世英雄少女心 2021-01-05 01:24

Given a data frame with one descriptive column and X numeric columns, for each row I'd like to identify the top N columns with the highest values and save them as rows on a ne

5 answers
  • 2021-01-05 01:48

    Yet another crazy one-liner, given n = 3

    # Boolean mask over the columns: True where the column is among the row's n largest values
    {index: option for (index, option) in zip(df['index'],
        [df.columns[x[1].index.isin(x[1][1:].sort_values()[-n:].index)].tolist()
            for x in df.iterrows()])}
    
    {'A': ['option2', 'option3', 'option4'],
     'C': ['option2', 'option4', 'option5'],
     'B': ['option1', 'option3', 'option4'],
     'E': ['option1', 'option2', 'option3'],
     'D': ['option1', 'option2', 'option5'],
     'F': ['option1', 'option3', 'option5']}
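
For comparison, here is a minimal sketch of the same idea using `Series.nlargest` instead of the masking trick. This is a swapped-in technique, not the answer's method. The sample frame is an assumption rebuilt from the values quoted elsewhere in this thread, and the names `values` and `top` are made up for illustration. Unlike the output above, each list comes back ordered from largest to smallest value rather than in column order.

```python
import pandas as pd

# Sample frame (assumed; rebuilt from the values shown in this thread).
df = pd.DataFrame({
    "index":   ["A", "B", "C", "D", "E", "F"],
    "option1": [1, 5, 3, 7, 9, 3],
    "option2": [8, 4, 5, 6, 9, 2],
    "option3": [9, 9, 1, 3, 9, 5],
    "option4": [3, 8, 3, 5, 7, 0],
    "option5": [2, 3, 4, 9, 4, 2],
})
n = 3

# Use the descriptive column as the row label so only numeric columns remain.
values = df.set_index("index")

# For each row, nlargest(n) keeps the n biggest values;
# the index of that result holds the matching column names.
top = {label: row.nlargest(n).index.tolist() for label, row in values.iterrows()}

for label, names in top.items():
    print(label, names)
```

How ties are broken (rows C and F each have two equal candidates for the last slot) depends on the `keep` argument of `nlargest`.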
    
  • 2021-01-05 01:52

    Let's assume

    N = 3
    

    First, I build a matrix of the input fields and, for each cell, remember which option it came from:

    matrix = [[(j, 'option' + str(i)) for j in df['option' + str(i)]] for i in range(1,6)]
    

    The result of this line will be:

    [
     [(1, 'option1'), (5, 'option1'), (3, 'option1'), (7, 'option1'), (9, 'option1'), (3, 'option1')],
     [(8, 'option2'), (4, 'option2'), (5, 'option2'), (6, 'option2'), (9, 'option2'), (2, 'option2')],
     [(9, 'option3'), (9, 'option3'), (1, 'option3'), (3, 'option3'), (9, 'option3'), (5, 'option3')],
     [(3, 'option4'), (8, 'option4'), (3, 'option4'), (5, 'option4'), (7, 'option4'), (0, 'option4')],
     [(2, 'option5'), (3, 'option5'), (4, 'option5'), (9, 'option5'), (4, 'option5'), (2, 'option5')]
    ]
    

    Then we can easily transpose the matrix with zip, sort each resulting row by the first element of the tuple in descending order, and take the first N items:

    transformed = [sorted(l, key=lambda x: x[0], reverse=True)[:N] for l in zip(*matrix)]
    

    The transformed list will look like:

    [
     [(9, 'option3'), (8, 'option2'), (3, 'option4')],
     [(9, 'option3'), (8, 'option4'), (5, 'option1')],
     [(5, 'option2'), (4, 'option5'), (3, 'option1')],
     [(9, 'option5'), (7, 'option1'), (6, 'option2')],
     [(9, 'option1'), (9, 'option2'), (9, 'option3')],
     [(5, 'option3'), (3, 'option1'), (2, 'option2')]
    ]
    

    The last step is to join each row label with the option names from its tuples:

    for label, top in zip(df['index'], transformed):
        for option in top:
            print(label + ',' + option[1])
        print()
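
The three steps above can be tried without pandas. The following is a minimal sketch, assuming the columns are available as plain lists; `columns`, `labels`, and `rows` are hypothetical names, and the values are the ones shown in the matrix above.

```python
N = 3

# Plain-dict stand-in for the data frame columns (values taken from the matrix above).
columns = {
    "option1": [1, 5, 3, 7, 9, 3],
    "option2": [8, 4, 5, 6, 9, 2],
    "option3": [9, 9, 1, 3, 9, 5],
    "option4": [3, 8, 3, 5, 7, 0],
    "option5": [2, 3, 4, 9, 4, 2],
}
labels = ["A", "B", "C", "D", "E", "F"]

# One list per column, each cell tagged with the column it came from.
matrix = [[(value, name) for value in values] for name, values in columns.items()]

# zip(*matrix) regroups the cells by row. sorted() is stable, so equal values
# keep their column order.
transformed = [
    sorted(cells, key=lambda cell: cell[0], reverse=True)[:N]
    for cells in zip(*matrix)
]

# Flatten to "label,column" strings, one per selected cell.
rows = [f"{label},{name}" for label, top in zip(labels, transformed) for _, name in top]
print("\n".join(rows))
```

Collecting the strings in a list instead of printing them directly makes it easy to write them to a file or build a new data frame afterwards.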
    
  • 2021-01-05 01:59

    This might not be so elegant, but I think it pretty much gets what you want:

    n = 3
    df.index = pd.Index(df['index'])
    del df['index']
    df = df.transpose().unstack()
    for i, g in df.groupby(level=0):
        g = g.sort_values(ascending=False)
        print(i, list(g.index.get_level_values(1)[:n]))
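
If the goal is a new frame with one row per (label, column) pair, a reshape-then-filter approach along the same lines could look like the sketch below. It uses `melt` rather than `transpose().unstack()`, which is a substitution and not this answer's code. The sample frame is an assumption rebuilt from values quoted in this thread, and the names `long` and `top` are illustrative.

```python
import pandas as pd

# Sample frame (assumed; rebuilt from the values shown in this thread).
df = pd.DataFrame({
    "index":   ["A", "B", "C", "D", "E", "F"],
    "option1": [1, 5, 3, 7, 9, 3],
    "option2": [8, 4, 5, 6, 9, 2],
    "option3": [9, 9, 1, 3, 9, 5],
    "option4": [3, 8, 3, 5, 7, 0],
    "option5": [2, 3, 4, 9, 4, 2],
})
n = 3

# melt() turns every (row label, column) cell into its own record.
long = df.melt(id_vars="index", var_name="option", value_name="value")

# Order by label, then by value descending, and keep the first n records per label.
top = (
    long.sort_values(["index", "value"], ascending=[True, False])
        .groupby("index")
        .head(n)
        .reset_index(drop=True)
)
print(top)
```

The result keeps the values next to the column names, so it can be filtered or exported without another pass over the original frame.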
    
  • 2021-01-05 02:02

    If you just want pairings:

    from operator import itemgetter as it
    from itertools import repeat
    n = 3
    
    # On pandas < 0.17, use order() instead of sort_values()
    new_d = (zip(repeat(row["index"]), map(it(0), row[1:].sort_values(ascending=False)[:n].items()))
                     for _, row in df.iterrows())
    for row in new_d:
        print(list(row))
    

    Output:

    [('A', 'option3'), ('A', 'option2'), ('A', 'option4')]
    [('B', 'option3'), ('B', 'option4'), ('B', 'option1')]
    [('C', 'option2'), ('C', 'option5'), ('C', 'option1')]
    [('D', 'option5'), ('D', 'option1'), ('D', 'option2')]
    [('E', 'option1'), ('E', 'option2'), ('E', 'option3')]
    [('F', 'option3'), ('F', 'option1'), ('F', 'option2')]
    

    Which also maintains the order.

    If you want a list of lists:

    from operator import itemgetter as it
    from itertools import repeat
    n = 3
    
    new_d = [list(zip(repeat(row["index"]), map(it(0), row[1:].sort_values(ascending=False)[:n].items())))
                     for _, row in df.iterrows()]
    

    Output:

    [[('A', 'option3'), ('A', 'option2'), ('A', 'option4')],
    [('B', 'option3'), ('B', 'option4'), ('B', 'option1')], 
    [('C', 'option2'), ('C', 'option5'), ('C', 'option1')], 
    [('D', 'option5'), ('D', 'option1'), ('D', 'option2')], 
    [('E', 'option1'), ('E', 'option2'), ('E', 'option3')],
    [('F', 'option3'), ('F', 'option1'), ('F', 'option2')]]
    

    Or using Python's sorted:

    new_d = [list(zip(repeat(row["index"]), map(it(0), sorted(row[1:].items(), key=it(1), reverse=True)[:n])))
                         for _, row in df.iterrows()]
    

    This version is actually the fastest. If you really want strings, it is trivial to format the output however you want.
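
As an illustration of that last point, the pairs can be joined into comma-separated strings. This is a minimal sketch: `new_d` here is a hand-copied excerpt of the list-of-lists output above (first two rows only), and `lines` and `text` are hypothetical names.

```python
# Excerpt of the list-of-lists result shown above (first two rows only).
new_d = [
    [("A", "option3"), ("A", "option2"), ("A", "option4")],
    [("B", "option3"), ("B", "option4"), ("B", "option1")],
]

# Each pair is a (label, column) tuple of strings, so ",".join works directly.
lines = [",".join(pair) for row in new_d for pair in row]

# One string per line, ready to print or write to a file.
text = "\n".join(lines)
print(text)
```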

  • 2021-01-05 02:04
    dfc = df.copy()
    result = {}

    # First, effectively transpose the frame: map each row label to its (column, value) pairs.
    for key in dfc:
        if key != 'index':
            for i in range(len(dfc['index'])):
                if dfc['index'][i] not in result:
                    result[dfc['index'][i]] = []
                result[dfc['index'][i]] += [(key, dfc[key][i])]


    def get_topn(result, n):
        # Return the names of the n columns with the largest values, largest first.
        return [x[0] for x in sorted(result, key=lambda x: -x[1])[:n]]


    # Lastly, print the output in the desired format.
    n = 3
    for key in sorted(result):
        for option in get_topn(result[key], n):
            print(str(key) + ',' + str(option))
        print()
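
A possible variation on `get_topn` is to use `heapq.nlargest` from the standard library, which avoids sorting the whole list when only a few items are needed. This is a sketch under assumptions, not the answer's code: `result` below is a hand-written two-row excerpt in the same label-to-pairs shape the transposition loop produces, and `lines` is a hypothetical name.

```python
import heapq

# Same shape as the transposed dict above: label -> list of (column, value) pairs.
# Only two rows are reproduced here, with values quoted from this thread.
result = {
    "A": [("option1", 1), ("option2", 8), ("option3", 9), ("option4", 3), ("option5", 2)],
    "D": [("option1", 7), ("option2", 6), ("option3", 3), ("option4", 5), ("option5", 9)],
}


def get_topn(pairs, n):
    """Return the column names of the n largest values, largest first."""
    return [name for name, _ in heapq.nlargest(n, pairs, key=lambda pair: pair[1])]


# Build "label,column" strings in label order.
lines = [f"{label},{name}" for label in sorted(result) for name in get_topn(result[label], 3)]
print("\n".join(lines))
```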
    