How to apply a function to two columns of Pandas dataframe

名媛妹妹 2020-11-22 06:17

Suppose I have a df with the columns 'ID', 'col_1', 'col_2', and I define a function:

f = lambda x, y : my_function_expres

12 answers
  • 2020-11-22 06:24

    Returning a list from apply is risky because the resulting object is not guaranteed to be either a Series or a DataFrame, and in some cases an exception is raised. Let's walk through a simple example:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
                      columns=['a', 'b', 'c'])
    df
       a  b  c
    0  4  0  0
    1  2  0  1
    2  2  2  2
    3  1  2  2
    4  3  0  0
    

    There are three possible outcomes when returning a list from apply:

    1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.

    df.apply(lambda x: list(range(2)), axis=1)  # returns a Series
    0    [0, 1]
    1    [0, 1]
    2    [0, 1]
    3    [0, 1]
    4    [0, 1]
    dtype: object
    

    2) When the length of the returned list is equal to the number of columns then a DataFrame is returned and each column gets the corresponding value in the list.

    df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
       a  b  c
    0  0  1  2
    1  0  1  2
    2  0  1  2
    3  0  1  2
    4  0  1  2
    

    3) If the list returned for the first row has as many elements as there are columns, but the list for at least one later row has a different length, a ValueError is raised.

    i = 0
    def f(x):
        global i
        if i == 0:
            i += 1
            return list(range(3))
        return list(range(4))
    
    df.apply(f, axis=1) 
    ValueError: Shape of passed values is (5, 4), indices imply (5, 3)
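The three outcomes above reflect older pandas releases. As far as I know, since pandas 0.23 `apply` with `axis=1` keeps returned lists as a Series of lists regardless of their length, and expansion into columns has to be requested with `result_type='expand'`. A minimal sketch, assuming a reasonably recent pandas (the frame and the lambdas are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, (5, 3)), columns=['a', 'b', 'c'])

# Even when the list length matches the number of columns,
# recent pandas keeps each list as a single object in a Series.
as_lists = df.apply(lambda row: list(range(3)), axis=1)

# Expansion into separate columns must be requested explicitly.
expanded = df.apply(lambda row: list(range(3)), axis=1, result_type='expand')

print(type(as_lists).__name__)
print(expanded.shape)
```

With `result_type='expand'` the new columns are labeled by position rather than with the original column names.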
    

    Answering the problem without apply

    Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.

    Create a larger dataframe (df here is the question's dataframe with col_1 and col_2):

    df1 = df.sample(100000, replace=True).reset_index(drop=True)
    

    Timings

    # apply is slow with axis=1
    %timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
    2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # zip - similar to @Thomas
    %timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]  
    29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    @Thomas's answer

    %timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
    34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
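The `%timeit` lines above depend on an IPython session and on `df`, `mylist` and `get_sublist` from elsewhere in the thread. A self-contained sketch of the same comparison, assuming the question's sample data and a smaller sample size so it finishes quickly (the helper names `with_apply` and `with_zip` are mine, and absolute timings will vary by machine):

```python
import timeit

import pandas as pd

mylist = ['a', 'b', 'c', 'd', 'e', 'f']
df = pd.DataFrame({'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
df1 = df.sample(10_000, replace=True, random_state=0).reset_index(drop=True)

def with_apply():
    # Row-wise apply: builds a Series object for every row.
    return df1.apply(lambda x: mylist[x['col_1']: x['col_2'] + 1], axis=1).tolist()

def with_zip():
    # Plain Python iteration over the two columns in parallel.
    return [mylist[v1:v2 + 1] for v1, v2 in zip(df1['col_1'], df1['col_2'])]

same = with_apply() == with_zip()  # both approaches should agree
t_apply = timeit.timeit(with_apply, number=1)
t_zip = timeit.timeit(with_zip, number=1)
print(f"same result: {same}, apply: {t_apply:.3f} s, zip: {t_zip:.3f} s")
```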
    
  • 2020-11-22 06:25

    An interesting question! My answer is below:

    import pandas as pd
    
    def sublst(row):
        # Slice lst using the row's J1 (start) and J2 (end, exclusive)
        return lst[row['J1']:row['J2']]
    
    df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
    print(df)
    lst = ['a','b','c','d','e','f']
    
    df['J3'] = df.apply(sublst,axis=1)
    print(df)
    

    Output:

      ID  J1  J2
    0  1   0   1
    1  2   2   4
    2  3   3   5
      ID  J1  J2      J3
    0  1   0   1     [a]
    1  2   2   4  [c, d]
    2  3   3   5  [d, e]
    

    I changed the column names to ID, J1, J2, J3 so that they sort as ID < J1 < J2 < J3 and the columns display in the right order.

    A more concise version:

    import pandas as pd
    
    df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
    print(df)
    lst = ['a','b','c','d','e','f']
    
    df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
    print(df)
    
  • 2020-11-22 06:32

    I'm going to put in a vote for np.vectorize. It lets you pass any number of columns straight to the function without handling the dataframe inside it, so it's great for functions you don't control, or for cases like sending two columns and a constant into a function (e.g. col_1, col_2, 'foo').

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
    mylist = ['a','b','c','d','e','f']
    
    def get_sublist(sta,end):
        return mylist[sta:end+1]
    
    #df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
    # expect above to output df as below 
    
    df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"])(df['col_1'], df['col_2'])
    
    
    df
    
      ID  col_1  col_2      col_3
    0  1      0      1     [a, b]
    1  2      2      4  [c, d, e]
    2  3      3      5  [d, e, f]
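The "two columns and a constant" case mentioned in this answer is not shown in its code. A minimal sketch of it, assuming a hypothetical three-argument function `label_range` (not from the thread); `np.vectorize` broadcasts the scalar `'foo'` against the two columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def label_range(start, end, tag):
    # Hypothetical function taking two per-row values and one constant.
    return f"{tag}:{start}-{end}"

# otypes=[object] avoids a trial call on the first element to guess the dtype.
vec = np.vectorize(label_range, otypes=[object])
df['col_3'] = vec(df['col_1'], df['col_2'], 'foo')

print(df['col_3'].tolist())
```

Note that `np.vectorize` is a convenience wrapper around a Python loop, so it should not be expected to match true vectorized NumPy operations in speed.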
    
  • 2020-11-22 06:33

    Here's an example that calls apply on the dataframe with axis=1.

    The difference is that, instead of trying to pass two values to the function f, you rewrite the function to accept a pandas Series object (one row) and index that Series to get the values you need.

    In [49]: df
    Out[49]: 
              0         1
    0  1.000000  0.000000
    1 -0.494375  0.570994
    2  1.000000  0.000000
    3  1.876360 -0.229738
    4  1.000000  0.000000
    
    In [50]: def f(x):    
       ....:  return x[0] + x[1]  
       ....:  
    
    In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
    Out[51]: 
    0    1.000000
    1    0.076619
    2    1.000000
    3    1.646622
    4    1.000000
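If the two-argument function should stay unchanged, a small lambda can unpack the row and forward the values instead. A minimal sketch using the sample data and `get_sublist` from other answers in this thread, assuming a pandas version recent enough that lists returned with `axis=1` are kept as a Series of lists:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    # Unchanged two-argument function; the end index is inclusive.
    return mylist[sta:end + 1]

# The lambda receives each row as a Series and forwards two of its values.
df['col_3'] = df.apply(lambda row: get_sublist(row['col_1'], row['col_2']), axis=1)

print(df)
```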
    

    Depending on your use case, it is sometimes helpful to create a pandas groupby object and then use apply on each group.
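A sketch of the groupby variant, assuming a made-up frame with a `key` column and a hypothetical `weighted_sum` helper (neither is from the thread). Selecting the two value columns before `apply` keeps the grouping column out of the function's input:

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [10, 20, 30, 40],
})

def weighted_sum(group):
    # `group` is a sub-DataFrame holding the rows of one key.
    return (group['x'] * group['y']).sum()

result = df.groupby('key')[['x', 'y']].apply(weighted_sum)
print(result)
```

Here the function combines two columns within each group rather than within each row, which fits per-group aggregates better than `axis=1`.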

  • 2020-11-22 06:33

    I'm sure this isn't as fast as the solutions using Pandas or Numpy operations, but if you don't want to rewrite your function you can use map. Using the original example data:

    import pandas as pd
    
    df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
    mylist = ['a','b','c','d','e','f']
    
    def get_sublist(sta,end):
        return mylist[sta:end+1]
    
    df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
    # In Python 2, map already returns a list, so the list() call is not needed
    

    We can pass as many arguments as we want into the function this way, one iterable per argument. The output is what we wanted:

      ID  col_1  col_2      col_3
    0  1      0      1     [a, b]
    1  2      2      4  [c, d, e]
    2  3      3      5  [d, e, f]
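To illustrate passing more than two arguments with `map`, here is a sketch with a hypothetical three-argument variant, `get_sublist_step` (not from the thread). The constant third argument is supplied with `itertools.repeat`, since `map` stops at the shortest iterable:

```python
from itertools import repeat

import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist_step(sta, end, step):
    # Hypothetical variant: inclusive end index plus a slice step.
    return mylist[sta:end + 1:step]

# One iterable per function argument; repeat(2) yields the constant step forever.
df['col_3'] = list(map(get_sublist_step, df['col_1'], df['col_2'], repeat(2)))

print(df['col_3'].tolist())
```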
    
  • 2020-11-22 06:36

    My example for your question:

    def get_sublist(row, col1, col2):
        return mylist[row[col1]:row[col2]+1]
    df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')
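The snippet above assumes `df` and `mylist` already exist. A self-contained version, using the sample data from other answers in this thread; it relies on `DataFrame.apply` forwarding extra keyword arguments to the function and assumes a pandas version that keeps the returned lists as a Series of lists:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(row, col1, col2):
    # `row` is one row as a Series; col1 and col2 name the columns to read.
    return mylist[row[col1]:row[col2] + 1]

# Keyword arguments other than apply's own are passed through to get_sublist.
df['col_3'] = df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')

print(df)
```

Passing the column names as parameters keeps the function reusable for other column pairs.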
    