pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

广开言路 2020-11-22 06:24

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian,

5 Answers
  • 2020-11-22 06:25

    Try this:

    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'] = df['race_label'].fillna('Other')
    

    Output:

         lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
    0      MOST    JEFF      E             0          0             0   
    1    CRUISE     TOM      E             0          0             0   
    2      DEPP  JOHNNY    NaN             0          0             0   
    3     DICAP     LEO    NaN             0          0             0   
    4    BRANDO  MARLON      E             0          0             0   
    5     HANKS     TOM    NaN             0          0             0   
    6    DENIRO  ROBERT      E             0          1             0   
    7    PACINO      AL      E             0          0             0   
    8  WILLIAMS   ROBIN      E             0          0             1   
    9  EASTWOOD   CLINT      E             0          0             0   
    
       eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
    0             0             0          1       White         White  
    1             1             0          0       White      Hispanic  
    2             0             0          1     Unknown         White  
    3             0             0          1     Unknown         White  
    4             0             0          0       White         Other  
    5             0             0          1     Unknown         White  
    6             0             0          1       White   Two Or More  
    7             0             0          1       White         White  
    8             0             0          0       White  Haw/Pac Isl.  
    9             0             0          1       White         White 
    

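    Use .loc instead of apply, because .loc assignments are vectorized.

    .loc works in a simple manner: it masks the rows that satisfy the condition and assigns the value only to those rows. Later assignments overwrite earlier ones, so the rules are listed from lowest to highest priority, the reverse of the order in the if-ladder.

    For more details, see the .loc docs.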
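
    To make the overwrite order concrete, here is a minimal sketch on a hypothetical two-flag frame (the data is made up for illustration, not taken from the question). The lower-priority rule is written first and the higher-priority rule last:

```python
import pandas as pd

# Hypothetical two-flag frame, not the asker's data.
df = pd.DataFrame({"eri_hispanic": [0, 1, 1], "eri_white": [1, 1, 0]})

# Lowest-priority rule first: every row flagged white gets 'White'.
df.loc[df["eri_white"] == 1, "race_label"] = "White"
# Highest-priority rule last: it overwrites rows already labelled 'White'.
df.loc[df["eri_hispanic"] == 1, "race_label"] = "Hispanic"

print(df["race_label"].tolist())  # ['White', 'Hispanic', 'Hispanic']
```

    The middle row matches both masks and ends up as 'Hispanic' because that assignment runs last.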

    Performance metrics:

    Accepted Answer:

    def label_race(row):
        if row['eri_hispanic'] == 1:
            return 'Hispanic'
        if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1:
            return 'Two Or More'
        if row['eri_nat_amer'] == 1:
            return 'A/I AK Native'
        if row['eri_asian'] == 1:
            return 'Asian'
        if row['eri_afr_amer'] == 1:
            return 'Black/AA'
        if row['eri_hawaiian'] == 1:
            return 'Haw/Pac Isl.'
        if row['eri_white'] == 1:
            return 'White'
        return 'Other'
    
    df=pd.read_csv('dataser.csv')
    df = pd.concat([df]*1000)
    
    %timeit df.apply(lambda row: label_race(row), axis=1)
    

    1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    My Proposed Answer:

    def label_race(df):
        df.loc[df['eri_white']==1,'race_label'] = 'White'
        df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
        df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
        df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
        df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
        df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
        df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
        df['race_label'] = df['race_label'].fillna('Other')

    df = pd.read_csv('s22.csv')
    df = pd.concat([df]*1000)
    
    %timeit label_race(df)
    

    24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
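
    The timings above use the IPython %timeit magic and only compare speed. As a rough sketch of how one might check outside IPython that the two styles also agree on the labels, the example below uses the standard timeit module on a hypothetical three-flag frame. The names label_row and label_masks, the reduced set of columns, and the data are all assumptions made for illustration; the mask version returns a new Series instead of modifying the frame.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for the CSV used above: three flag columns only.
df = pd.DataFrame({
    "eri_hispanic": [0, 1, 0, 0],
    "eri_asian":    [0, 0, 1, 0],
    "eri_white":    [1, 0, 1, 0],
})
df = pd.concat([df] * 1000, ignore_index=True)

def label_row(row):
    # Row-wise version: rules checked from highest to lowest priority.
    if row["eri_hispanic"] == 1:
        return "Hispanic"
    if row["eri_asian"] + row["eri_white"] > 1:
        return "Two Or More"
    if row["eri_asian"] == 1:
        return "Asian"
    if row["eri_white"] == 1:
        return "White"
    return "Other"

def label_masks(frame):
    # Mask version: rules applied from lowest to highest priority,
    # so that later assignments overwrite earlier ones.
    out = pd.Series("Other", index=frame.index)
    out.loc[frame["eri_white"] == 1] = "White"
    out.loc[frame["eri_asian"] == 1] = "Asian"
    out.loc[(frame["eri_asian"] + frame["eri_white"]) > 1] = "Two Or More"
    out.loc[frame["eri_hispanic"] == 1] = "Hispanic"
    return out

by_apply = df.apply(label_row, axis=1)
by_masks = label_masks(df)
same = by_apply.tolist() == by_masks.tolist()
print(same)

# Total time for three runs of each approach; numbers vary by machine.
t_apply = timeit.timeit(lambda: df.apply(label_row, axis=1), number=3)
t_masks = timeit.timeit(lambda: label_masks(df), number=3)
print(f"apply: {t_apply:.3f} s, masks: {t_masks:.3f} s")
```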

  • 2020-11-22 06:30

    OK, there are two steps to this. First, write a function that does the translation you want. I've put an example together based on your pseudo-code:

    def label_race(row):
        if row['eri_hispanic'] == 1:
            return 'Hispanic'
        if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1:
            return 'Two Or More'
        if row['eri_nat_amer'] == 1:
            return 'A/I AK Native'
        if row['eri_asian'] == 1:
            return 'Asian'
        if row['eri_afr_amer'] == 1:
            return 'Black/AA'
        if row['eri_hawaiian'] == 1:
            return 'Haw/Pac Isl.'
        if row['eri_white'] == 1:
            return 'White'
        return 'Other'
    

    You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".

    Next, use the apply function in pandas to apply the function - e.g.

    df.apply(lambda row: label_race(row), axis=1)
    

    Note the axis=1 specifier: it means the function is applied to each row rather than to each column. The results are here:

    0           White
    1        Hispanic
    2           White
    3           White
    4           Other
    5           White
    6     Two Or More
    7           White
    8    Haw/Pac Isl.
    9           White
    

    If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.

    df['race_label'] = df.apply(lambda row: label_race(row), axis=1)
    

    The resultant dataframe looks like this (scroll to the right to see the new column):

          lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
    0      MOST    JEFF      E             0          0             0              0             0          1       White         White
    1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
    2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
    3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
    4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
    5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
    6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
    7    PACINO      AL      E             0          0             0              0             0          1       White         White
    8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
    9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White
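
    As a small illustration of the axis argument used in this answer, the sketch below contrasts axis=1 with the default axis=0. The two-column frame is hypothetical and only there to keep the output easy to follow:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# axis=1: the function receives one row at a time, indexed by column name.
per_row = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(per_row.tolist())     # [4, 6]

# axis=0 (the default): the function receives one column at a time.
per_column = df.apply(lambda col: col.sum())
print(per_column.tolist())  # [3, 7]
```

    With axis=1 the result has one value per row, which is what a new column needs; with axis=0 it has one value per column.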
    
  • 2020-11-22 06:37

    The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


    First, define conditions:

    conditions = [
        df['eri_hispanic'] == 1,
        df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
        df['eri_nat_amer'] == 1,
        df['eri_asian'] == 1,
        df['eri_afr_amer'] == 1,
        df['eri_hawaiian'] == 1,
        df['eri_white'] == 1,
    ]
    

    Now, define the corresponding outputs:

    outputs = [
        'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ]
    

    Finally, using numpy.select:

    import numpy as np

    res = np.select(conditions, outputs, 'Other')
    pd.Series(res)
    

    0           White
    1        Hispanic
    2           White
    3           White
    4           Other
    5           White
    6     Two Or More
    7           White
    8    Haw/Pac Isl.
    9           White
    dtype: object
    

    Why should numpy.select be used over apply? Here are some performance checks:

    df = pd.concat([df]*1000)
    
    In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
    1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [44]: %%timeit
        ...: conditions = [
        ...:     df['eri_hispanic'] == 1,
        ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
        ...:     df['eri_nat_amer'] == 1,
        ...:     df['eri_asian'] == 1,
        ...:     df['eri_afr_amer'] == 1,
        ...:     df['eri_hawaiian'] == 1,
        ...:     df['eri_white'] == 1,
        ...: ]
        ...:
        ...: outputs = [
        ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
        ...: ]
        ...:
        ...: np.select(conditions, outputs, 'Other')
        ...:
        ...:
    3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Using numpy.select is vastly faster here (3.09 ms versus 1.07 s on the 10-row frame repeated 1000 times, roughly 350 times faster), and the absolute gap widens as the data grows.
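
    One detail worth spelling out: when several conditions are true for the same row, numpy.select takes the output of the first one in the list, which is why the conditions follow the same order as the if-ladder. A minimal sketch on a hypothetical two-flag frame (made-up data), which also stores the result as a new column:

```python
import numpy as np
import pandas as pd

# Hypothetical two-flag frame, not the asker's data.
df = pd.DataFrame({"eri_hispanic": [0, 1, 1, 0], "eri_white": [1, 1, 0, 0]})

conditions = [
    df["eri_hispanic"] == 1,   # listed first, so it has the highest priority
    df["eri_white"] == 1,
]
outputs = ["Hispanic", "White"]

# Where several conditions are True, the first one in the list wins;
# rows matching none of them get the default.
df["race_label"] = np.select(conditions, outputs, default="Other")

print(df["race_label"].tolist())  # ['White', 'Hispanic', 'Hispanic', 'Other']
```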

  • 2020-11-22 06:43

    .apply() takes a function as its first parameter, so pass in the label_race function directly, like so:

    df['race_label'] = df.apply(label_race, axis=1)
    

    You don't need to make a lambda function to pass in a function.
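
    A quick sketch of the point, assuming a hypothetical helper add_columns and a made-up two-column frame: passing the function directly gives the same result as wrapping it in a lambda, and extra keyword arguments given to apply are forwarded to the function.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def add_columns(row, offset=0):
    # Hypothetical helper: sums two columns of one row, plus an offset.
    return row["a"] + row["b"] + offset

direct = df.apply(add_columns, axis=1)
wrapped = df.apply(lambda row: add_columns(row), axis=1)
print(direct.equals(wrapped))  # True

# Extra keyword arguments are forwarded to the function.
shifted = df.apply(add_columns, axis=1, offset=10)
print(shifted.tolist())  # [14, 16]
```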

  • 2020-11-22 06:44

    Since this is the first Google result for 'pandas new column from others', here's a simple example:

    import pandas as pd
    
    # make a simple dataframe
    df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
    df
    #    a  b
    # 0  1  3
    # 1  2  4
    
    # create an unattached column with an index
    df.apply(lambda row: row.a + row.b, axis=1)
    # 0    4
    # 1    6
    
    # do same but attach it to the dataframe
    df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
    df
    #    a  b  c
    # 0  1  3  4
    # 1  2  4  6
    

    If you get the SettingWithCopyWarning you can do it this way also:

    fn = lambda row: row.a + row.b # define a function for the new column
    col = df.apply(fn, axis=1) # get column data with an index
    df = df.assign(c=col.values) # assign values to column 'c'
    

    Source: https://stackoverflow.com/a/12555510/243392

    And if your column name includes spaces you can use syntax like this:

    df = df.assign(**{'some column name': col.values})
    

    See also the pandas documentation for apply and assign.
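
    For plain arithmetic such as a + b, the row-wise apply is not strictly needed. As a sketch on the same toy frame, the column-wise form below gives the same values, and assign accepts the same expression (the column name 'a plus b' is just an example of a name with spaces):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Same sum as the row-wise apply, computed on whole columns at once.
df["c"] = df["a"] + df["b"]

# assign returns a new frame; the ** form allows names with spaces.
df = df.assign(**{"a plus b": df["a"] + df["b"]})

print(df["c"].tolist())         # [4, 6]
print(df["a plus b"].tolist())  # [4, 6]
```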
