Pandas add column with value based on condition based on other columns

我寻月下人不归 2021-01-30 04:41

I have the following pandas dataframe:

import pandas as pd
import numpy as np

d = {'age' : [21, 45, 45, 5],
     'salary' : [20, 40, 10, 100]}

df = pd.DataFrame(d)


        
1 answer
  •  南笙 (original poster)
    2021-01-30 05:06

    Use the timeits, Luke!

    Conclusion
    List comprehensions perform best on smaller amounts of data because they incur very little overhead, even though they are not vectorized. On larger data, on the other hand, loc and numpy.where perform better: vectorisation wins the day.

    Keep in mind that the applicability of a method depends on your data, the number of conditions, and the data type of your columns. My suggestion is to test various methods on your data before settling on an option.
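One way to follow that suggestion is a small timing harness built on the standard-library timeit module. This is only a sketch: the helper name time_methods, the random salary data, and the row counts are assumptions made for illustration, not the setup behind the conclusions above.

```python
import timeit

import numpy as np
import pandas as pd


def numpy_where(df):
    # Vectorised: the comparison and selection run over the whole column at once.
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))


def list_comp(df):
    # Plain Python loop over the column values.
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])


def time_methods(n_rows, repeats=5, number=10):
    """Return the best observed seconds per call for each method (hypothetical helper)."""
    rng = np.random.default_rng(0)  # fixed seed so every run times the same data
    df = pd.DataFrame({'salary': rng.integers(0, 100, size=n_rows)})
    results = {}
    for func in (numpy_where, list_comp):
        timer = timeit.Timer(lambda: func(df))
        # Take the minimum over several repeats to reduce noise from other processes.
        results[func.__name__] = min(timer.repeat(repeat=repeats, number=number)) / number
    return results


if __name__ == '__main__':
    # Illustrative sizes only; substitute sizes that resemble your own data.
    for n in (10, 1_000, 10_000):
        print(n, time_methods(n))
```

Absolute timings depend on the machine and on the pandas and NumPy versions, so compare the methods against each other rather than against published figures.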

    One sure takeaway from here, however, is that list comprehensions are pretty competitive: they're implemented in C and are highly optimised for performance.


    Benchmarking code, for reference. Here are the functions being timed:

    def numpy_where(df):
      return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))
    
    def list_comp(df):
      return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])
    
    def loc(df):
      df = df.assign(is_rich='no')
      df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'
      return df
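
Before comparing speed, it may be worth confirming that the three functions produce the same column. The sketch below applies them to the sample data from the question. It assumes the same `>= 50` threshold in all three functions, and the `results` dictionary is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'age': [21, 45, 45, 5],
                   'salary': [20, 40, 10, 100]})


def numpy_where(df):
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))


def list_comp(df):
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])


def loc(df):
    # assign() returns a new frame, so the caller's df is not modified.
    df = df.assign(is_rich='no')
    df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'  # assumption: same threshold as above
    return df


# Collect the new column from each method as a plain list for easy comparison.
results = {f.__name__: f(df)['is_rich'].tolist() for f in (numpy_where, list_comp, loc)}

# All three methods should label the rows identically.
assert results['numpy_where'] == results['list_comp'] == results['loc']
print(results['loc'])  # ['no', 'no', 'no', 'yes']
```

Only the last row has a salary of at least 50, so it is the only one labelled 'yes'. A check like this catches boundary differences, such as `>` versus `>=`, that would otherwise make a benchmark compare functions that do not compute the same thing.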
    
