Pandas fill missing values in dataframe from another dataframe

前端 未结 5 823
遥遥无期
遥遥无期 2020-11-28 12:10

I cannot find a pandas function (which I had seen before) to substitute the NaN\'s in a dataframe with values from another dataframe (assuming a common index which can be sp

相关标签:
5条回答
  • 2020-11-28 12:26

    If you have two DataFrames of the same shape, then:

    df[df.isnull()] = d2
    

    Will do the trick.

    visual representation

    Only locations where df.isnull() evaluates to True (highlighted in green) will be eligible for assignment.

    In practice, the DataFrames aren't always the same size / shape, and transforming methods (especially .shift()) are useful.

    Data coming in is invariably dirty, incomplete, or inconsistent. Par for the course. There's a pretty extensive pandas tutorial and associated cookbook for dealing with these situations.

    0 讨论(0)
  • 2020-11-28 12:26

    A dedicated method for this is DataFrame.update:

    Quoted from the documentation:

    Modify in place using non-NA values from another DataFrame.
    Aligns on indices. There is no return value.

    Important to note is that this method will modify your data inplace. So it will overwrite your updated dataframe.

    Example:

    print(df1)
           A    B     C
    aaa  NaN  1.0   NaN
    bbb  NaN  NaN  10.0
    ccc  3.0  NaN   6.0
    ffffd  NaN  NaN   NaN
    eee  NaN  NaN   NaN
    
    print(df2)
             A    B     C
    index                
    aaa    1.0  1.0   NaN
    bbb    NaN  NaN  10.0
    eee    NaN  1.0   NaN
    
    # update df1 NaN where there are values in df2
    df1.update(df2)
    print(df1)
           A    B     C
    aaa  1.0  1.0   NaN
    bbb  NaN  NaN  10.0
    ccc  3.0  NaN   6.0
    ffffd  NaN  NaN   NaN
    eee  NaN  1.0   NaN
    

    Notice the updated NaN values at intersect aaa, A and eee, B

    0 讨论(0)
  • 2020-11-28 12:36

    This should be as simple as

    df.fillna(d2)
    
    0 讨论(0)
  • 2020-11-28 12:39

    DataFrame.combine_first() answers this question exactly.

    However, sometimes you want to fill/replace/overwrite some of the non-missing (non-NaN) values of DataFrame A with values from DataFrame B. That question brought me to this page, and the solution is DataFrame.mask()

    A = B.mask(condition, A)
    

    When condition is true, the values from A will be used, otherwise B's values will be used.

    For example, you could solve the OP's original question with mask such that when an element from A is non-NaN, use it, otherwise use the corresponding element from B.

    But using DataFrame.mask() you could replace the values of A that fail to meet arbitrary criteria (less than zero? more than 100?) with values from B. So mask is more flexible, and overkill for this problem, but I thought it was worthy of mention (I needed it to solve my problem).

    It's also important to note that B could be a numpy array instead of a DataFrame. DataFrame.combine_first() requires that B be a DataFrame, but DataFrame.mask() just requires that B's is an NDFrame and its dimensions match A's dimensions.

    0 讨论(0)
  • 2020-11-28 12:40

    As I just learned, there is a DataFrame.combine_first() method, which does precisely this, with the additional property that if your updating data frame d2 is bigger than your original df, the additional rows and columns are added, as well.

    df = df.combine_first(d2)
    
    0 讨论(0)
提交回复
热议问题