Pandas: Knowing when an operation affects the original dataframe

孤城傲影 2020-12-23 02:04

I love pandas and have been using it for years and feel pretty confident I have a good handle on how to subset dataframes and deal with views vs copies appropriately (though

3 Answers
  • 2020-12-23 02:25

    This is a somewhat confusing and even frustrating part of pandas, but for the most part you shouldn't really have to worry about this if you follow some simple workflow rules. In particular, note that there are only two general cases here when you have two dataframes, with one being a subset of the other.

    This is a case where the Zen of Python rule "explicit is better than implicit" is a great guideline to follow.

    Case A: Changes to df2 should NOT affect df1

    This is trivial, of course. You want two completely independent dataframes so you just explicitly make a copy:

    df2 = df1.copy()
    

    After this, anything you do to df2 affects only df2 and not df1, and vice versa.
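    As a minimal sketch of Case A (the sample data and column names here are made up for illustration and are not from the question), the independence of the copy can be checked directly:

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df1 = pd.DataFrame({"A": [1, 20, 3], "B": [4, 5, 6]})

df2 = df1.copy()        # copy() makes a deep copy by default
df2.loc[0, "B"] = 100   # modify the copy only

print(df1.loc[0, "B"])  # the original value is untouched
print(df2.loc[0, "B"])  # the copy holds the new value
```

    Changing df1 afterwards leaves df2 untouched in the same way.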

    Case B: Changes to df2 should ALSO affect df1

    In this case I don't think there is one general way to solve the problem because it depends on exactly what you're trying to do. However, there are a couple of standard approaches that are pretty straightforward and should not have any ambiguity about how they are working.

    Method 1: Copy df1 to df2, then use df2 to update df1

    In this case, you can basically do a one to one conversion of the examples above. Here's example #2:

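    import pandas as pd

    df2 = df1.query('A < 10').copy()   # explicit copy, independent of df1
    df2.iloc[0,1] = 100
    
    # df2 comes first, so drop_duplicates keeps its modified rows.
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    df1 = pd.concat([df2, df1]).reset_index().drop_duplicates(subset='index').drop(columns='index')
    

    Unfortunately, re-merging by concatenating and de-duplicating is a bit verbose, and it does not preserve the original row order or index. You can do it more cleanly with the following, although it can have the side effect of converting integers to floats.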

    df1.update(df2)   # note that this is an inplace operation
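    A self-contained sketch of the update route, using made-up sample data (not the data from the question). Whether the integer columns end up as floats may depend on the pandas version, so the example prints the dtypes rather than assuming them:

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df1 = pd.DataFrame({"A": [1, 20, 3], "B": [4, 5, 6]})

df2 = df1.query("A < 10").copy()  # rows with index labels 0 and 2
df2.iloc[0, 1] = 100              # first row of df2, column "B"

df1.update(df2)  # in place; aligns on index and column labels

print(df1)
print(df1.dtypes)  # check whether integer columns were upcast to float
```

    Because update aligns on labels, the row of df1 that is missing from df2 (index 1 here) keeps its values.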
    

    Method 2: Use a mask (don't create df2 at all)

    I think the best general approach here is not to create df2 at all, but rather to treat it as a masked version of df1. Unfortunately, the code above cannot be translated directly because it mixes loc and iloc, which is fine for this example but probably unrealistic for actual use.

    The advantage is that you can write very simple and readable code. Here's an alternative version of example #2 above where df2 is actually just a masked version of df1. Instead of changing a value via iloc, I'll change it wherever column "C" == 10.

    df2_mask = df1['A'] < 10
    df1.loc[ df2_mask & (df1['C'] == 10), 'B'] = 100
    

    Now if you print df1 or df1[df2_mask] you will see that column "B" = 100 for the first row of each dataframe. Obviously this is not very surprising here, but that's the inherent advantage of following "explicit is better than implicit".
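    For a version that runs on its own, here is the same mask idea with a small made-up df1 (the values are assumptions chosen so that only the first row matches both conditions):

```python
import pandas as pd

# Hypothetical sample data; only row 0 has A < 10 and C == 10.
df1 = pd.DataFrame({"A": [1, 20, 3], "B": [4, 5, 6], "C": [10, 10, 30]})

df2_mask = df1["A"] < 10                          # plays the role of "df2"
df1.loc[df2_mask & (df1["C"] == 10), "B"] = 100   # write straight into df1

print(df1)            # full frame, with the change applied
print(df1[df2_mask])  # the "df2" view of the same data
```

    Since every write goes through df1.loc, there is only one dataframe to reason about.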

  • 2020-12-23 02:25

    I have the same question, and I searched for an answer in the past without success. So now I just verify that the original is not changing and put this piece of code at the beginning of the program to suppress the warnings:

     import pandas as pd
     pd.options.mode.chained_assignment = None  # default='warn'
    
  • 2020-12-23 02:42

    You only need to replace .iloc[0,1] with .iat[0,1].

    More generally, if you want to modify only one element, you should use the .iat or .at accessor. When you are modifying several elements at once, you should use .loc or .iloc instead.

    Done this way, pandas shouldn't raise any warning.
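    A short sketch of the accessors mentioned above, on made-up data. The explicit .copy() is an addition here, not part of this answer, so that the example does not depend on view-versus-copy behaviour:

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df1 = pd.DataFrame({"A": [1, 20, 3], "B": [4, 5, 6]})
df2 = df1.query("A < 10").copy()  # rows with index labels 0 and 2

df2.iat[0, 1] = 100   # single element by position (row 0, column 1)
df2.at[2, "B"] = 60   # single element by label (index 2, column "B")

df2.loc[df2["A"] < 10, "A"] = 0   # several elements at once via .loc

print(df2)
```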
