Checking whether a data frame is a copy or a view in pandas

别跟我提以往 2020-12-07 16:38

Is there an easy way to check whether two data frames are different copies or views of the same underlying data that doesn't involve manipulations? I'm trying to get a gri

3 Answers
  • 2020-12-07 17:21

    I've elaborated on this example with pandas 1.0.1. There's not only a boolean _is_view attribute, but also _is_copy which can be None or a reference to the original DataFrame:

    import pandas as pd
    df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
            columns = ['a','b','c','d'])
    df2 = df.iloc[0:2, :]
    df3 = df.loc[df['a'] == 1, :]
    
    # df is neither copy nor view
    df._is_view, df._is_copy
    Out[1]: (False, None)
    
    # df2 is a view AND a copy
    df2._is_view, df2._is_copy
    Out[2]: (True, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
    
    # df3 is not a view, but a copy
    df3._is_view, df3._is_copy
    Out[3]: (False, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
    

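    So checking these two attributes should tell you not only whether you're dealing with a view, but also whether you have a copy or an "original" DataFrame. Both are private attributes (note the leading underscore), so their behavior may differ between pandas versions.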

    See also this thread for a discussion explaining why you can't always predict whether your code will return a view or not.

  • 2020-12-07 17:29

    You might trace the memory your pandas/Python environment is consuming and, on the assumption that a copy will use more memory than a view, decide one way or the other.

    I believe there are libraries that report memory usage from within the Python environment itself, e.g. Heapy/Guppy.

    There ought to be a metric you can apply: take a baseline snapshot of memory usage before creating the object under inspection, then another snapshot afterwards. Comparing the two (assuming nothing else has been created in between, so the change can be attributed to the new object) should give an idea of whether a view or a copy has been produced.

    We'd need to get an idea of the different memory profiles of each type of implementation, but some experimentation should yield results.
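
    The before/after comparison described above could be sketched roughly as follows. This uses the standard-library tracemalloc module in place of Heapy/Guppy, and the helper name traced_growth, the frame size, and the 500,000-row slice are made up for illustration. It assumes that NumPy reports its buffer allocations to tracemalloc, which I believe recent NumPy releases do.

```python
import tracemalloc

import numpy as np
import pandas as pd


def traced_growth(make_object):
    """Build an object and return it with the traced bytes allocated meanwhile.

    Hypothetical helper: it starts and stops tracemalloc itself, so it is
    only meant for a quick experiment, not for use inside a traced program.
    """
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    obj = make_object()  # keep a reference so the memory stays allocated
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return obj, after - before


# A single-dtype frame large enough that a copy is easy to spot (about 32 MB).
df = pd.DataFrame(np.zeros((1_000_000, 4)), columns=list("abcd"))

# Row slice: expected to reuse the parent's buffer.
sliced, slice_growth = traced_growth(lambda: df.iloc[:500_000, :])

# Explicit copy of the same rows: expected to allocate a fresh buffer.
copied, copy_growth = traced_growth(lambda: df.iloc[:500_000, :].copy())

print(f"slice allocated {slice_growth} bytes, copy allocated {copy_growth} bytes")
```

    The explicit copy should account for at least the size of the copied data (500,000 rows x 4 columns x 8 bytes), while the slice should only add a small amount for the new Python objects. For small frames the difference can disappear into the noise, so this is more of an experiment than a reliable test.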

  • 2020-12-07 17:32

    Answers from HYRY and Marius in comments!

    One can check either by:

    • comparing the identity of the values.base attribute rather than the values attribute, as in:

      df.values.base is df2.values.base instead of df.values is df2.values.

    • or using the (admittedly internal) _is_view attribute (df2._is_view is True).
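
    A related check that avoids private attributes is to ask NumPy whether the two frames' buffers overlap. This is only a minimal sketch: the helper name shares_data is hypothetical, and it assumes a single-dtype frame, since for mixed dtypes to_numpy() has to build a new array and the comparison says nothing useful.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=["row1", "row2"], columns=["a", "b", "c", "d"])
df2 = df.iloc[0:2, :]          # row slice, as in the first answer
df3 = df.loc[df["a"] == 1, :]  # boolean mask, as in the first answer
df4 = df.copy()                # explicit copy


def shares_data(left, right):
    """Return True if the frames' underlying NumPy buffers overlap.

    Hypothetical helper; assumes both frames hold a single dtype so that
    to_numpy() can return the existing buffer instead of building a new one.
    """
    return np.shares_memory(left.to_numpy(), right.to_numpy())


slice_shares = shares_data(df, df2)
mask_shares = shares_data(df, df3)
copy_shares = shares_data(df, df4)
print(slice_shares, mask_shares, copy_shares)
```

    Here the explicit copy and the boolean-mask selection should report no shared memory, while the row slice is expected to share the parent's buffer.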

    Thanks everyone!
