How to deal with SettingWithCopyWarning in Pandas?

Anonymous (unverified), submitted 2019-12-03 01:12:01

Question:

Background

I just upgraded pandas from 0.11 to 0.13.0rc1. Now the application emits many new warnings. One of them looks like this:

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE

What exactly does it mean? Do I need to change something?

How should I suppress the warning if I insist on using quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE?

The function that triggers the warnings

def _decode_stock_quote(list_of_150_stk_str):
    """decode the webpage and return dataframe"""
    from cStringIO import StringIO
    str_of_all = "".join(list_of_150_stk_str)
    quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
    quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
    quote_df['TClose'] = quote_df['TPrice']
    quote_df['RT']     = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
    quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
    quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
    quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
    quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
    quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])
    return quote_df

More warning messages

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

Answer 1:

From what I gather, SettingWithCopyWarning was created to flag potentially confusing "chained" assignments, such as the following, which don't always work as expected, particularly when the first selection returns a copy. [see GH5390 and GH5597 for background discussion.]

df[df['A'] > 2]['B'] = new_val  # new_val not set in df 

The warning offers a suggestion to rewrite as follows:

df.loc[df['A'] > 2, 'B'] = new_val 
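To make the difference concrete, here is a small sketch on a toy frame. The column names and new_val are placeholders I made up for illustration, and the warnings filter is only there to keep the demo quiet, since the exact warning emitted depends on your pandas version.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
new_val = 0  # hypothetical replacement value

# Chained form: the assignment lands on a temporary copy, so df is unchanged.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence whatever warning this pandas version emits
    df[df["A"] > 2]["B"] = new_val
after_chained = df["B"].tolist()

# Single .loc call: row mask and column label in one indexing operation.
df.loc[df["A"] > 2, "B"] = new_val
after_loc = df["B"].tolist()

print(after_chained)
print(after_loc)
```

The first print should show B untouched, and the second should show the masked rows updated.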

However, this doesn't fit your usage, which is equivalent to:

df = df[df['A'] > 2]
df['B'] = new_val

While it's clear that you don't care about writes making it back to the original frame (since you overwrote the reference to it), unfortunately this pattern cannot be differentiated from the first chained assignment example, hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you'd like to read further. You can safely disable this new warning with the following assignment.

pd.options.mode.chained_assignment = None  # default='warn' 


Answer 2:

In general, the point of the SettingWithCopyWarning is to show users (and especially new users) that they may be operating on a copy and not on the original, as they think. There are false positives (in other words, you know what you are doing, so it is OK). One possibility is simply to turn off the warning (which is set to warn by default), as @Garrett suggests.

Here is another option, applied per object.

In [1]: df = DataFrame(np.random.randn(5,2),columns=list('AB'))

In [2]: dfa = df.ix[:,[1,0]]

In [3]: dfa.is_copy
Out[3]: True

In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  #!/usr/local/bin/python

You can set the is_copy flag to False, which will effectively turn off the check for that object.

In [5]: dfa.is_copy = False

In [6]: dfa['A'] /= 2

If you explicitly copy, then you know what you are doing, so no further warning will happen.

In [7]: dfa = df.ix[:,[1,0]].copy()

In [8]: dfa['A'] /= 2
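Since .ix no longer exists in current pandas releases, here is a rough equivalent of the explicit-copy session using .iloc. I use fixed values instead of random ones so the result is easy to check. This is only a sketch of the idea, not the exact session above.

```python
import numpy as np
import pandas as pd

# Deterministic toy data: A = 0, 2, 4, 6, 8 and B = 1, 3, 5, 7, 9
df = pd.DataFrame(np.arange(10.0).reshape(5, 2), columns=list("AB"))

# Select the columns by position and take an explicit copy,
# so dfa is clearly an independent frame.
dfa = df.iloc[:, [1, 0]].copy()
dfa["A"] /= 2

print(dfa)
print(df)  # the original frame is left as it was
```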

The code the OP is showing above, while legitimate, and probably something I do as well, is technically a case for this warning, and not a false positive. Another way to avoid the warning would be to do the selection operation via reindex, e.g.

quote_df = quote_df.reindex(columns=['STK',.......]) 

Or,

quote_df = quote_df.reindex(['STK',.......], axis=1) # v.0.21 
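A minimal sketch of the reindex route on a made-up frame follows. The Extra column and the TVOL_SCALE value are assumptions for illustration only, and the real column list would be the ten names from the question.

```python
import pandas as pd

TVOL_SCALE = 100  # hypothetical scale factor

quote_df = pd.DataFrame(
    {"STK": ["s1", "s2"], "Extra": [1, 2], "TVol": [100.0, 200.0]}
)

# reindex returns a new frame holding only the requested columns, in this order
quote_df = quote_df.reindex(columns=["STK", "TVol"])
quote_df["TVol"] = quote_df["TVol"] / TVOL_SCALE

print(quote_df)
```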


Answer 3:

Pandas dataframe copy warning

When you go and do something like this:

quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]] 

.ix in this case returns a new, standalone dataframe.

Any values you decide to change in this dataframe will not change the original dataframe.

This is what pandas tries to warn you about.

Why .ix is a bad idea

The .ix object tries to do more than one thing, and for anyone who has read anything about clean code, this is a strong smell.

Given this dataframe:

df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]}) 

Two behaviors:

dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2

Behavior one: dfcopy is now a standalone dataframe. Changing it will not change df.

df.ix[0, "a"] = 3 

Behavior two: This changes the original dataframe.

Use .loc instead

The pandas developers presumably recognized that the .ix object was quite smelly, and thus created two new objects to help with accessing and assigning data: .loc, and its counterpart .iloc.

.loc avoids the intermediate copy that chained indexing can create, so an assignment made through it is not lost.

.loc is meant to modify your existing dataframe in place, which is more memory efficient.

.loc is predictable, it has one behavior.
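As a rough illustration of the two behaviors with the newer accessors, using the same toy frame as above (.loc takes labels and .iloc takes integer positions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 2, 2]})

# Label-based assignment straight into the original frame
df.loc[0, "a"] = 3

# Position-based assignment (row 1, column 0) into the original frame
df.iloc[1, 0] = 5

# An explicit copy is a standalone frame; editing it leaves df alone
dfcopy = df.loc[:, ["a"]].copy()
dfcopy.loc[0, "a"] = 2

print(df)
print(dfcopy)
```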

The solution

What you are doing in your code example is loading a big file with lots of columns, then modifying it to be smaller.

The pd.read_csv function can help you out with a lot of this and also make the loading of the file a lot faster.

So instead of doing this

quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

Do this

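# read_csv ignores the order of usecols and returns the columns in file order,
# so the names are listed in file order as well
columns = ['STK', 'TOpen', 'TPCLOSE', 'TPrice', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
# header=None because the data has no header row
df = pd.read_csv(StringIO(str_of_all), sep=',', header=None, usecols=[0,1,2,3,4,5,8,9,30,31])
df.columns = columns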

This will only read the columns you are interested in, and name them properly. No need for using the evil .ix object to do magical stuff.
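Here is a tiny self-contained sketch of the usecols idea, with a made-up four-column CSV string standing in for the real feed. One thing to keep in mind: read_csv returns the selected columns in file order regardless of the order given in usecols, so the names should be assigned in file order.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: no header row, four columns per line
raw = "s1,1.0,2.0,3.0\ns2,4.0,5.0,6.0\n"

# Ask for columns 0, 3 and 1; they come back ordered as 0, 1, 3
df = pd.read_csv(StringIO(raw), sep=",", header=None, usecols=[0, 3, 1])
df.columns = ["STK", "TOpen", "TPrice"]  # names given in file order

# df came straight from read_csv, so it is not a slice of another frame
df["TPrice"] = df["TPrice"] / 3

print(df)
```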



Answer 4:

To remove any doubt, my solution was to make a deep copy of the slice instead of a regular copy. This may not be applicable depending on your context (Memory constraints / size of the slice, potential for performance degradation - especially if the copy occurs in a loop like it did for me, etc...)

To be clear, here is the warning I received:

/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Illustration

I suspected that the warning was thrown because of a column I was dropping on a copy of the slice. While not technically trying to set a value in the copy of the slice, that was still a modification of the copy of the slice. Below are the (simplified) steps I took to confirm the suspicion. I hope they will help those of us who are trying to understand the warning.

Example 1: dropping a column on the original affects the copy

We knew that already but this is a healthy reminder. This is NOT what the warning is about.

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
     A    B
0  111  121
1  112  122
2  113  123

>> df2 = df1
>> df2
     A    B
0  111  121
1  112  122
2  113  123

# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
     B
0  121
1  122
2  123

It is possible to prevent changes made on df1 from affecting df2

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
     A    B
0  111  121
1  112  122
2  113  123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
     A    B
0  111  121
1  112  122
2  113  123

# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
     A    B
0  111  121
1  112  122
2  113  123

Example 2: dropping a column on the copy may affect the original

This actually illustrates the warning.

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
     A    B
0  111  121
1  112  122
2  113  123

>> df2 = df1
>> df2
     A    B
0  111  121
1  112  122
2  113  123

# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1
     B
0  121
1  122
2  123

It is possible to prevent changes made on df2 from affecting df1

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
     A    B
0  111  121
1  112  122
2  113  123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
     A    B
0  111  121
1  112  122
2  113  123

>> df2.drop('A', axis=1, inplace=True)
>> df1
     A    B
0  111  121
1  112  122
2  113  123
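For what it's worth, the same independence can be sketched with the DataFrame's own copy() method instead of copy.deepcopy. The example also shows that plain assignment only creates a second name for the same object. The variable names are mine.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [111, 112, 113], "B": [121, 122, 123]})

# Plain assignment: alias and df1 are the very same object
alias = df1
same_object = alias is df1

# DataFrame.copy() gives an independent frame (deep copy by default)
df2 = df1.copy()
df2.drop("A", axis=1, inplace=True)
cols_after_copy_drop = list(df1.columns)  # df1 still has both columns

# Dropping through the alias does change df1
alias.drop("B", axis=1, inplace=True)
cols_after_alias_drop = list(df1.columns)

print(same_object, cols_after_copy_drop, cols_after_alias_drop)
```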

Cheers!



Answer 5:

If you have assigned the slice to a variable and want to set values using that variable, as in the following:

df2 = df[df['A'] > 2]
df2['B'] = value

And you do not want to use Jeff's solution because the condition computing df2 is too long, or for some other reason, then you can use the following:

df.loc[df2.index.tolist(), 'B'] = value 

df2.index.tolist() returns the indices from all entries in df2, which will then be used to set column B in the original dataframe.
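Here is a quick sketch of that idea on a toy frame. The data and value are made up, and it assumes the index labels of df are unique, since otherwise the labels taken from df2 could match extra rows.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
value = 0  # hypothetical value to write

# The slice is only used to find out which rows to touch
df2 = df[df["A"] > 2]

# Write into the original frame, using the slice's index labels
df.loc[df2.index.tolist(), "B"] = value

print(df)
```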



Answer 6:

For me this issue occurred in the following simplified example.

Old code with warning:

def update_old_dataframe(old_dataframe, new_dataframe):
    for new_index, new_row in new_dataframe.iterrows():
        old_dataframe.loc[new_index] = update_row(old_dataframe.loc[new_index], new_row)

def update_row(old_row, new_row):
    for field in [list_of_columns]:
        # line with warning because of chain indexing old_dataframe[new_index][field]
        old_row[field] = new_row[field]
    return old_row

This printed the warning for the line old_row[field] = new_row[field]

Since the rows in update_row method are actually type Series, I replaced the line with:

old_row.at[field] = new_row.at[field] 

i.e. the accessor for label-based lookups on a Series. Even though both work just fine and the result is the same, this way I don't have to disable the warnings (so they are kept for other chained indexing issues elsewhere).
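If the per-row loop is not essential, a possible alternative (my own sketch, not the code above) is to write all the new values in one .loc assignment. It assumes every index label in the new frame already exists in the old one. The frame contents and the cols list are made up.

```python
import pandas as pd

old_dataframe = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
new_dataframe = pd.DataFrame({"x": [10], "y": [30]}, index=["b"])
cols = ["x", "y"]  # hypothetical list of columns to update

# One label-based assignment; pandas aligns the new values on index and columns
old_dataframe.loc[new_dataframe.index, cols] = new_dataframe[cols]

print(old_dataframe)
```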

I hope this may help someone.



Answer 7:

You could avoid the whole problem like this, I believe:

return (
    pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg'))  # dtype={'A': object, 'B': object, 'C': np.float64}
    # no inplace=True here: it would make rename return None and break the chain
    .rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'})
    # positional selection; .ix was removed in pandas 1.0
    .iloc[:, [0,3,2,1,4,5,8,9,30,31]]
    .assign(
        TClose=lambda df: df['TPrice'],
        RT=lambda df: 100 * (df['TPrice']/df['TPCLOSE'] - 1),
        TVol=lambda df: df['TVol']/TVOL_SCALE,
        TAmt=lambda df: df['TAmt']/TAMT_SCALE,
        STK_ID=lambda df: df['STK'].str.slice(13,19),
        STK_Name=lambda df: df['STK'].str.slice(21,30),  # .decode('gb2312')
        TDate=lambda df: df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10]),
    )
)

This uses assign. From the documentation: Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

See Tom Augspurger's article on method chaining in pandas: https://tomaugspurger.github.io/method-chaining
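A stripped-down version of the same chain that can be run on its own is below. The numbers and TVOL_SCALE are invented for illustration. Each lambda receives the frame as it exists at that point in the chain, so nothing refers back to an outer variable.

```python
import pandas as pd

TVOL_SCALE = 100  # hypothetical scale factor

raw = pd.DataFrame(
    {
        "TPrice": [15.0, 5.0],
        "TPCLOSE": [10.0, 10.0],
        "TVol": [1000.0, 2000.0],
        "Unused": [0, 0],
    }
)

result = (
    raw.loc[:, ["TPrice", "TPCLOSE", "TVol"]]
    .assign(
        RT=lambda d: 100 * (d["TPrice"] / d["TPCLOSE"] - 1),
        TVol=lambda d: d["TVol"] / TVOL_SCALE,
    )
)

print(result)
print(raw.columns.tolist())  # assign returned a new frame; raw is unchanged
```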



Answer 8:

This should work:

quote_df.loc[:,'TVol'] = quote_df['TVol']/TVOL_SCALE 

