Pandas column creation

Submitted by 那年仲夏 on 2019-12-10 23:28:32

Question


I'm struggling to understand the concept behind column naming conventions, given that one of the following attempts to create a new column appears to fail:

from numpy.random import randn
import pandas as pd

df = pd.DataFrame({'a':range(0,10,2), 'c':range(0,1000,200)},
                  columns=list('ac'))
df['b'] = 10*df.a
df

gives the following result:

   a    c   b
0  0    0   0
1  2  200  20
2  4  400  40
3  6  600  60
4  8  800  80

However, if I instead try to create column b with the following line, there is no error message, yet the dataframe df still has only the columns a and c.

df.b = 10*df.a   ### rather than the previous df['b'] = 10*df.a ###

What has pandas done and why is my command incorrect?


Answer 1:


What you did was add an attribute b to your df:

In [70]:
df.b = 10*df.a 
df.b

Out[70]:
0     0
1    20
2    40
3    60
4    80
Name: a, dtype: int32

but we see that no new column has been added:

In [73]:    
df.columns

Out[73]:
Index(['a', 'c'], dtype='object')

This means we would get a KeyError if we tried df['b']. To avoid this ambiguity, you should always use square brackets when assigning.

For instance, if you had a column named index, sum or max, then df.index would return the index and not the column named index; similarly, df.sum and df.max would return the DataFrame methods rather than those columns, and assigning to them would overwrite the methods.

I strongly advise always using square brackets: it avoids any ambiguity, and recent versions of IPython can tab-complete column names inside square brackets. It's also useful to think of a DataFrame as a dict of Series, in which case it makes sense to use square brackets for assigning and returning a column.
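A minimal sketch of the behaviour described in this answer, rebuilding the question's df. The helper names (columns_after_dot, lookup_failed, columns_after_brackets) are made up for illustration, and the warning filter is there only because recent pandas versions emit a UserWarning on this kind of attribute assignment:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": range(0, 10, 2), "c": range(0, 1000, 200)})

# Dot assignment with a new name: this sets a plain Python attribute.
# Recent pandas versions warn about it, so the warning is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.b = 10 * df.a

columns_after_dot = list(df.columns)  # 'b' is not a column

# Looking the name up as a column therefore fails.
try:
    df["b"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Square brackets create a real column.
df["b"] = 10 * df.a
columns_after_brackets = list(df.columns)

print(columns_after_dot, lookup_failed, columns_after_brackets)
```

Only the square-bracket assignment changes df.columns; the dot assignment leaves the frame's data untouched.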




Answer 2:


Always use square brackets for assigning columns

Dot notation is a convenience for accessing columns in a dataframe. If a column name conflicts with an existing attribute or method (e.g. if you had a column named 'max'), then you need to use square brackets to access that column, e.g. df['max']. You also need to use square brackets when the column name contains spaces, e.g. df['max value'].

A DataFrame is just an object with the usual attributes and methods. If you use dot notation to assign to a name that is not already a column, you are creating an attribute on the dataframe object. So df.val = 2 gives df an attribute val with the value 2. This is very different from df['val'] = 2, which creates a new column in the dataframe and assigns the value 2 to each element in that column.

To be safe, use square bracket notation: it always gives the correct result.
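To illustrate the points above, here is a short sketch on a made-up frame whose column names are chosen to collide with DataFrame.max and to contain a space. The name plain_attr is hypothetical and exists only for this example:

```python
import pandas as pd

df = pd.DataFrame({"max": [1, 2, 3], "max value": [10, 20, 30]})

# Dot access finds the DataFrame.max method, not the column named 'max'.
dot_gives_method = callable(df.max)

# Square brackets always reach the column, including names with spaces.
max_column = list(df["max"])
spaced_column = list(df["max value"])

# Square-bracket assignment broadcasts the scalar into a new column.
df["val"] = 2

# Dot assignment with a new name only sets an attribute on the object.
df.plain_attr = 2

print(dot_gives_method, max_column, spaced_column)
print(list(df.columns), df.plain_attr)
```

After this runs, val is a column of the frame, while plain_attr is just a Python attribute that does not appear in df.columns.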

As an aside, your columns=list('ac') doesn't add anything here: the dictionary keys already provide those column names, so the argument only pins the column order. You may have meant df.columns = list('ac'), but the names are already assigned when the dataframe is created, so I'm not sure what the intent is with this line of code. Also remember that dictionaries were unordered before Python 3.7, so that pd.DataFrame({'a': [...], 'b': [...]}) could potentially return a dataframe with columns ['b', 'a']. If this were the case, then assigning df.columns afterwards could mix up the column headers, because it relabels the columns by position.
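A small sketch of the difference between passing columns= to the constructor and assigning df.columns afterwards, assuming Python 3.7+ and a pandas version that keeps dict insertion order. The variable names are illustrative only:

```python
import pandas as pd

data = {"a": range(0, 10, 2), "c": range(0, 1000, 200)}

# Column names come from the dict keys, in insertion order.
default_order = list(pd.DataFrame(data).columns)

# columns= in the constructor selects and orders columns by name,
# so each label stays attached to its own data.
reordered = pd.DataFrame(data, columns=list("ca"))

# Assigning df.columns relabels by position, so labels and data can get mixed up.
relabelled = pd.DataFrame(data)
relabelled.columns = list("ca")

print(default_order)
print(list(reordered["c"]))   # still the 0, 200, 400, ... data
print(list(relabelled["c"]))  # now the 0, 2, 4, ... data under the label 'c'
```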




Answer 3:


The issue has to do with how attributes are handled in Python. Python places no restriction on setting new attributes on an object, so for example you could do something like

df.myspecialstuff = ["dog", "cat", 5]

So when you do assignment like

df.b = 10*df.a

it is ambiguous whether you want to add an attribute or a new column, and an attribute is set. The easiest way to see what is actually going on is to use pdb and step through the code:

import pdb
x = df.a
pdb.run("df.a1 = x")

This will step into __setattr__(), whereas pdb.run("df['a2'] = x") will step into __setitem__().
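If stepping through pdb is inconvenient, the same dispatch can be observed without a debugger. The sketch below is not part of the original answer: TracingFrame and calls are hypothetical names, and the subclass only records which hook Python invokes before delegating to pandas. Private names (leading underscore) are skipped because pandas may set those internally.

```python
import warnings

import pandas as pd

calls = []  # records (hook name, target name) pairs


class TracingFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that logs attribute and item assignment."""

    def __setattr__(self, name, value):
        if not name.startswith("_"):  # ignore pandas-internal bookkeeping
            calls.append(("__setattr__", name))
        super().__setattr__(name, value)

    def __setitem__(self, key, value):
        calls.append(("__setitem__", key))
        super().__setitem__(key, value)


df = TracingFrame({"a": range(0, 10, 2)})
x = df.a
calls.clear()  # drop anything recorded while building the frame

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # recent pandas warns on the next line
    df.a1 = x                        # dot notation goes through __setattr__

df["a2"] = x                         # square brackets go through __setitem__

print(calls)
print(list(df.columns))
```

Only the square-bracket assignment ends up as a column; a1 remains a plain attribute on the object.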



Source: https://stackoverflow.com/questions/36924407/pandas-column-creation
