问题
I'm struggling to understand the concept behind column naming conventions, given that one of the following attempts to create a new column appears to fail:
from numpy.random import randn
import pandas as pd
df = pd.DataFrame({'a':range(0,10,2), 'c':range(0,1000,200)},
columns=list('ac'))
df['b'] = 10*df.a
df
gives the following result:
Yet, if I were to try to create column b by substituting with the following line, there is no error message, yet the dataframe df remains with only the columns a and c.
df.b = 10*df.a ### rather than the previous df['b'] = 10*df.a ###
What has pandas done and why is my command incorrect?
回答1:
What you did was add an attribute b
to your df:
In [70]:
df.b = 10*df.a
df.b
Out[70]:
0 0
1 20
2 40
3 60
4 80
Name: a, dtype: int32
but we see that no new column has been added:
In [73]:
df.columns
Out[73]:
Index(['a', 'c'], dtype='object')
which means we get a KeyError
if we tried df['b']
, to avoid this ambiguity you should always use square brackets when assigning.
for instance if you had a column named index
or sum
or max
then doing df.index
would return the index and not the index column, and similarly df.sum
and df.max
would screw up those df methods.
I strongly advise to always use square brackets, it avoids any ambiguity and the latest ipython is able to resolve column names using square brackets. It's also useful to think of a dataframe as a dict of series in which it makes sense to use square brackets for assigning and returning a column
回答2:
Always use square brackets for assigning columns
Dot notation is a convenience for accessing columns in a dataframe. If they conflict with existing properties (e.g. if you had a column named 'max'), then you need to use square brackets to access that column, e.g. df['max']
. You also need to use square brackets when the column name contains spaces, e.g. df['max value']
.
A DataFrame is just an object which has the usual properties and methods. If you use dot notation for assignment, you are creating a property or method for the dataframe object. So df.val = 2
will assign df
with a property val
that has a value of two. This is very different from df['val'] = 2
which creates a new column in the dataframe and assigns each element in that column the value of two.
To be safe, using square bracket notation will always provide the correct result.
As an aside, your columns=list('ac'))
doesn't do anything, as you are just creating a variable named columns
that is never used. You may have meant df.columns = list('ac')
, but you already assigned those in the creation of the dataframe, so I'm not sure what the intent is with this line of code. And remember that dictionaries are unordered, so that pd.DataFrame({'a': [...], 'b': [...]})
could potentially return a dataframe with columns ['b', 'a']. If this were the case, then assigning column names could potentially mix up the column headers.
回答3:
The issue has to do with how properties are handled in python. There is no restriction in python of setting a new properties for a class, so for example you could do something like
df.myspecialstuff = ["dog", "cat", 5]
So when you do assignment like
df.b = 10*df.a
It is ambiguous whether you want to add a property or a new column, and a property is set. The easiest way to actually see what is going on with this is to use pdb and step through the code
import pdb
x = df.a
pdb.run("df.a1 = x")
This will step into the __setattr__()
whereas pdb.run("df['a2'] = x")
will step into __setitem__()
来源:https://stackoverflow.com/questions/36924407/pandas-column-creation