Question
What's the best way to insert new rows into an existing pandas DataFrame while maintaining column data types and, at the same time, giving user-defined fill values for columns that aren't specified? Here's an example:
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})
Assume that I want to add a new record passing just name and age. To maintain data types, I can copy rows from df, modify values and then append df to the copy, e.g.
columns = ('name', 'age')
copy_df = df.loc[0:0, columns].copy()
copy_df.loc[0, columns] = 'Cindy', 42
new_df = copy_df.append(df, sort=False).reset_index(drop=True)
But that converts the bool column to an object.
Here's a really hacky solution that doesn't feel like the "right way" to do this:
columns = ('name', 'age')
copy_df = df.loc[0:0].copy()
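# Default fill value for each dtype
missing_remap = {
    'int64': 0,
    'float64': 0.0,
    'bool': False,
    'object': ''
}
# Fill every column that was not supplied with its dtype's default
for c in set(copy_df.columns).difference(columns):
    copy_df.loc[:, c] = missing_remap[str(copy_df[c].dtype)]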
new_df = copy_df.append(df, sort=False).reset_index(drop=True)
new_df.loc[0, columns] = 'Cindy', 42
I know I must be missing something.
Answer 1:
As you found, since NaN is a float, adding NaN to a series may cause it to be either upcast to float or converted to object. You are right in determining this is not a desirable outcome.
There is no straightforward approach. My suggestion is to store your input row data in a dictionary and combine it with a dictionary of defaults before appending. Note that this works because pd.DataFrame.append accepts a dict argument.
In Python 3.5 and later, you can use the syntax {**d1, **d2} to combine two dictionaries, with values from the second taking precedence.
default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}
row = {'name': 'Cindy', 'age': 42}
df = df.append({**default, **row}, ignore_index=True)
print(df)
   age  has_children   name  weight
0   45          True    Bob   143.2
1   40          True    Sue   130.2
2   10         False    Tom    34.9
3   42         False  Cindy     0.0
print(df.dtypes)
age               int64
has_children       bool
name             object
weight          float64
dtype: object
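On pandas 2.0 and later, DataFrame.append is no longer available, so the same idea has to go through pd.concat. The following is a minimal sketch of that variant, not the original answer's code: it builds a one-row frame from the merged dictionaries and casts it to the existing dtypes before concatenating. The names new_row and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False],
})

default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}
row = {'name': 'Cindy', 'age': 42}

# One-row frame: defaults overridden by the values that were supplied,
# cast to the dtypes of the existing frame so concat has nothing to upcast.
new_row = pd.DataFrame([{**default, **row}]).astype(df.dtypes.to_dict())

# Illustrative name; concat keeps each column's dtype because both frames agree.
result = pd.concat([df, new_row], ignore_index=True)

print(result)
print(result.dtypes)
```

Because every column of the new row already has a real value, no NaN is introduced and the column dtypes of df carry over unchanged.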
Answer 2:
It's because a NaN value is a float, while True and False are bool. The column then holds mixed dtypes, so pandas automatically converts it to object.
Another instance of this: if you have a column of all integer values and append a float value, pandas changes the entire column to float, adding '.0' to the remaining values.
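A small sketch of this upcasting, using reindex to introduce a missing value (reindex is just a convenient way to create a NaN slot here; the variable names are illustrative):

```python
import pandas as pd

ints = pd.Series([45, 40, 10])
bools = pd.Series([True, True, False])

# Reindexing to a longer index leaves position 3 missing, so it becomes NaN.
ints_with_gap = ints.reindex(range(4))
bools_with_gap = bools.reindex(range(4))

# Appending a float value to an integer series.
mixed = pd.concat([ints, pd.Series([1.5])], ignore_index=True)

print(ints.dtype, '->', ints_with_gap.dtype)    # int64 -> float64
print(bools.dtype, '->', bools_with_gap.dtype)  # bool -> object
print(mixed.dtype)                              # float64
```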
Edit
Based on the comments, here is another hacky way: append the row, then convert the object column back to bool.
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})
row = {'name': 'Cindy', 'age': 12}
df = df.append(row, ignore_index=True)
# The missing value turned has_children into object; fill it and cast back
df['has_children'] = df['has_children'].fillna(False).astype('bool')
Now the new DataFrame looks like this:
   age  has_children   name  weight
0   45          True    Bob   143.2
1   40          True    Sue   130.2
2   10         False    Tom    34.9
3   12         False  Cindy     NaN
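The same repair can be applied to every column at once by remembering the original dtypes and supplying a fill value per column. This is only a sketch under assumptions: it uses pd.concat rather than append (which was removed in pandas 2.0), and fill_values, appended and restored are hypothetical names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False],
})

# Remember the dtypes before the row is added.
original_dtypes = df.dtypes.to_dict()

row = {'name': 'Cindy', 'age': 12}
# User-defined fill values for the columns the row does not provide.
fill_values = {'weight': 0.0, 'has_children': False}

# The unspecified columns come out of concat as NaN (has_children becomes object).
appended = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

# Fill the gaps column by column, then cast everything back.
restored = appended.fillna(fill_values).astype(original_dtypes)

print(restored)
print(restored.dtypes)
```

Some pandas versions may emit a FutureWarning about downcasting when fillna runs on the object column; the explicit astype afterwards is what actually restores the dtypes.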
Source: https://stackoverflow.com/questions/50650850/insert-rows-into-pandas-dataframe-while-maintaining-column-data-types