Insert rows into pandas DataFrame while maintaining column data types

Posted by 柔情痞子 on 2020-05-15 03:36:50

Question


What's the best way to insert new rows into an existing pandas DataFrame while maintaining column data types and, at the same time, giving user-defined fill values for columns that aren't specified? Here's an example:

import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})

Assume that I want to add a new record passing just name and age. To maintain data types, I can copy rows from df, modify values and then append df to the copy, e.g.

columns = ('name', 'age')
copy_df = df.loc[0:0, columns].copy()
copy_df.loc[0, columns] = 'Cindy', 42
new_df = copy_df.append(df, sort=False).reset_index(drop=True)

But that converts the bool column to object dtype.

Here's a really hacky solution that doesn't feel like the "right way" to do this:

columns = ('name', 'age')
copy_df = df.loc[0:0].copy()

missing_remap = {
    'int64': 0,
    'float64': 0.0,
    'bool': False,
    'object': ''
}
for c in set(copy_df.columns).difference(columns):
    copy_df.loc[:, c] = missing_remap[str(copy_df[c].dtype)]

new_df = copy_df.append(df, sort=False).reset_index(drop=True)
new_df.loc[0, columns] = 'Cindy', 42

I know I must be missing something.


Answer 1:


As you found, since NaN is a float, adding NaN to a series may cause it to be either upcast to float or converted to object. You are right that this is not a desirable outcome.
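To see the upcasting in isolation, here is a minimal sketch. It uses Series.reindex to introduce a missing value, which is only a stand-in for appending a row that lacks those columns; the variable names are made up for illustration.

```python
import pandas as pd

ages = pd.Series([45, 40, 10])                 # int64
has_children = pd.Series([True, True, False])  # bool

# Reindexing to a label that does not exist yet introduces NaN,
# much like appending a row that does not supply these columns.
ages_ext = ages.reindex(range(4))
has_children_ext = has_children.reindex(range(4))

print(ages_ext.dtype)          # float64
print(has_children_ext.dtype)  # object
```

The int column is upcast to float64 because NaN is a float, and the bool column falls back to object because there is no missing-value marker for a plain bool dtype.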

There is no straightforward approach. My suggestion is to store your input row data in a dictionary and merge it with a dictionary of defaults before appending. This works because pd.DataFrame.append accepts a dict argument (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the code below needs an older pandas).

In Python 3.5 and later, you can use the syntax {**d1, **d2} to combine two dictionaries, with values from the second taking precedence.

default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}

row = {'name': 'Cindy', 'age': 42}

df = df.append({**default, **row}, ignore_index=True)

print(df)

   age  has_children   name  weight
0   45          True    Bob   143.2
1   40          True    Sue   130.2
2   10         False    Tom    34.9
3   42         False  Cindy     0.0

print(df.dtypes)

age               int64
has_children       bool
name             object
weight          float64
dtype: object
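On pandas versions where DataFrame.append is not available, the same idea can probably be expressed with pd.concat. The following is a sketch under that assumption, not part of the original answer; new_row and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})

default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}
row = {'name': 'Cindy', 'age': 42}

# Build a one-row frame in which every column has a value,
# so concatenation never has to introduce NaN.
new_row = pd.DataFrame([{**default, **row}])
result = pd.concat([df, new_row], ignore_index=True)

print(result.dtypes)
```

Because the one-row frame carries a value of the matching type for every column, the concatenated frame keeps the dtypes of df.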



Answer 2:


This happens because NaN is a float, while True and False are bool. The column ends up holding mixed types, so pandas automatically converts it to object.

Another instance of this: if you have a column of all integer values and append a float value, pandas converts the entire column to float, so the existing values gain a '.0'.
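The integer-to-float case can be reproduced with a short sketch like the one below; the series names and the value 12.5 are made up for illustration.

```python
import pandas as pd

ages = pd.Series([45, 40, 10])  # int64
extra = pd.Series([12.5])       # float64

# Combining int64 with float64 yields a float64 result.
combined = pd.concat([ages, extra], ignore_index=True)

print(combined.dtype)     # float64
print(combined.tolist())  # [45.0, 40.0, 10.0, 12.5]
```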


Edit

Based on the comments, here is another hacky way: fill the missing values and convert the object column back to bool dtype.

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})
row = {'name': 'Cindy', 'age': 12}
df = df.append(row, ignore_index=True)
df['has_children'] = df['has_children'].fillna(False).astype('bool')

Now the new DataFrame looks like this:

   age  has_children   name  weight
0   45          True    Bob   143.2
1   40          True    Sue   130.2
2   10         False    Tom    34.9
3   12         False  Cindy     NaN
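The same fill-then-cast idea might be extended to several columns at once. This is only a sketch: fill_values is a hypothetical dict of per-column defaults, and pd.concat stands in for append so that it also runs on newer pandas. Some pandas versions may emit a FutureWarning about downcasting in fillna here.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})
row = {'name': 'Cindy', 'age': 12}

# Hypothetical per-column defaults for the columns the new row omits.
fill_values = {'weight': 0.0, 'has_children': False}

# The new row lacks 'weight' and 'has_children', so NaN appears there.
combined = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

# Fill the gaps column by column, then cast back to the original dtypes.
restored = combined.fillna(fill_values).astype(df.dtypes.to_dict())

print(restored.dtypes)
```

Casting with df.dtypes.to_dict() avoids hard-coding a dtype per column, at the cost of a full pass over the frame after every insert.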


Source: https://stackoverflow.com/questions/50650850/insert-rows-into-pandas-dataframe-while-maintaining-column-data-types
