Insert rows into pandas DataFrame while maintaining column data types

Question

What's the best way to insert new rows into an existing pandas DataFrame while maintaining column data types and, at the same time, giving user-defined fill values for columns that aren't specified? Here's an example:

import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})

Assume that I want to add a new record passing just name and age. To maintain data types, I can copy rows from df, modify values and then append df to the copy, e.g.

columns = ('name', 'age')
copy_df = df.loc[0:0, columns].copy()
copy_df.loc[0, columns] = 'Cindy', 42
new_df = copy_df.append(df, sort=False).reset_index(drop=True)

But that converts the bool column to object dtype.

Here's a really hacky solution that doesn't feel like the "right way" to do this:

columns = ('name', 'age')
copy_df = df.loc[0:0].copy()

missing_remap = {
    'int64': 0,
    'float64': 0.0,
    'bool': False,
    'object': ''
}
for c in set(copy_df.columns).difference(columns):
    copy_df.loc[:, c] = missing_remap[str(copy_df[c].dtype)]

new_df = copy_df.append(df, sort=False).reset_index(drop=True)
new_df.loc[0, columns] = 'Cindy', 42

I know I must be missing something.

Answer 1:

As you found, NaN is a float, so adding NaN to a Series can cause it to be upcast to float or converted to object. You are right that this is not a desirable outcome.
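A minimal sketch of that upcasting, assuming NumPy-backed default dtypes. Here `reindex` is only a convenient way to introduce a missing value, and the variable names are illustrative:

```python
import pandas as pd

ages = pd.Series([45, 40, 10])                 # int64
has_children = pd.Series([True, True, False])  # bool

# Reindexing to a label that does not exist inserts NaN at that position.
ages_with_gap = ages.reindex(range(4))
flags_with_gap = has_children.reindex(range(4))

print(ages_with_gap.dtype)   # int64 cannot hold NaN, so the values are upcast to float64
print(flags_with_gap.dtype)  # bool cannot hold NaN either, so the values become object
```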

There is no straightforward approach. My suggestion is to store your input row data in a dictionary and combine it with a dictionary of defaults before appending. Note that this works because pd.DataFrame.append accepts a dict argument.

In Python 3.5 and later, you can use the syntax {**d1, **d2} to combine two dictionaries, with values from the second taking precedence.

default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}

row = {'name': 'Cindy', 'age': 42}

df = df.append({**default, **row}, ignore_index=True)

print(df)

   age  has_children   name  weight
0   45          True    Bob   143.2
1   40          True    Sue   130.2
2   10         False    Tom    34.9
3   42         False  Cindy     0.0

print(df.dtypes)

age               int64
has_children       bool
name             object
weight          float64
dtype: object
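
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above fails on current versions. Below is a minimal sketch of the same defaults-plus-row idea using pd.concat. The names new_row and result, and the astype step, are additions for illustration rather than part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False],
})

default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}
row = {'name': 'Cindy', 'age': 42}

# Build a one-row frame from the merged dict, then cast it to the
# dtypes of the existing frame so concat has nothing to upcast.
new_row = pd.DataFrame([{**default, **row}]).astype(df.dtypes.to_dict())
result = pd.concat([df, new_row], ignore_index=True)

print(result.dtypes)
```

Casting the one-row frame before concatenating guards against a default of the wrong type, such as 0.0 supplied for an integer column.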

Answer 2:

This happens because NaN is a float, while True and False are bool. The column ends up holding mixed types, so pandas automatically converts it to object.

Another instance of this: if a column holds only integer values and you append a float value, pandas changes the entire column to float, so the existing values gain a trailing '.0'.
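
A small sketch of the integer-to-float case, using pd.concat on two illustrative Series (the names are hypothetical):

```python
import pandas as pd

ages = pd.Series([45, 40, 10])  # int64
extra = pd.Series([12.5])       # float64

combined = pd.concat([ages, extra], ignore_index=True)

print(combined.dtype)     # float64, the common dtype of int64 and float64
print(combined.tolist())  # [45.0, 40.0, 10.0, 12.5]
```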

Edit

Based on the comments, here is another hacky way to convert the object column back to bool dtype:

import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})
row = {'name': 'Cindy', 'age': 12}
df = df.append(row, ignore_index=True)
df['has_children'] = df['has_children'].fillna(False).astype('bool')

The new DataFrame now looks like this:

   age  has_children   name  weight
0   45          True    Bob   143.2
1   40          True    Sue   130.2
2   10         False    Tom    34.9
3   12         False  Cindy     NaN
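
The same after-the-fact repair can be applied to every column at once. The sketch below is a generalization under stated assumptions, not the answer's own code: it uses pd.concat because DataFrame.append is unavailable in pandas 2.0 and later, and the fill_values mapping and variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False],
})

# Remember the dtypes before the new row disturbs them.
original_dtypes = df.dtypes.to_dict()

# Hypothetical user-defined fill values for columns the new row omits.
fill_values = {'weight': 0.0, 'has_children': False}

row = pd.DataFrame([{'name': 'Cindy', 'age': 12}])
combined = pd.concat([df, row], ignore_index=True)  # omitted columns become NaN

# Fill the gaps, then cast every column back to its original dtype.
restored = combined.fillna(fill_values).astype(original_dtypes)

print(restored.dtypes)
```

Unlike the single-column fix above, this also fills weight, so no NaN remains in the new row.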

Source: https://stackoverflow.com/questions/50650850/insert-rows-into-pandas-dataframe-while-maintaining-column-data-types

Tags

python

pandas

dataframe

append