Convert Pandas column containing NaNs to dtype `int`

终归单人心 2020-11-22 11:18

I read data from a .csv file into a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is that the column contains missing values, so Pandas reads it in as float and the conversion to int fails.

17 Answers
  • 2020-11-22 11:51

    If you want to use it when you chain methods, you can use assign:

    df = (
         df.assign(col = lambda x: x['col'].astype('Int64'))
    )
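    For illustration, here is the pattern end to end (a minimal sketch; the frame and column name are made up, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose 'col' holds a missing value, so pandas reads it as float
df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})

# Chain the conversion with assign; 'Int64' (capital I) is the nullable integer dtype
df = (
    df.assign(col=lambda x: x['col'].astype('Int64'))
)

print(df['col'].dtype)         # Int64
print(df['col'].isna().sum())  # 1 -- the missing value survives as <NA>
```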
    
  • 2020-11-22 11:52

    I had this problem a few weeks ago with a few discrete features that were formatted as 'object'. This solution seemed to work.

    for col in discrete:
        df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())
    
  • 2020-11-22 11:53

    If you can modify your stored data, use a sentinel value for the missing id. In the common case, suggested by the column name, where id is an integer strictly greater than zero, you could use 0 as the sentinel value so that you can write

    if row['id']:
       regular_process(row)
    else:
       special_process(row)
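    The sentinel can be filled in at conversion time with fillna. A minimal sketch, assuming ids are strictly positive (the frame and values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical id column; one row is missing its id
df = pd.DataFrame({'id': [101.0, np.nan, 203.0]})

# Replace NaN with the sentinel 0, then convert to a plain int column
df['id'] = df['id'].fillna(0).astype(int)

# Rows with id == 0 are the ones needing special handling
special = df[df['id'] == 0]
regular = df[df['id'] != 0]
print(df['id'].tolist())  # [101, 0, 203]
```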
    
  • 2020-11-22 11:55

    You could use .dropna() if it is OK to drop the rows with the NaN values.

    df = df.dropna(subset=['id'])
    

    Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

    I ran into this problem when processing a CSV file with large integers, some of which were missing (NaN). Using float as the type was not an option, because I might lose precision.

    My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

    df = pd.read_csv(filename, dtype={'id':str})
    df["id"] = df["id"].fillna("0").astype(int)
    

    For illustration, here is an example of how floats may lose precision:

    s = "12345678901234567890"
    f = float(s)
    i = int(f)
    i2 = int(s)
    print(f, i, i2)
    

    And the output is:

    1.2345678901234567e+19 12345678901234567168 12345678901234567890
    
  • 2020-11-22 11:58

    Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you can't be sure the placeholder won't show up in your source data, though. My method formats floats without their decimal values and converts nulls to None. The result is an object column that looks like an integer field with null values when written to a CSV.

    keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
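    For illustration, here is how that one-liner behaves on a made-up column (the column name and values are assumptions, not from the answer):

```python
import numpy
import pandas

# Hypothetical column with a large integer, a missing value, and a small one
keep_df = pandas.DataFrame({'big_id': [1.0, numpy.nan, 25000000000.0]})
col = 'big_id'

# Format floats without their decimals; nulls become None, dtype stays object
keep_df[col] = keep_df[col].apply(
    lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x))
)

print(keep_df[col].tolist())  # ['1', None, '25000000000']
```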
    
  • 2020-11-22 12:05

    Since version 0.24, pandas has the ability to hold integer dtypes with missing values.

    Nullable Integer Data Type.

    Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

    arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
    pd.Series(arr)
    
    0      1
    1      2
    2    NaN
    dtype: Int64
    

    To convert a column to nullable integers, use:

    df['myCol'] = df['myCol'].astype('Int64')
    