Convert Pandas column containing NaNs to dtype `int`

终归单人心 2020-11-22 11:18

I read data from a .csv file into a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is that the column contains missing values, so Pandas reads it in as float and the conversion to int fails.

17 Answers
  • 2020-11-22 11:51

    If you want to use it when you chain methods, you can use assign:

    df = (
         df.assign(col = lambda x: x['col'].astype('Int64'))
    )
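    For illustration, here is the pattern end to end (a minimal sketch; the frame and column name are made up, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose 'col' holds a missing value, so pandas reads it as float
df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})

# Chain the conversion with assign; 'Int64' (capital I) is the nullable integer dtype
df = (
    df.assign(col=lambda x: x['col'].astype('Int64'))
)

print(df['col'].dtype)         # Int64
print(df['col'].isna().sum())  # 1 -- the missing value survives as <NA>
```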
    
  • 2020-11-22 11:52

    I had this problem a few weeks ago with a few discrete features that were formatted as 'object'. This solution seemed to work.

    for col in discrete:
        df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())
    
  • 2020-11-22 11:53

    If you can modify your stored data, use a sentinel value for the missing id. In the common case, suggested by the column name, where id is an integer strictly greater than zero, you could use 0 as the sentinel value so that you can write

    if row['id']:
       regular_process(row)
    else:
       special_process(row)
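    The sentinel can be filled in at conversion time with fillna. A minimal sketch, assuming ids are strictly positive (the frame and values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical id column; one row is missing its id
df = pd.DataFrame({'id': [101.0, np.nan, 203.0]})

# Replace NaN with the sentinel 0, then convert to a plain int column
df['id'] = df['id'].fillna(0).astype(int)

# Rows with id == 0 are the ones needing special handling
special = df[df['id'] == 0]
regular = df[df['id'] != 0]
print(df['id'].tolist())  # [101, 0, 203]
```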
    
  • 2020-11-22 11:55

    You could use .dropna() if it is OK to drop the rows with the NaN values.

    df = df.dropna(subset=['id'])
    

    Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

    I ran into this problem when processing a CSV file with large integers, some of which were missing (NaN). Using float as the type was not an option, because I might lose precision.

    My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

    df = pd.read_csv(filename, dtype={'id':str})
    df["id"] = df["id"].fillna("0").astype(int)
    

    For illustration, here is an example of how floats may lose precision:

    s = "12345678901234567890"
    f = float(s)
    i = int(f)
    i2 = int(s)
    print(f, i, i2)
    

    And the output is:

    1.2345678901234567e+19 12345678901234567168 12345678901234567890
    
  • 2020-11-22 11:58

    Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you can't be sure the placeholder won't show up in your source data, though. My method formats floats without their decimal values and converts nulls to None. The result is an object column that looks like an integer field with null values when written to a CSV.

    keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
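    For illustration, here is how that one-liner behaves on a made-up column (the column name and values are assumptions, not from the answer):

```python
import numpy
import pandas

# Hypothetical column with a large integer, a missing value, and a small one
keep_df = pandas.DataFrame({'big_id': [1.0, numpy.nan, 25000000000.0]})
col = 'big_id'

# Format floats without their decimals; nulls become None, dtype stays object
keep_df[col] = keep_df[col].apply(
    lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x))
)

print(keep_df[col].tolist())  # ['1', None, '25000000000']
```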
    
  • 2020-11-22 12:05

    Since version 0.24, pandas has the ability to hold integer dtypes with missing values.

    Nullable Integer Data Type.

    Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

    arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
    pd.Series(arr)
    
    0      1
    1      2
    2    NaN
    dtype: Int64
    

    To convert a column to nullable integers, use:

    df['myCol'] = df['myCol'].astype('Int64')
    