NumPy or Pandas: Keeping array type as integer while having a NaN value


Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element ins

8 Answers
  • 2020-11-22 06:17

    If there are blanks in the text data, columns that would normally be integers are cast to float64, because the int64 dtype cannot handle nulls. This can cause an inconsistent schema if you are loading multiple files: those with blanks end up as float64, and those without end up as int64.

    This code attempts to convert every numeric column to Int64 (as opposed to int64), since Int64 can handle nulls:

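    import pandas as pd
    import numpy as np
    
    # mydf is assumed to be an existing DataFrame, e.g. one loaded from a file
    
    # Show dtypes before the transformation
    print(mydf.dtypes)
    
    for c in mydf.select_dtypes(np.number).columns:
        try:
            mydf[c] = mydf[c].astype('Int64')
            print('cast {} to Int64'.format(c))
        except (TypeError, ValueError):
            # e.g. a float column holding non-integer values
            print('could not cast {} to Int64'.format(c))
    
    # Show dtypes after the transformation
    print(mydf.dtypes)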
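
    A self-contained sketch of the same loop, for readers without a DataFrame at hand. The frame df and its column names are made up for illustration; column "a" holds whole numbers plus a blank, while column "b" holds fractional values and is expected to stay float64.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for a file that was loaded with blanks.
df = pd.DataFrame({
    "a": [1, 2, np.nan],        # whole numbers plus a blank -> float64 on load
    "b": [1.5, 2.5, np.nan],    # genuinely fractional values
    "c": ["x", "y", None],      # non-numeric, skipped by select_dtypes
})

for col in df.select_dtypes(np.number).columns:
    try:
        df[col] = df[col].astype("Int64")
    except (TypeError, ValueError):
        # Fractional values cannot be cast safely; leave the column as is.
        pass

print(df.dtypes)
```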
    
  • 2020-11-22 06:24

    Just wanted to add that if you are converting a float vector (e.g. 1.143) that contains NA to integer (1), casting directly to the new 'Int64' dtype gives you an error. To solve this, round the numbers first and then call ".astype('Int64')".

    s1 = pd.Series([1.434, 2.343, np.nan])
    # Without round(), the next line raises an error:
    # cannot safely cast non-equivalent float64 to int64
    s1.astype('Int64')
    # With round() it works:
    s1.round().astype('Int64')
    0      1
    1      2
    2    NaN
    dtype: Int64
    

    My use case is a float series that I want to round to integers. After .round() alone, a trailing '.0' remains on each number; converting to an integer dtype drops it.
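
    A runnable sketch of the round-then-cast step, with the failing direct cast wrapped in try/except so the script continues. The variable names direct_cast_ok and rounded are made up for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.434, 2.343, np.nan])

# A direct cast is refused because the values are not whole numbers.
try:
    s1.astype("Int64")
    direct_cast_ok = True
except (TypeError, ValueError):
    direct_cast_ok = False

# Rounding first yields whole-number floats, which cast cleanly.
rounded = s1.round().astype("Int64")

print(direct_cast_ok)
print(rounded)
```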

  • 2020-11-22 06:26

    Pandas v0.24+

    Support for missing values in integer series is available from v0.24 onwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
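
    A minimal sketch of the nullable integer dtype, assuming pandas 0.24 or later. The dtype has to be requested explicitly with the capitalised string "Int64"; the variable names are illustrative.

```python
import pandas as pd

# Request the nullable extension dtype explicitly; note the capital "I".
s = pd.Series([1, 2, None], dtype="Int64")

# Arithmetic propagates the missing value; sum() skips it by default.
shifted = s + 1
total = s.sum()

print(s.dtype)
print(shifted)
print(total)
```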

    Pandas v0.23 and earlier

    In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.

    The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:

    s = pd.Series([1, 2, 3, np.nan])
    
    print(s.astype(object))
    
    0      1
    1      2
    2      3
    3    NaN
    dtype: object
    

    For cosmetic reasons, e.g. output to a file, this may be preferable.

    Pandas v0.23 and earlier: background

    NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcast to float:

    In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays.

    This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”.

    The docs also provide rules for upcasting due to NaN inclusion:

    Typeclass   Promotion dtype for storing NAs
    floating    no change
    object      no change
    integer     cast to float64
    boolean     cast to object
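
    The promotion rules above can be observed directly. This sketch uses reindex to introduce a missing entry, which is just one convenient way to do so; the series names are illustrative.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
bools = pd.Series([True, False, True])
floats = pd.Series([1.0, 2.0, 3.0])

# Label 3 does not exist, so reindexing introduces a missing value.
ints_with_gap = ints.reindex([0, 1, 2, 3])
bools_with_gap = bools.reindex([0, 1, 2, 3])
floats_with_gap = floats.reindex([0, 1, 2, 3])

print(ints.dtype, "->", ints_with_gap.dtype)
print(bools.dtype, "->", bools_with_gap.dtype)
print(floats.dtype, "->", floats_with_gap.dtype)
```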
    
  • 2020-11-22 06:29

    This is now possible as of pandas v0.24.0.

    From the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."

  • 2020-11-22 06:31

    NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:

    http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

    (This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )
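
    The underlying NumPy limitation is easy to reproduce. In this sketch (variable names are illustrative), assigning NaN into an int64 array fails, and building an array from mixed integers and NaN silently yields float64.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int64)

# An integer array has no bit pattern for NaN, so the assignment is rejected.
try:
    arr[0] = np.nan
    stored = True
except ValueError:
    stored = False

# Letting NumPy infer the dtype upcasts the whole array to float instead.
mixed = np.array([1, 2, np.nan])

print(stored)
print(mixed.dtype)
```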

  • 2020-11-22 06:31

    This is not a solution for all cases, but for mine (genomic coordinates) I've resorted to using 0 as a stand-in for NaN:

    a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
    

    This at least allows the proper 'native' column type to be used, so operations like subtraction and comparison work as expected.
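
    A sketch of the sentinel approach on made-up data. It assumes 0 never occurs as a real value in the column; the frame contents and the was_missing mask are illustrative additions, recorded before filling so the missing rows can still be identified afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates with one missing entry.
a3 = pd.DataFrame({"MapInfo": [1200.0, np.nan, 3400.0]})

# Remember which rows were missing before the sentinel overwrites them.
was_missing = a3["MapInfo"].isna()

# Assumption: 0 is never a valid coordinate, so it can stand in for NaN.
a3["MapInfo"] = a3["MapInfo"].fillna(0).astype(int)

print(a3["MapInfo"].dtype)
print(a3["MapInfo"].tolist())
```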
