Pandas: convert dtype 'object' to int

南旧 2020-11-29 02:02

I've read an SQL query into Pandas and the values are coming in as dtype 'object', although they are strings, dates and integers. I am able to convert the date 'object'

8 Answers
  • 2020-11-29 02:10

    Cannot comment so posting this as an answer, which is somewhat in between @piRSquared/@cyril's solution and @cs95's:

    As noted by @cs95, if your data contains NaNs or Nones, converting to string type will throw an error when trying to convert to int afterwards.

    However, if your data consists of (numerical) strings, using convert_dtypes will convert it to string type unless you use pd.to_numeric as suggested by @cs95 (potentially combined with df.apply()).

    In the case that your data consists only of numerical strings (including NaNs or Nones but without any non-numeric "junk"), a possibly simpler alternative would be to convert first to float and then to one of the nullable-integer extension dtypes provided by pandas (already present in version 0.24) (see also this answer):

    df['purchase'].astype(float).astype('Int64')
    

    Note that there has been recent discussion of this on GitHub (the issue is currently closed, though unresolved), and that for very large 64-bit integers you may have to convert explicitly to float128 to avoid approximation during the conversion.
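
    A minimal sketch of the float-then-Int64 route, assuming a hypothetical `purchase` column that holds only numeric strings and missing values (the sample data is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: numeric strings plus a missing value, stored as object
df = pd.DataFrame({'purchase': ['1', '2', np.nan]}, dtype=object)

# object -> float64 (NaN survives) -> nullable Int64 (NaN becomes <NA>)
purchase = df['purchase'].astype(float).astype('Int64')

print(purchase.dtype)
print(purchase.isna().tolist())
```

    Going through float first is what lets the missing value pass; a direct `astype(int)` has no way to represent it.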

  • 2020-11-29 02:11

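    It's simple: use pd.factorize. Note that it assigns an integer code to each distinct value rather than parsing the values as numbers.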

    pd.factorize(df.purchase)[0]
    

    Example:

    labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
    
    labels
    # array([0, 0, 1, 2, 0])
    
    uniques
    # array(['b', 'a', 'c'], dtype=object)
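
    To make the difference between encoding and parsing concrete, here is a small sketch on invented numeric strings; the series `s` is an assumption, not data from the question:

```python
import pandas as pd

# Hypothetical column of numeric strings
s = pd.Series(['10', '20', '10'])

# factorize labels each distinct value in order of first appearance
codes, uniques = pd.factorize(s)

# to_numeric parses the strings into the numbers they spell out
parsed = pd.to_numeric(s)

print(list(codes))    # integer labels, not the original values
print(list(parsed))   # the actual numeric values
```

    factorize suits categorical labels; when the strings are themselves numbers, pd.to_numeric or astype is probably what is wanted.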
    
  • 2020-11-29 02:21

    My training data contains three features of dtype object. Applying astype converts an object column to a numeric one, but you may need some preprocessing steps before that:

    train.dtypes
    
    C12       object
    C13       object
    C14       object
    
    train['C14'] = train.C14.astype(int)
    
    train.dtypes
    
    C12       object
    C13       object
    C14       int32
    
  • 2020-11-29 02:22

    This was my data

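    ## list of columns 
    l1 = ['PM2.5', 'PM10', 'TEMP', 'BP', ' RH', 'WS','CO', 'O3', 'Nox', 'SO2'] 
    
    for i in l1:
        for j in range(0, 8431):  # rows = 8431
            # .at writes to the frame itself; chained df[i][j] = ... may write to a copy
            df.at[j, i] = int(df.at[j, i])
    

    I recommend using this only with small data. The loop converts one cell at a time, so it performs rows × columns assignments in pure Python, and an object column keeps dtype object afterwards.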
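
    For larger frames, a vectorized conversion is usually the better fit. The sketch below assumes every listed column holds clean integer strings; the two-column frame is invented for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the real data: integer strings stored as object
df = pd.DataFrame({'PM2.5': ['12', '30'], 'TEMP': ['21', '19']}, dtype=object)
cols = ['PM2.5', 'TEMP']

# Convert whole columns at once; astype accepts a {column: dtype} mapping
df = df.astype({c: int for c in cols})

print(df.dtypes)
```

    Unlike the cell-by-cell loop, this also changes the column dtype itself, not just the values stored in it.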

  • 2020-11-29 02:24

    In my case, I had a df with mixed data:

    df:
                         0   1   2    ...                  242                  243                  244
    0   2020-04-22T04:00:00Z   0   0  ...          3,094,409.5         13,220,425.7          5,449,201.1
    1   2020-04-22T06:00:00Z   0   0  ...          3,716,941.5          8,452,012.9          6,541,599.9
    ....
    

    The floats are actually objects, but I need them to be real floats.

    To fix it, referencing @AMC's comment above:

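    def coerce_to_float(val):
        # return val as a float when possible, otherwise leave it unchanged
        try:
            return float(val)
        except (ValueError, TypeError):  # TypeError covers None
            return val
    
    # applymap is deprecated since pandas 2.1 in favor of DataFrame.map
    df = df.applymap(coerce_to_float)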
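
    One caveat: float() rejects strings with thousands separators such as '3,094,409.5', so those cells would be returned unchanged. A possible variant that strips the commas first is sketched below; the helper name `coerce_number` and the sample frame are hypothetical, and it assumes commas only ever act as thousands separators:

```python
import pandas as pd

def coerce_number(val):
    """Return val as a float if it parses (ignoring commas), else val unchanged."""
    candidate = val.replace(',', '') if isinstance(val, str) else val
    try:
        return float(candidate)
    except (ValueError, TypeError):
        return val  # timestamps, None and other non-numeric cells stay as they are

# Hypothetical frame shaped like the excerpt above
df = pd.DataFrame(
    {0: ['2020-04-22T04:00:00Z'], 1: [0], 242: ['3,094,409.5']},
    dtype=object,
)

# Series.map applies the helper element-wise, one column at a time
cleaned = df.apply(lambda col: col.map(coerce_number))

print(cleaned.iloc[0].tolist())
```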
    
  • 2020-11-29 02:28

    pandas >= 1.0

    convert_dtypes

    The (self) accepted answer doesn't take into consideration the possibility of NaNs in object columns.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
         'a': [1, 2, np.nan], 
         'b': [True, False, np.nan]}, dtype=object) 
    df                                                                         
    
         a      b
    0    1   True
    1    2  False
    2  NaN    NaN
    
    df['a'].astype(str).astype(int) # raises ValueError
    

    This chokes because the NaN is converted to a string "nan", and further attempts to coerce to integer will fail. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes:

    df.convert_dtypes()                                                        
    
          a      b
    0     1   True
    1     2  False
    2  <NA>   <NA>
    
    df.convert_dtypes().dtypes                                                 
    
    a      Int64
    b    boolean
    dtype: object
    

    If your data has junk text mixed in with your ints, you can use pd.to_numeric as an initial step:

    s = pd.Series(['1', '2', '...'])
    s.convert_dtypes()  # converts to string, which is not what we want
    
    0      1
    1      2
    2    ...
    dtype: string 
    
    # coerces non-numeric junk to NaNs
    pd.to_numeric(s, errors='coerce')
    
    0    1.0
    1    2.0
    2    NaN
    dtype: float64
    
    # one final `convert_dtypes` call to convert to nullable int
    pd.to_numeric(s, errors='coerce').convert_dtypes() 
    
    0       1
    1       2
    2    <NA>
    dtype: Int64
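
    The same two steps can be run over several columns at once with df.apply. This is only a sketch: the frame and the column names `qty` and `code` are invented for illustration:

```python
import pandas as pd

# Hypothetical frame: numeric strings with some junk mixed in
df = pd.DataFrame({'qty': ['1', '2', '...'], 'code': ['7', 'x', '9']})
cols = ['qty', 'code']

# Coerce junk to NaN column by column, then soft-convert to nullable ints
converted = df[cols].apply(pd.to_numeric, errors='coerce').convert_dtypes()

print(converted.dtypes)
```

    Keyword arguments given to apply are forwarded to pd.to_numeric, which is why errors='coerce' can be passed this way.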
    