what are all the dtypes that pandas recognizes?

后端 未结 3 2078
醉梦人生
醉梦人生 2020-11-30 02:03

For pandas, would anyone know, if any datatype apart from

(i) float64, int64 (and other variants of np.number like floa

相关标签:
3条回答
  • 2020-11-30 02:54

    EDIT Feb 2020 following pandas 1.0.0 release

    Pandas mostly uses NumPy arrays and dtypes for each Series (a dataframe is a collection of Series, each which can have its own dtype). NumPy's documentation further explains dtype, data types, and data type objects. In addition, the answer provided by @lcameron05 provides an excellent description of the numpy dtypes. Furthermore, the pandas docs on dtypes have a lot of additional information.

    The main types stored in pandas objects are float, int, bool, datetime64[ns], timedelta[ns], and object. In addition these dtypes have item sizes, e.g. int64 and int32.

    By default integer types are int64 and float types are float64, REGARDLESS of platform (32-bit or 64-bit). The following will all result in int64 dtypes.

    Numpy, however will choose platform-dependent types when creating arrays. The following WILL result in int32 on 32-bit platform. One of the major changes to version 1.0.0 of pandas is the introduction of pd.NA to represent scalar missing values (rather than the previous values of np.nan, pd.NaT or None, depending on usage).

    Pandas extends NumPy's type system and also allows users to write their on extension types. The following lists all of pandas extension types.

    1) Time zone handling

    Kind of data: tz-aware datetime (note that NumPy does not support timezone-aware datetimes).

    Data type: DatetimeTZDtype

    Scalar: Timestamp

    Array: arrays.DatetimeArray

    String Aliases: 'datetime64[ns, ]'

    2) Categorical data

    Kind of data: Categorical

    Data type: CategoricalDtype

    Scalar: (none)

    Array: Categorical

    String Aliases: 'category'

    3) Time span representation

    Kind of data: period (time spans)

    Data type: PeriodDtype

    Scalar: Period

    Array: arrays.PeriodArray

    String Aliases: 'period[]', 'Period[]'

    4) Sparse data structures

    Kind of data: sparse

    Data type: SparseDtype

    Scalar: (none)

    Array: arrays.SparseArray

    String Aliases: 'Sparse', 'Sparse[int]', 'Sparse[float]'

    5) IntervalIndex

    Kind of data: intervals

    Data type: IntervalDtype

    Scalar: Interval

    Array: arrays.IntervalArray

    String Aliases: 'interval', 'Interval', 'Interval[]', 'Interval[datetime64[ns, ]]', 'Interval[timedelta64[]]'

    6) Nullable integer data type

    Kind of data: nullable integer

    Data type: Int64Dtype, ...

    Scalar: (none)

    Array: arrays.IntegerArray

    String Aliases: 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'

    7) Working with text data

    Kind of data: Strings

    Data type: StringDtype

    Scalar: str

    Array: arrays.StringArray

    String Aliases: 'string'

    8) Boolean data with missing values

    Kind of data: Boolean (with NA)

    Data type: BooleanDtype

    Scalar: bool

    Array: arrays.BooleanArray

    String Aliases: 'boolean'

    0 讨论(0)
  • 2020-11-30 02:54

    Building on other answers, pandas also includes a number of its own dtypes.

    Pandas and third-party libraries extend NumPy’s type system in a few places. This section describes the extensions pandas has made internally. See Extension types for how to write your own extension that works with pandas. See Extension data types for a list of third-party libraries that have implemented an extension.

    The following table lists all of pandas extension types. See the respective document

    https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#dtypes

    Also, pandas 1.0 will have a string dtype.

    0 讨论(0)
  • 2020-11-30 03:01

    pandas borrows its dtypes from numpy. For demonstration of this see the following:

    import pandas as pd
    
    df = pd.DataFrame({'A': [1,'C',2.]})
    df['A'].dtype
    
    >>> dtype('O')
    
    type(df['A'].dtype)
    
    >>> numpy.dtype
    

    You can find the list of valid numpy.dtypes in the documentation:

    '?' boolean

    'b' (signed) byte

    'B' unsigned byte

    'i' (signed) integer

    'u' unsigned integer

    'f' floating-point

    'c' complex-floating point

    'm' timedelta

    'M' datetime

    'O' (Python) objects

    'S', 'a' zero-terminated bytes (not recommended)

    'U' Unicode string

    'V' raw data (void)

    pandas should support these types. Using the astype method of a pandas.Series object with any of the above options as the input argument will result in pandas trying to convert the Series to that type (or at the very least falling back to object type); 'u' is the only one that I see pandas not understanding at all:

    df['A'].astype('u')
    
    >>> TypeError: data type "u" not understood
    

    This is a numpy error that results because the 'u' needs to be followed by a number specifying the number of bytes per item in (which needs to be valid):

    import numpy as np
    
    np.dtype('u')
    
    >>> TypeError: data type "u" not understood
    
    np.dtype('u1')
    
    >>> dtype('uint8')
    
    np.dtype('u2')
    
    >>> dtype('uint16')
    
    np.dtype('u4')
    
    >>> dtype('uint32')
    
    np.dtype('u8')
    
    >>> dtype('uint64')
    
    # testing another invalid argument
    np.dtype('u3')
    
    >>> TypeError: data type "u3" not understood
    

    To summarise, the astype methods of pandas objects will try and do something sensible with any argument that is valid for numpy.dtype. Note that numpy.dtype('f') is the same as numpy.dtype('float32') and numpy.dtype('f8') is the same as numpy.dtype('float64') etc. Same goes for passing the arguments to pandas astype methods.

    To locate the respective data type classes in NumPy, the Pandas docs recommends this:

    def subdtypes(dtype):
        subs = dtype.__subclasses__()
        if not subs:
            return dtype
        return [dtype, [subdtypes(dt) for dt in subs]]
    
    subdtypes(np.generic)
    

    Output:

    [numpy.generic,
     [[numpy.number,
       [[numpy.integer,
         [[numpy.signedinteger,
           [numpy.int8,
            numpy.int16,
            numpy.int32,
            numpy.int64,
            numpy.int64,
            numpy.timedelta64]],
          [numpy.unsignedinteger,
           [numpy.uint8,
            numpy.uint16,
            numpy.uint32,
            numpy.uint64,
            numpy.uint64]]]],
        [numpy.inexact,
         [[numpy.floating,
           [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
          [numpy.complexfloating,
           [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
      [numpy.flexible,
       [[numpy.character, [numpy.bytes_, numpy.str_]],
        [numpy.void, [numpy.record]]]],
      numpy.bool_,
      numpy.datetime64,
      numpy.object_]]
    

    Pandas accepts these classes as valid types. For example, dtype={'A': np.float}.

    NumPy docs contain more details and a chart:

    0 讨论(0)
提交回复
热议问题