Selecting Pandas Columns by dtype

爱一瞬间的悲伤 2020-11-29 01:22

Is there an elegant, shorthand way to select columns of a Pandas DataFrame by data type (dtype)? For example, selecting only the int64 columns from a DataFrame.

9 Answers
  • 2020-11-29 01:30
    df.select_dtypes(include=[np.float64])
    
  • 2020-11-29 01:33
    df.loc[:, df.dtypes == np.float64]
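
For the int64 case the question asks about, the two one-liners above could be adapted roughly as follows. This is a minimal sketch: the sample frame and its column names are made up for illustration and do not come from the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; the column names are illustrative only.
df = pd.DataFrame({
    "n": np.array([1, 2, 3], dtype="int64"),
    "x": np.array([0.5, 1.5, 2.5], dtype="float64"),
    "flag": [True, False, True],
})

# Approach 1: let pandas filter the columns by dtype.
by_method = df.select_dtypes(include=[np.int64])

# Approach 2: build a boolean mask from df.dtypes and pass it to .loc.
by_mask = df.loc[:, df.dtypes == np.int64]

print(list(by_method.columns))
print(list(by_mask.columns))
```

Both approaches should pick out the same single column here; select_dtypes is usually the more readable of the two when several dtypes are involved.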
    
  • 2020-11-29 01:34

    I'd like to extend the existing answers by adding options for selecting all floating-point dtypes or all integer dtypes:

    Demo:

    import numpy as np
    import pandas as pd

    np.random.seed(1234)
    
    df = pd.DataFrame({
            'a':np.random.rand(3), 
            'b':np.random.rand(3).astype('float32'), 
            'c':np.random.randint(10,size=(3)).astype('int16'),
            'd':np.arange(3).astype('int32'), 
            'e':np.random.randint(10**7,size=(3)).astype('int64'),
            'f':np.random.choice([True, False], 3),
            'g':pd.date_range('2000-01-01', periods=3)
         })
    

    yields:

    In [2]: df
    Out[2]:
              a         b  c  d        e      f          g
    0  0.191519  0.785359  6  0  7578569  False 2000-01-01
    1  0.622109  0.779976  8  1  7981439   True 2000-01-02
    2  0.437728  0.272593  0  2  2558462   True 2000-01-03
    
    In [3]: df.dtypes
    Out[3]:
    a           float64
    b           float32
    c             int16
    d             int32
    e             int64
    f              bool
    g    datetime64[ns]
    dtype: object
    

    Selecting all floating-point columns:

    In [4]: df.select_dtypes(include=['floating'])
    Out[4]:
              a         b
    0  0.191519  0.785359
    1  0.622109  0.779976
    2  0.437728  0.272593
    
    In [5]: df.select_dtypes(include=['floating']).dtypes
    Out[5]:
    a    float64
    b    float32
    dtype: object
    

    Selecting all integer columns:

    In [6]: df.select_dtypes(include=['integer'])
    Out[6]:
       c  d        e
    0  6  0  7578569
    1  8  1  7981439
    2  0  2  2558462
    
    In [7]: df.select_dtypes(include=['integer']).dtypes
    Out[7]:
    c    int16
    d    int32
    e    int64
    dtype: object
    

    Selecting all numeric columns:

    In [8]: df.select_dtypes(include=['number'])
    Out[8]:
              a         b  c  d        e
    0  0.191519  0.785359  6  0  7578569
    1  0.622109  0.779976  8  1  7981439
    2  0.437728  0.272593  0  2  2558462
    
    In [9]: df.select_dtypes(include=['number']).dtypes
    Out[9]:
    a    float64
    b    float32
    c      int16
    d      int32
    e      int64
    dtype: object
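
The demo above leaves the bool and datetime columns unselected. The same string selectors can be used for those, and the exclude parameter inverts a selection. The sketch below uses a smaller, hand-written frame (an assumption for brevity, not the random data from the demo) with the same column names.

```python
import numpy as np
import pandas as pd

# Reduced version of the demo frame with fixed values instead of random ones.
df = pd.DataFrame({
    "a": np.array([0.1, 0.2, 0.3], dtype="float64"),
    "c": np.array([6, 8, 0], dtype="int16"),
    "f": [False, True, True],
    "g": pd.date_range("2000-01-01", periods=3),
})

# Boolean columns only.
bool_cols = df.select_dtypes(include=["bool"]).columns.tolist()

# Datetime columns only.
datetime_cols = df.select_dtypes(include=["datetime"]).columns.tolist()

# Everything that is not numeric: drops the float and integer columns.
non_numeric_cols = df.select_dtypes(exclude=["number"]).columns.tolist()

print(bool_cols, datetime_cols, non_numeric_cols)
```

Note that bool columns are not treated as numeric by the 'number' selector, which matches Out[8] above, where column f is absent.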
    
  • 2020-11-29 01:36

    Alternatively, if you don't want to create a subset of the DataFrame in the process, you can iterate over the column dtypes directly and collect the matching column names.

    I haven't benchmarked the code below, but I assume it will be faster on very large datasets.

    [col for col in df.columns.tolist() if df[col].dtype not in ['object','<M8[ns]']] 
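
A variant of the list comprehension above, sketched with the dtype predicates from pandas.api.types instead of string comparisons. The frame and its column names are hypothetical, and no performance claim is made for either version.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_object_dtype

# Hypothetical sample frame; the column names are illustrative only.
df = pd.DataFrame({
    "num": [1, 2, 3],
    "ratio": [0.1, 0.2, 0.3],
    "label": pd.Series(["x", "y", "z"], dtype="object"),
    "when": pd.date_range("2000-01-01", periods=3),
})

# Walk over (column name, dtype) pairs without building a sub-DataFrame,
# keeping every column that is neither object nor datetime.
kept = [
    name
    for name, dtype in df.dtypes.items()
    if not (is_object_dtype(dtype) or is_datetime64_any_dtype(dtype))
]

print(kept)
```

The result is a plain list of column names, which can then be passed to df[kept] only when the data is actually needed.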
    
  • 2020-11-29 01:47

    If you want to select int64 columns and then update "in place", you can use:

    int64_cols = [col for col in df.columns if is_int64_dtype(df[col].dtype)]
    df[int64_cols]
    

    For example, notice that I update all the int64 columns in df to zero below:

    In [1]:
    
        import pandas as pd
        from pandas.api.types import is_int64_dtype
    
        df = pd.DataFrame({'a': [1, 2] * 3,
                           'b': [True, False] * 3,
                           'c': [1.0, 2.0] * 3,
                           'd': ['red','blue'] * 3,
                           'e': pd.Series(['red','blue'] * 3, dtype="category"),
                           'f': pd.Series([1, 2] * 3, dtype="int64")})
    
        int64_cols = [col for col in df.columns if is_int64_dtype(df[col].dtype)] 
        print('int64 Cols: ',int64_cols)
    
        print(df[int64_cols])
    
        df[int64_cols] = 0
    
        print(df[int64_cols]) 
    
    Out [1]:
    
        int64 Cols:  ['a', 'f']
    
               a  f
            0  1  1
            1  2  2
            2  1  1
            3  2  2
            4  1  1
            5  2  2
               a  f
            0  0  0
            1  0  0
            2  0  0
            3  0  0
            4  0  0
            5  0  0
    

    Just for completeness:

    df.loc[] with a boolean column mask and df.select_dtypes() both return a copy of part of the DataFrame. If you then assign into that result (chained assignment), pandas may emit a SettingWithCopyWarning, and df itself is not updated.

    For example, when I try to update df through a selection made with .loc[] or .select_dtypes(), nothing happens to df:

    In [2]:
    
        import numpy as np

        df = pd.DataFrame({'a': [1, 2] * 3,
                           'b': [True, False] * 3,
                           'c': [1.0, 2.0] * 3,
                           'd': ['red','blue'] * 3,
                           'e': pd.Series(['red','blue'] * 3, dtype="category"),
                           'f': pd.Series([1, 2] * 3, dtype="int64")})
    
        df_bool = df.select_dtypes(include='bool')
        df_bool.b[0] = False
    
        print(df_bool.b[0])
        print(df.b[0])
    
        df.loc[:, df.dtypes == np.int64].a[0]=7
        print(df.a[0])
    
    Out [2]:
    
        False
        True
        1
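
One way to combine the two ideas in this answer is to use select_dtypes() only to obtain the column names, then assign through df itself so the original frame is modified. The following is a minimal sketch on a shortened version of the example frame; the shortening is an assumption made for brevity.

```python
import pandas as pd

# Shortened version of the example frame from this answer.
df = pd.DataFrame({
    "a": pd.Series([1, 2, 1, 2], dtype="int64"),
    "b": [True, False, True, False],
    "c": [1.0, 2.0, 1.0, 2.0],
})

# Take only the names from the dtype selection, not the copied data.
int64_cols = df.select_dtypes(include=["int64"]).columns.tolist()

# Assigning through df with those names updates df itself.
df[int64_cols] = 0

# A single cell can be updated the same way with one .loc call
# (row label and column name together, no chained indexing).
df.loc[0, "b"] = False

print(df)
```

Because every assignment targets df directly, there is no intermediate copy for the write to get lost in.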
    
  • 2020-11-29 01:48

    Since pandas 0.14.1 there's a select_dtypes method, so you can do this more elegantly and generally.

    In [11]: df = pd.DataFrame([[1, 2.2, 'three']], columns=['A', 'B', 'C'])
    
    In [12]: df.select_dtypes(include=['int'])
    Out[12]:
       A
    0  1
    

    To select all numeric types, use NumPy's abstract numeric type numpy.number:

    In [13]: df.select_dtypes(include=[np.number])
    Out[13]:
       A    B
    0  1  2.2
    
    In [14]: df.select_dtypes(exclude=[object])
    Out[14]:
       A    B
    0  1  2.2
    