pandas dataframe convert column type to string or categorical

我在风中等你 2020-12-23 19:31

How do I convert a single column of a pandas dataframe to type string? In the df of housing data below I need to convert zipcode to string so that when I run linear regressi

4 answers
  •  礼貌的吻别
    2020-12-23 20:07

    With pandas >= 1.0 there is now a dedicated string datatype:

    1) You can convert your column to this pandas string datatype using .astype('string'):

    df['zipcode'] = df['zipcode'].astype('string')
    


    2) This differs from .astype(str), which stores the values as Python strings in a column of the generic pandas object dtype:

    df['zipcode'] = df['zipcode'].astype(str)
    


    3) To convert the column to the categorical dtype, use:

    df['zipcode'] = df['zipcode'].astype('category')
    

    You can see the difference between these dtypes in the output of df.info():

    import pandas as pd

    df = pd.DataFrame({
        'zipcode_str': [90210, 90211],
        'zipcode_string': [90210, 90211],
        'zipcode_category': [90210, 90211],
    })
    
    df['zipcode_str'] = df['zipcode_str'].astype(str)
    # built from the column that already holds Python strings
    df['zipcode_string'] = df['zipcode_str'].astype('string')
    df['zipcode_category'] = df['zipcode_category'].astype('category')
    
    df.info()
    
    # you can see that the first column has dtype object
    # while the second column has the new dtype string
    # the third column has dtype category
     #   Column            Non-Null Count  Dtype   
    ---  ------            --------------  -----   
     0   zipcode_str       2 non-null      object  
     1   zipcode_string    2 non-null      string  
     2   zipcode_category  2 non-null      category
    dtypes: category(1), object(1), string(1)
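
    Beyond df.info(), the difference also shows up in what each converted column supports. The following is a minimal sketch, not part of the original answer: the sample values are made up, and it assumes a pandas version in which an integer column can be passed directly to .astype('string').

```python
import pandas as pd

# Hypothetical sample data; the repeated zipcode makes the categorical codes visible.
df = pd.DataFrame({'zipcode': [90210, 90211, 90210]})

as_string = df['zipcode'].astype('string')
as_category = df['zipcode'].astype('category')

# The dedicated string dtype reports itself as "string".
print(as_string.dtype)

# The values are text now, so the .str accessor is available.
print(as_string.str.len().tolist())          # [5, 5, 5]

# A categorical stores each distinct value once (still as integers here)
# and keeps a small integer code per row that points at it.
print(as_category.cat.categories.tolist())   # [90210, 90211]
print(as_category.cat.codes.tolist())        # [0, 1, 0]
```

    Note that the categories stay integers in this sketch. If the categories themselves should be text, converting with .astype(str) before .astype('category') is one option.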
    


    From the docs:

    The 'string' extension type solves several issues with object-dtype NumPy arrays:

    1) You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.

    2) object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.

    3) When reading code, the contents of an object dtype array is less clear than string.
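
    The first two points can be tried out with a small sketch. The frame, the column names, and the values below are hypothetical, and the exact exception type raised on the rejected assignment may vary between pandas versions, so both likely candidates are caught.

```python
import pandas as pd

# Hypothetical housing-style frame: two text columns and one numeric column.
df = pd.DataFrame({
    'zipcode': pd.Series(['90210', '90211'], dtype='string'),
    'city': pd.Series(['Beverly Hills', 'Los Angeles'], dtype='string'),
    'price': [1500000, 980000],
})

# Point 2: text columns can be selected by dtype without picking up the numeric one.
text_cols = df.select_dtypes(include='string').columns.tolist()
print(text_cols)  # ['zipcode', 'city']

# Point 1: a string-dtype column refuses a non-string value on assignment.
zipcodes = pd.Series(['90210', '90211'], dtype='string')
rejected = False
try:
    zipcodes.iloc[0] = 90210  # an int, not a str
except (TypeError, ValueError) as exc:
    rejected = True
    print('rejected:', exc)
```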


    Information about pandas 1.0 can be found here:
    https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html
