Python Pandas - Changing some column types to categories

孤独总比滥情好 2021-01-30 01:08

I have fed the following CSV file into iPython Notebook:

public = pd.read_csv("categories.csv")
public

I've also imported pandas as pd, nump

7 answers
  • 2021-01-30 01:15

    You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use:

    df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))
    

    This returns a new DataFrame rather than modifying df, and I don't know of a way to do it in place, so I typically end up assigning the result back like this:

    df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))
    

    Obviously you can replace .select_dtypes with explicit column names if you don't want to select all of a certain datatype (although in your example it seems like you wanted all object types).
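
As a minimal sketch of the apply-and-assign-back pattern with explicit column names: the small frame below is made up for illustration and only borrows the column names from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "parks": ["agree", "disagree", "agree"],
    "playgrounds": ["agree", "agree", "strongly agree"],
    "resident": [1, 2, 3],
})

cols = ["parks", "playgrounds"]

# apply() returns a new DataFrame; df itself is not modified yet.
converted = df[cols].apply(lambda s: s.astype("category"))

# Assign the converted columns back to make the change stick.
df[cols] = converted

print(df.dtypes)
```

Columns not listed in cols, such as resident here, keep their original dtype.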

  • 2021-01-30 01:20

    Jupyter Notebook

    In my case, I had a big DataFrame with many object columns that I wanted to convert to category.

    So I selected the object columns, filled every NA value with 'Missing', and then saved the result back into the original DataFrame:

    # Convert object columns to categories
    obj_df = df.select_dtypes(include=['object']).copy()
    obj_df = obj_df.fillna('Missing')
    for col in obj_df:
        obj_df[col] = obj_df[col].astype('category')
    df[obj_df.columns] = obj_df[obj_df.columns]
    df.head()
    

    I hope this is a helpful reference for later.
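
The effect of the fillna step may be easier to see on a single column. This is only a sketch on a made-up Series: without the fill, missing values stay NaN and are not among the categories; with it, 'Missing' becomes an ordinary category.

```python
import pandas as pd

# Hypothetical column with one missing answer.
answers = pd.Series(["agree", None, "disagree", "agree"])

# Converted as-is: the missing value stays NaN and is not a category.
plain = answers.astype("category")

# Filled first: 'Missing' is now one of the categories.
filled = answers.fillna("Missing").astype("category")

print(list(plain.cat.categories))
print(list(filled.cat.categories))
```

Which variant is preferable depends on whether later steps should treat missing answers as a group of their own or skip them as NaN.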

  • 2021-01-30 01:25

    As of pandas 0.19.0, the What's New notes describe read_csv support for parsing Categorical columns directly. This answer applies only if you're starting from read_csv; otherwise, I think unutbu's answer is still best. Example on 10,000 records:

    import pandas as pd
    import numpy as np
    
    # Generate random data, four category-like columns, two int columns
    N=10000
    categories = pd.DataFrame({
                'parks' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
                'playgrounds' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
                'sports' : np.random.choice(['important', 'very important', 'not important'], size=N),
                'roading' : np.random.choice(['important', 'very important', 'not important'], size=N),
                'resident' : np.random.choice([1, 2, 3], size=N),
                'children' : np.random.choice([0, 1, 2, 3], size=N)
                           })
    categories.to_csv('categories_large.csv', index=False)
    

    <0.19.0 (or >=0.19.0 without specifying dtype)

    pd.read_csv('categories_large.csv').dtypes # inspect default dtypes
    
    children        int64
    parks          object
    playgrounds    object
    resident        int64
    roading        object
    sports         object
    dtype: object
    

    >=0.19.0

    For mixed dtypes, parsing as Categorical can be done by passing a dictionary dtype={'colname' : 'category', ...} to read_csv.

    pd.read_csv('categories_large.csv', dtype={'parks': 'category',
                                               'playgrounds': 'category',
                                               'sports': 'category',
                                               'roading': 'category'}).dtypes
    children          int64
    parks          category
    playgrounds    category
    resident          int64
    roading        category
    sports         category
    dtype: object
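
A self-contained variant of the same call that needs no file on disk, as a sketch: the CSV text below is made up and read through io.StringIO, and only some column names are taken from the example above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for categories_large.csv.
csv_text = """parks,sports,resident
agree,important,1
disagree,very important,2
agree,not important,3
"""

# Only the columns named in the dtype mapping are parsed as categories;
# the remaining columns get their dtypes inferred as usual.
public = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"parks": "category", "sports": "category"},
)

print(public.dtypes)
```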
    

    Performance

    A slight speed-up (local jupyter notebook), as mentioned in the release notes.

    %%timeit
    # unutbu's answer: convert after reading
    public = pd.read_csv('categories_large.csv')
    for col in ['parks', 'playgrounds', 'sports', 'roading']:
        public[col] = public[col].astype('category')

    10 loops, best of 3: 20.1 ms per loop

    %%timeit
    # parsed during read_csv
    category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
    public = pd.read_csv('categories_large.csv', dtype=category_cols)

    100 loops, best of 3: 14.3 ms per loop
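
%%timeit only works inside IPython/Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard timeit module. The data here is generated in memory (a stand-in for categories_large.csv), the function names are made up, and the numbers will differ by machine and pandas version, so no timings are claimed.

```python
import io
import timeit

import numpy as np
import pandas as pd

# Build an in-memory CSV as a hypothetical stand-in for categories_large.csv.
rng = np.random.default_rng(0)
N = 10_000
cat_cols = ["parks", "playgrounds", "sports", "roading"]
frame = pd.DataFrame(
    {col: rng.choice(["agree", "disagree", "strongly agree"], size=N) for col in cat_cols}
)
frame["resident"] = rng.integers(1, 4, size=N)
csv_text = frame.to_csv(index=False)


def convert_after_read():
    """Read with default dtypes, then convert column by column."""
    df = pd.read_csv(io.StringIO(csv_text))
    for col in cat_cols:
        df[col] = df[col].astype("category")
    return df


def convert_during_read():
    """Let read_csv parse the columns as categories directly."""
    return pd.read_csv(io.StringIO(csv_text), dtype={col: "category" for col in cat_cols})


runs = 5
after_s = timeit.timeit(convert_after_read, number=runs) / runs
during_s = timeit.timeit(convert_during_read, number=runs) / runs
print(f"astype after read_csv: {after_s * 1e3:.1f} ms per call")
print(f"dtype= in read_csv:    {during_s * 1e3:.1f} ms per call")
```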
    
  • 2021-01-30 01:30

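    I found that using a for loop works well. Pass whichever target dtype you need to astype (float below; 'category' for the question's case).

    for col in ['col_variable_name_1', 'col_variable_name_2']:  # list every column to convert
        dataframe_name[col] = dataframe_name[col].astype(float)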
    
  • 2021-01-30 01:32

    Sometimes, you just have to use a for-loop:

    for col in ['parks', 'playgrounds', 'sports', 'roading']:
        public[col] = public[col].astype('category')
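
For comparison, and not what this answer does: DataFrame.astype also accepts a column-to-dtype mapping, which gives the same result without an explicit loop. A sketch on made-up data, using the question's column names:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
public = pd.DataFrame({
    "parks": ["agree", "disagree", "agree"],
    "playgrounds": ["agree", "agree", "strongly agree"],
    "sports": ["important", "not important", "important"],
    "roading": ["very important", "important", "important"],
    "resident": [1, 2, 3],
})
cat_cols = ["parks", "playgrounds", "sports", "roading"]

# The loop from the answer, applied to a copy.
looped = public.copy()
for col in cat_cols:
    looped[col] = looped[col].astype("category")

# Alternative: one astype call with a {column: dtype} mapping.
mapped = public.astype({col: "category" for col in cat_cols})

print(mapped.dtypes)
```

Note that astype returns a new DataFrame, so the result has to be assigned to a name, whereas the loop updates the frame column by column.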
    
  • 2021-01-30 01:32

    To make things easier. No apply. No map. No loop.

    cols=data.select_dtypes(exclude='int').columns.to_list()
    data[cols]=data[cols].astype('category')
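
One thing to watch with exclude='int': it keeps every non-integer column, so float columns get converted to category as well. A sketch on made-up data (the score column is hypothetical) comparing it with exclude='number', which leaves all numeric columns alone:

```python
import pandas as pd

# Hypothetical frame with a text, an integer, and a float column.
data = pd.DataFrame({
    "parks": ["agree", "disagree", "agree"],
    "resident": [1, 2, 3],
    "score": [0.5, 1.5, 2.5],
})

# exclude='int' drops only integer columns, so the float column is still selected.
broad = data.select_dtypes(exclude="int").columns.to_list()

# exclude='number' drops every numeric column.
narrow = data.select_dtypes(exclude="number").columns.to_list()

print(broad)
print(narrow)

# Convert only the non-numeric columns.
data[narrow] = data[narrow].astype("category")
print(data.dtypes)
```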
    