Retaining categorical dtype upon dataframe concatenation

谎友^ 2020-12-16 16:29

I have two dataframes with identical column names and dtypes, similar to the following:

A             object
B             category
C             category


        
3 Answers
  • 2020-12-16 16:50

    JohnE's answer is helpful, but in pandas 0.19.2, union_categoricals can only be imported as follows: from pandas.types.concat import union_categoricals
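
    If the same script has to run on both old and new pandas installations, one option is to try the public import path first and fall back to the 0.19-era path quoted above. This is only a sketch: it assumes these two module paths are the only ones that need covering.

```python
# Prefer the public location used in the other answers; fall back to the
# internal path quoted above for pandas 0.19.x, where the public one is missing.
try:
    from pandas.api.types import union_categoricals
except ImportError:
    from pandas.types.concat import union_categoricals
```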

  • 2020-12-16 16:59

    I don't think this is completely obvious from the documentation, but you could do something like the following. Here's some sample data:

    import pandas as pd

    df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
    df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})
    

    Use union_categoricals to get consistent categories across dataframes. Check df1.x.cat.codes and df2.x.cat.codes afterwards if you need to convince yourself that this works.

    from pandas.api.types import union_categoricals
    
    uc = union_categoricals([df1.x,df2.x])
    df1.x = pd.Categorical( df1.x, categories=uc.categories )
    df2.x = pd.Categorical( df2.x, categories=uc.categories )
    

    Concatenate and verify the dtype is categorical.

    df3 = pd.concat([df1,df2])
    
    df3.x.dtypes
    category
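
    For convenience, here are the steps above as one self-contained script, including the cat.codes check. The ignore_index=True argument is an addition that is not in the snippets above; it only renumbers the rows of the result.

```python
import pandas as pd
from pandas.api.types import union_categoricals

df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})

# Build the union of categories and re-encode both columns against it
uc = union_categoricals([df1['x'], df2['x']])
df1['x'] = pd.Categorical(df1['x'], categories=uc.categories)
df2['x'] = pd.Categorical(df2['x'], categories=uc.categories)

# Both columns now share one category list, so a given label
# (for example 'cat') maps to the same integer code in each frame
same_categories = list(df1['x'].cat.categories) == list(df2['x'].cat.categories)
codes1 = df1['x'].cat.codes.tolist()
codes2 = df2['x'].cat.codes.tolist()
print(same_categories, codes1, codes2)

# With matching categories, concat keeps the categorical dtype
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3['x'].dtype)
```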
    

    As @C8H10N4O2 suggests, you could also just coerce from objects back to categoricals after concatenating. Honestly, for smaller datasets I think that's the best way to do it just because it's simpler. But for larger dataframes, using union_categoricals should be much more memory efficient.
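
    A minimal sketch of that simpler route, using the same sample data: concatenate first, let the column fall back to a non-categorical dtype, then cast it back with astype('category'). The variable names dtype_before and dtype_after are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})

# The categories differ, so the concatenated column is no longer categorical
df3 = pd.concat([df1, df2], ignore_index=True)
dtype_before = str(df3['x'].dtype)

# Coerce back; the new categories are inferred from the values present
df3['x'] = df3['x'].astype('category')
dtype_after = str(df3['x'].dtype)

print(dtype_before, '->', dtype_after)
```

    The intermediate non-categorical column is what costs memory on large frames, which is the trade-off mentioned above.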

  • 2020-12-16 17:01

    To complement JohnE's answer, here's a function that does the job: for every categorical column present in all input dataframes, it re-encodes the column with the union of its categories before concatenating:

    def concatenate(dfs):
        """Concatenate while preserving categorical columns.
    
        NB: We change the categories in-place for the input dataframes"""
        from pandas.api.types import union_categoricals
        import pandas as pd
        # Iterate on categorical columns common to all dfs
        for col in set.intersection(
            *[
                set(df.select_dtypes(include='category').columns)
                for df in dfs
            ]
        ):
            # Generate the union category across dfs for this column
            uc = union_categoricals([df[col] for df in dfs])
            # Change to union category for all dataframes
            for df in dfs:
                df[col] = pd.Categorical(df[col].values, categories=uc.categories)
        return pd.concat(dfs)
    

    Note that the categories of the dataframes in the input list are changed in place, so df1 and df2 below are modified by the call:

    df1=pd.DataFrame({'a': [1, 2],
                      'x':pd.Categorical(['dog','cat']),
                      'y': pd.Categorical(['banana', 'bread'])})
    df2=pd.DataFrame({'x':pd.Categorical(['rat']),
                      'y': pd.Categorical(['apple'])})
    
    concatenate([df1, df2]).dtypes
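
    If modifying the inputs is not acceptable, one possible variation is to work on copies. The sketch below is not part of the answer above; the name concatenate_copy is hypothetical, and it assumes copying each dataframe is affordable.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def concatenate_copy(dfs):
    """Concatenate while preserving categorical columns, leaving inputs untouched."""
    # Work on copies so the caller's dataframes keep their original categories
    dfs = [df.copy() for df in dfs]
    # Categorical columns common to all dataframes
    common = set.intersection(
        *[set(df.select_dtypes(include='category').columns) for df in dfs]
    )
    for col in common:
        # Union of categories across dataframes for this column
        uc = union_categoricals([df[col] for df in dfs])
        for df in dfs:
            df[col] = pd.Categorical(df[col], categories=uc.categories)
    return pd.concat(dfs)


df1 = pd.DataFrame({'a': [1, 2],
                    'x': pd.Categorical(['dog', 'cat']),
                    'y': pd.Categorical(['banana', 'bread'])})
df2 = pd.DataFrame({'x': pd.Categorical(['rat']),
                    'y': pd.Categorical(['apple'])})

out = concatenate_copy([df1, df2])
print(out.dtypes)
# df1 still has only its own two categories for 'x'
print(list(df1['x'].cat.categories))
```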
    