How to correlate an Ordinal Categorical column in pandas?

长情又很酷 2021-01-31 19:13

I have a DataFrame df with a non-numerical column CatColumn.

   A         B         CatColumn
0  381.1396  7.343921  Medium
1  481.3268         


        
3 Answers
  • 2021-01-31 19:47

    Basically, there is no good scientific way to do it. I would use the following approach: 1. Split the numeric field into n groups, where n is the number of groups in the categorical field. 2. Calculate Cramér's V between the two categorical fields.
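
    The two steps above can be sketched roughly as follows. This is only an illustration: the DataFrame values are made up (the question's table is truncated), `cramers_v` is a hypothetical helper written here without bias correction, and quantile binning via `pd.qcut` is one possible way to split the numeric field.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the question's table is truncated, so these values are made up.
df = pd.DataFrame({
    "A": [100.1, 105.3, 200.4, 150.2, 110.0, 190.5],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High", "Medium", "High"],
})


def cramers_v(x, y):
    """Cramér's V between two categorical series (hypothetical helper, no bias correction)."""
    table = pd.crosstab(x, y).to_numpy()
    n = table.sum()
    # Expected counts under independence: outer product of the margins divided by n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Step 1: bin the numeric column into as many quantile groups as there are categories
n_groups = df["CatColumn"].nunique()
a_binned = pd.qcut(df["A"], q=n_groups, labels=False)

# Step 2: Cramér's V between the binned numeric column and the categorical column
v = cramers_v(a_binned, df["CatColumn"])
print(v)
```

    Note that Cramér's V lies between 0 and 1 and carries no sign, so it measures association but says nothing about direction.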

  • 2021-01-31 19:51

    The right way to correlate a categorical column with N values is to split this column into N separate boolean columns.

    Let's take the DataFrame from the original question and make one column per category:

    # Loop over the distinct category values rather than over every row
    for i in df.CatColumn.astype('category').cat.categories:
        df[i] = df.CatColumn == i
    

    Then it is possible to calculate the correlation between every category and other columns:

    # numeric_only=True (pandas >= 1.5) skips the string column CatColumn
    df.corr(numeric_only=True)
    

    Output:

                        A         B    Medium      High  Medium-High
    A            1.000000  0.490608  0.914322 -0.312309    -0.743459
    B            0.490608  1.000000  0.343620  0.548589    -0.945367
    Medium       0.914322  0.343620  1.000000 -0.577350    -0.577350
    High        -0.312309  0.548589 -0.577350  1.000000    -0.333333
    Medium-High -0.743459 -0.945367 -0.577350 -0.333333     1.000000
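
    The same indicator columns can probably be built in one call with `pd.get_dummies`. The sketch below uses a made-up stand-in for the question's truncated DataFrame; the empty `prefix` and `prefix_sep` are chosen only so the new columns are named after the categories themselves.

```python
import pandas as pd

# Hypothetical stand-in for the question's truncated DataFrame
df = pd.DataFrame({
    "A": [100.1, 105.3, 200.4, 150.2],
    "B": [1.34, 1.79, 9.63, 4.23],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# One 0/1 indicator column per category, replacing CatColumn
encoded = pd.get_dummies(
    df, columns=["CatColumn"], prefix="", prefix_sep="", dtype=float
)

# Correlation of every numeric column with every category indicator
corr = encoded.corr()
print(corr.loc[["A", "B"], ["High", "Medium", "Medium-High"]])
```

    Each indicator is correlated separately, so this treats the categories as unordered labels.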
    
  • 2021-01-31 20:01

    I am going to strongly disagree with the other comments.

    They miss the main point of correlation: How much does variable 1 increase or decrease as variable 2 increases or decreases. So in the very first place, order of the ordinal variable must be preserved during factorization/encoding. If you alter the order of variables, correlation will change completely. If you are building a tree-based method, this is a non-issue but for a correlation analysis, special attention must be paid to preservation of order in an ordinal variable.

    Let me make my argument reproducible. A and B are numeric, C is ordinal categorical in the following table, which is intentionally slightly altered from the one in the question.

    from io import StringIO
    import pandas as pd

    rawText = StringIO("""
     A         B         C
    0  100.1396  1.343921  Medium
    1  105.3268  1.786945  Medium
    2  200.3766  9.628746  High
    3  150.2400  4.225647  Medium-High
    """)
    myData = pd.read_csv(rawText, sep=r"\s+")
    

    Notice: as C moves from Medium to Medium-High to High, both A and B increase monotonically. Hence we should see strong correlations for the pairs (C, A) and (C, B). Let's reproduce the two proposed answers:

    In[226]: myData.assign(C=myData.C.astype('category').cat.codes).corr()
    Out[226]: 
              A         B         C
    A  1.000000  0.986493 -0.438466
    B  0.986493  1.000000 -0.579650
    C -0.438466 -0.579650  1.000000
    

    Wait... What? Negative correlations? How come? Something is definitely not right. So what is going on?

    What is going on is that C is factorized according to the alphanumerical sorting of its values. [High, Medium, Medium-High] are assigned [0, 1, 2], therefore the ordering is altered: 0 < 1 < 2 implies High < Medium < Medium-High, which is not true. Hence we accidentally calculated the response of A and B as C goes from High to Medium to Medium-High. The correct answer must preserve ordering, and assign [2, 0, 1] to [High, Medium, Medium-High]. Here is how:

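    In[227]: myData['C'] = myData['C'].astype('category')
    # Direct assignment to .cat.categories no longer works in recent pandas; use rename_categories
    myData['C'] = myData['C'].cat.rename_categories({'High': 2, 'Medium': 0, 'Medium-High': 1})
    myData['C'] = myData['C'].astype('float')
    myData.corr()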
    Out[227]: 
              A         B         C
    A  1.000000  0.986493  0.998874
    B  0.986493  1.000000  0.982982
    C  0.998874  0.982982  1.000000
    

    Much better!
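
    An alternative that avoids hand-assigning numbers is to declare the order once with an ordered categorical and read off its codes. This is a sketch on the same four-row table, not the only way to do it; the names `levels`, `ordered_c` and `encoded` are mine. A Spearman correlation is included as well because it uses ranks only, which some would consider a more natural fit for ordinal data.

```python
from io import StringIO
import pandas as pd

raw_text = StringIO("""
 A         B         C
0  100.1396  1.343921  Medium
1  105.3268  1.786945  Medium
2  200.3766  9.628746  High
3  150.2400  4.225647  Medium-High
""")
my_data = pd.read_csv(raw_text, sep=r"\s+")

# State the order explicitly instead of relying on alphabetical sorting
levels = ["Medium", "Medium-High", "High"]
ordered_c = pd.Categorical(my_data["C"], categories=levels, ordered=True)

# .codes follows the declared order: Medium -> 0, Medium-High -> 1, High -> 2
encoded = my_data.assign(C=ordered_c.codes)

pearson = encoded.corr()
spearman = encoded.corr(method="spearman")  # rank-based correlation
print(pearson)
print(spearman)
```

    The Pearson matrix should match the one above, since the codes are the same [0, 1, 2] assignment.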

    Note1: If you want to treat your variable as a nominal variable, you can look at things like contingency tables, Cramer's V and the like; or group the continuous variable by the nominal categories etc. I don't think it would be right, though.

    Note2: If you had another category called Low, my answer could be criticized because I assigned equally spaced numbers to unequally spaced categories. You could argue that one should assign [2, 1, 1.5, 0] to [High, Medium, Medium-High, Low], which would be valid. I believe this is what people call the art part of data science.
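
    To see how much the spacing choice matters, one can compare both encodings side by side. The data below is hypothetical (a Low row is added for illustration), and `spearman` is a small helper defined here. Pearson changes with the spacing, while a rank-based Spearman correlation does not, as long as the order is preserved.

```python
import pandas as pd

# Hypothetical data that includes the extra Low level discussed above
df = pd.DataFrame({
    "A": [50.0, 100.1, 105.3, 150.2, 200.4],
    "C": ["Low", "Medium", "Medium", "Medium-High", "High"],
})

# Two candidate encodings: equal spacing vs. the uneven spacing from the note
equal_spacing = {"Low": 0, "Medium": 1, "Medium-High": 2, "High": 3}
uneven_spacing = {"Low": 0, "Medium": 1, "Medium-High": 1.5, "High": 2}

c_equal = df["C"].map(equal_spacing)
c_uneven = df["C"].map(uneven_spacing)


def spearman(x, y):
    # Spearman's rho is Pearson's r computed on the ranks
    return x.rank().corr(y.rank())


# Pearson depends on the chosen spacing
pearson_equal = df["A"].corr(c_equal)
pearson_uneven = df["A"].corr(c_uneven)

# Spearman only sees the ordering, so both encodings give the same value
spearman_equal = spearman(df["A"], c_equal)
spearman_uneven = spearman(df["A"], c_uneven)
print(pearson_equal, pearson_uneven, spearman_equal, spearman_uneven)
```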
