I have a DataFrame df with a non-numerical column CatColumn.
A B CatColumn
0 381.1396 7.343921 Medium
1 481.3268
The right way to correlate a categorical column with N values is to split this column into N separate boolean columns.
Let's take the DataFrame from the original question and create one boolean column per category:
for i in df.CatColumn.unique():
    # True where the row belongs to category i, False otherwise
    df[i] = df.CatColumn == i
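Then it is possible to calculate the correlation between every category and the other columns. The original string column is dropped first, because recent pandas versions raise an error when corr() is given non-numeric data:
df.drop(columns='CatColumn').corr()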
Output:
                    A         B    Medium      High  Medium-High
A            1.000000  0.490608  0.914322 -0.312309    -0.743459
B            0.490608  1.000000  0.343620  0.548589    -0.945367
Medium       0.914322  0.343620  1.000000 -0.577350    -0.577350
High        -0.312309  0.548589 -0.577350  1.000000    -0.333333
Medium-High -0.743459 -0.945367 -0.577350 -0.333333     1.000000
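The same split into indicator columns can also be sketched with pd.get_dummies instead of an explicit loop. This is a minimal, self-contained illustration rather than the method used above: the data and the category labels (Low, Medium, High) are made up for the example and are not the DataFrame from the question, and the name encoded is arbitrary.

```python
import pandas as pd

# Hypothetical data for illustration only; not the DataFrame from the question.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "CatColumn": ["Low", "Low", "Medium", "Medium", "High", "High"],
})

# One indicator column per category, stored as floats (1.0 / 0.0)
# so that corr() treats them like any other numeric column.
dummies = pd.get_dummies(df["CatColumn"], dtype=float)

# Drop the original string column and attach the indicator columns.
encoded = pd.concat([df.drop(columns="CatColumn"), dummies], axis=1)

# Pairwise Pearson correlation between A, B and every category indicator.
corr = encoded.corr()
print(corr.round(3))
```

Because the indicators are built on a copy, df itself keeps its original columns. Note that the indicator columns are always negatively correlated with each other, since a row can belong to only one category.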