Question

Suppose I have a pandas DataFrame with MultiIndex columns, like so:

import pandas as pd
iterables = [['a', 'b'], ['1', '2']]
my_index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=my_index)

Then df produces

first  a   b
second 1 2 1 2
0      1 2 3 4
1      5 6 7 8

Now if I want the correlation of df['a'] with itself, that's straightforward: df['a'].corr() gets me that. Note that the result has shape (2, 2).

What I would like to do is calculate the correlation matrix of df['a'] with df['b']. Supposedly, the code df['a'].corrwith(df['b']) should give me this. This code does run, but the result has shape (2,), which doesn't look right to me. Why should the self-correlation matrix given by .corr() give a result with a different shape than a correlation given by .corrwith()? I need a correlation matrix of the same shape as df['a'].corr(), because I want to plot Seaborn heatmaps, and I need the 2D correlation matrix.

Thanks in advance for your time!

Answer 1:

You want to call corr() on the whole DataFrame, rather than corrwith() on its sub-frames.

It would look like:

In [1]:
# Create the Dataframe
import pandas as pd
iterables = [['a', 'b'], ['1', '2']]
my_index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=my_index)
df

Out [1]:
first     a       b
second  1   2   1   2
0       1   2   3   4
1       5   6   7   8

In [2]:
## Get the correlation matrix
df.corr()

Out [2]:
first            a         b
second           1    2    1    2
first second
a     1        1.0  1.0  1.0  1.0
      2        1.0  1.0  1.0  1.0
b     1        1.0  1.0  1.0  1.0
      2        1.0  1.0  1.0  1.0
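
For context on the shapes in the question, here is a minimal sketch using the same df. As far as I understand corrwith(), it pairs columns that share a label ('1' with '1', '2' with '2') and returns one coefficient per pair, which is why its result is a Series of shape (2,) rather than a matrix. The variable names self_corr, pairwise, and full are only illustrative.

```python
import pandas as pd

# Same frame as in the question
iterables = [['a', 'b'], ['1', '2']]
my_index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=my_index)

# Every column of df['a'] against every column of df['a']: a 2x2 DataFrame
self_corr = df['a'].corr()

# Only label-matched pairs (a/1 with b/1, a/2 with b/2): a length-2 Series
pairwise = df['a'].corrwith(df['b'])

# Every column of df against every other column: a 4x4 DataFrame
full = df.corr()

print(self_corr.shape, pairwise.shape, full.shape)
```

The full 4x4 matrix therefore contains the a-versus-b block the question asks about, alongside the two self-correlation blocks.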

EDIT

From the documentation: you can choose the correlation measure through the method parameter.

method : {‘pearson’, ‘kendall’, ‘spearman’} or callable

pearson : standard correlation coefficient

kendall : Kendall Tau correlation coefficient

spearman : Spearman rank correlation

callable : callable with input two 1d ndarrays
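
To illustrate the method parameter, here is a small sketch on made-up data (the frame data, its columns x and y, and the helper numpy_pearson are all hypothetical, not part of the question). It compares Pearson and Spearman on a monotonic but non-linear relationship and passes a callable that simply wraps np.corrcoef.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y is a monotonic, non-linear function of x
rng = np.random.default_rng(1)
x = rng.normal(size=100)
data = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson = data.corr(method='pearson')    # linear association
spearman = data.corr(method='spearman')  # rank-based association


def numpy_pearson(u, v):
    """Hypothetical callable: takes two 1d ndarrays, returns a float."""
    return np.corrcoef(u, v)[0, 1]


custom = data.corr(method=numpy_pearson)

print(pearson.loc['x', 'y'], spearman.loc['x', 'y'], custom.loc['x', 'y'])
```

Because the ranks of x and exp(x) coincide, the Spearman coefficient should be 1 here, while the Pearson coefficient is lower. The callable variant should reproduce the Pearson matrix.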

Answer 2:

The key to this problem was to recognize that the result of the DataFrame .corr() method is itself a pandas DataFrame. If we run the code in the question and then use the .loc indexer, we can select a sub-block of the correlation matrix. The result of df.corr() is

first            a         b
second           1    2    1    2
first second
a     1        1.0  1.0  1.0  1.0
      2        1.0  1.0  1.0  1.0
b     1        1.0  1.0  1.0  1.0
      2        1.0  1.0  1.0  1.0

and the result of df.corr().loc['a', 'b'] is

second  1    2
second          
1       1.0  1.0
2       1.0  1.0

This is what I wanted.
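
With only two rows, as in the question, every coefficient comes out as 1.0, so the selection is hard to verify by eye. The sketch below repeats the same .loc['a', 'b'] selection on random data (an assumption made purely for illustration) and checks one entry against np.corrcoef. The names full and cross are hypothetical.

```python
import numpy as np
import pandas as pd

# Same column layout as the question, but with 50 random rows (illustrative data)
rng = np.random.default_rng(0)
iterables = [['a', 'b'], ['1', '2']]
my_index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=my_index)

full = df.corr()            # 4x4 matrix over all columns
cross = full.loc['a', 'b']  # rows from group 'a', columns from group 'b'

# Cross-check a single entry: corr(a/1, b/2)
expected = np.corrcoef(df[('a', '1')], df[('b', '2')])[0, 1]

print(cross.shape)
print(np.isclose(cross.loc['1', '2'], expected))
```

The resulting cross is an ordinary 2D DataFrame, so it should be usable wherever df['a'].corr() was, for example as the data argument to a Seaborn heatmap.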

Source: https://stackoverflow.com/questions/57513002/how-do-you-calculate-a-non-self-correlation-matrix-in-pandas-with-multicolumns

Tags

python-3.x

pandas

dataframe

correlation

multi-index

How do you calculate a (non-self) correlation matrix in pandas with multicolumns?
