Question
Suppose I have a pandas DataFrame with MultiIndex columns, like so:
import pandas as pd
iterables = [['a', 'b'], ['1', '2']]
my_index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=my_index)
Then df produces
first   a     b
second  1  2  1  2
0       1  2  3  4
1       5  6  7  8
Now if I want the correlation of df['a'] with itself, that's straightforward: df['a'].corr() gets me that. Note that such a correlation matrix has shape (2, 2).
What I would like to do is calculate the correlation matrix of df['a'] with df['b']. Supposedly, the code df['a'].corrwith(df['b']) should give me this. This code does run, but the result has shape (2,), which doesn't look right to me. Why should the self-correlation matrix given by .corr() have a different shape than the correlation given by .corrwith()? I need a correlation matrix of the same shape as df['a'].corr(), because I want to plot Seaborn heatmaps, which require a 2D correlation matrix.
Thanks in advance for your time!
Answer 1:
You want to call corr() on the full DataFrame rather than on the df['a'] subset.
It would look like:
In [1]:
# Create the Dataframe
import pandas as pd
iterables = [['a', 'b'], ['1', '2']]
my_index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=my_index)
df
Out [1]:
first   a     b
second  1  2  1  2
0       1  2  3  4
1       5  6  7  8
In [2]:
## Get the correlation matrix
df.corr()
Out [2]:
first            a         b
second           1    2    1    2
first second
a     1        1.0  1.0  1.0  1.0
      2        1.0  1.0  1.0  1.0
b     1        1.0  1.0  1.0  1.0
      2        1.0  1.0  1.0  1.0
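To make the shape difference from the question concrete, here is a minimal sketch using the same frame. It relies on corrwith pairing columns by label and returning one coefficient per matching pair, while corr compares every column with every other column; the variable names pairwise and full are only illustrative.

```python
import pandas as pd

# Same frame as in the question.
iterables = [['a', 'b'], ['1', '2']]
my_index = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=my_index)

# corrwith aligns columns by label: ('a', '1') is paired with ('b', '1'),
# and ('a', '2') with ('b', '2'). One coefficient per pair gives a Series.
pairwise = df['a'].corrwith(df['b'])

# corr on the full frame correlates every column with every other column.
full = df.corr()

print(type(pairwise).__name__, pairwise.shape)  # Series (2,)
print(type(full).__name__, full.shape)          # DataFrame (4, 4)
```

So the (2,) result is not a truncated matrix; it holds only the label-matched pairs, which is why the full df.corr() is the better starting point here.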
EDIT
From the documentation: you can choose the correlation function with the method parameter.
method : {'pearson', 'kendall', 'spearman'} or callable
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable : callable with input two 1d ndarrays
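As a rough illustration of the callable option, the sketch below passes a hand-written function to corr and checks that it agrees with the built-in 'pearson' method. The function name pearson_by_hand and the four-row data frame are hypothetical, chosen only so the coefficients are not all 1.0.

```python
import numpy as np
import pandas as pd

def pearson_by_hand(x, y):
    """Correlation of two 1-D ndarrays, the form corr passes to a callable."""
    return np.corrcoef(x, y)[0, 1]

# Hypothetical data with the same column layout as the question's frame.
columns = pd.MultiIndex.from_product(
    [['a', 'b'], ['1', '2']], names=['first', 'second']
)
data = pd.DataFrame(
    [[1, 2, 3, 4], [5, 3, 7, 1], [2, 8, 6, 5], [9, 4, 0, 7]],
    columns=columns,
)

builtin = data.corr(method='pearson')
custom = data.corr(method=pearson_by_hand)

# Both routes should produce the same 4x4 matrix up to rounding.
agree = np.allclose(builtin.to_numpy(), custom.to_numpy())
print(agree)
```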
Answer 2:
The key to this problem was to recognize that the result of the DataFrame .corr() method is itself a pandas DataFrame. If we run the code in the question and then use the .loc indexer, we can select a subset of the correlation matrix. The result of df.corr() is
first            a         b
second           1    2    1    2
first second
a     1        1.0  1.0  1.0  1.0
      2        1.0  1.0  1.0  1.0
b     1        1.0  1.0  1.0  1.0
      2        1.0  1.0  1.0  1.0
and the result of df.corr().loc['a', 'b'] is
second    1    2
second
1       1.0  1.0
2       1.0  1.0
This is what I wanted.
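If the full matrix is large and only the cross block is needed, the block can also be computed directly. The following is a minimal sketch, not part of the original answer: cross_corr is a hypothetical helper that assumes both frames share the same rows and contain no missing values, and the four-row data frame is made up for illustration. It is checked against the .loc approach above.

```python
import numpy as np
import pandas as pd

def cross_corr(left, right):
    """Pearson correlation of each column in `left` with each column in `right`.

    Hypothetical helper: assumes identical row index and no NaN values.
    """
    # Standardize each column using the population standard deviation.
    lz = (left - left.mean()) / left.std(ddof=0)
    rz = (right - right.mean()) / right.std(ddof=0)
    # The mean of the product of z-scores is the Pearson coefficient.
    values = lz.to_numpy().T @ rz.to_numpy() / len(left)
    return pd.DataFrame(values, index=left.columns, columns=right.columns)

# Hypothetical data with the same column layout as the question's frame.
columns = pd.MultiIndex.from_product(
    [['a', 'b'], ['1', '2']], names=['first', 'second']
)
data = pd.DataFrame(
    [[1, 2, 3, 4], [5, 3, 7, 1], [2, 8, 6, 5], [9, 4, 0, 7]],
    columns=columns,
)

cross = cross_corr(data['a'], data['b'])
expected = data.corr().loc['a', 'b']
print(cross.shape)
```

Either result is a 2D DataFrame, so it can be passed to seaborn.heatmap for the plot mentioned in the question.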
Source: https://stackoverflow.com/questions/57513002/how-do-you-calculate-a-non-self-correlation-matrix-in-pandas-with-multicolumns