Question
I've got too many features in a data frame. I'm trying to plot ONLY the features which are correlated over a certain threshold, let's say over 80%, and show those in a heatmap. I put some code together, and it runs, but I still see some white lines, which have no data, and thus no correlation. Also, I'm seeing things that are well under 80% correlation. Here is the code that I tried.
import numpy as np
import matplotlib.pyplot as plt
import seaborn
c = newdf.corr()
plt.figure(figsize=(10,10))
seaborn.heatmap(c, cmap='RdYlGn_r', mask = (np.abs(c) >= 0.8))
plt.show()
When I run that, I see this.
What is wrong here?
I am making a small update, with some new findings.
This gets ONLY corr>.8.
import seaborn as sns
corr = newdf.corr()
kot = corr[corr>=.8]
plt.figure(figsize=(12,8))
sns.heatmap(kot, cmap="Reds")
That seems to work, but it still gives me a lot of white! I thought there should be a way to include only the items that have a correlation over a certain amount. Maybe you have to copy the items with a correlation above .8 to a new data frame and build the correlation off of that object. I'm not sure how this works.
Answer 1:
The following code groups the strongly correlated features (with correlation above 0.8 in magnitude) into connected components and plots the correlation matrix of each component individually. Please let me know if it differs from what you want.
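import matplotlib.pyplot as plt
import seaborn as sns

corr = newdf.corr()
threshold = 0.8  # minimum |correlation| for two features to be linked

components = []
visited = set()
for col in corr.columns:
    if col in visited:
        continue
    # Breadth-first search over features linked by a strong correlation
    component = {col}
    just_visited = [col]
    visited.add(col)
    while just_visited:
        c = just_visited.pop(0)
        for idx, val in corr[c].items():
            if abs(val) > threshold and idx not in visited:
                just_visited.append(idx)
                visited.add(idx)
                component.add(idx)
    components.append(component)

for component in components:
    # .loc needs a list (not a set); keep the original column order
    cols = [c for c in corr.columns if c in component]
    plt.figure(figsize=(12,8))
    sns.heatmap(corr.loc[cols, cols], cmap="Reds")
plt.show()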
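If a single heatmap is preferred over one plot per component, the idea from the question (copy only the strongly correlated items into a new object) could be sketched roughly as below. This is only one possible approach, not part of the answer above. The helper names `strongly_correlated_submatrix` and `plot_strong_correlations` are made up for illustration. The sketch also assumes that the stray low-correlation cells in the first snippet come from how seaborn treats `mask`: cells where the mask is True are hidden, so the mask should mark the weak cells rather than the strong ones.

```python
import numpy as np
import pandas as pd


def strongly_correlated_submatrix(df, threshold=0.8):
    """Correlation matrix restricted to features that have at least one
    OTHER feature with |correlation| >= threshold (hypothetical helper)."""
    corr = df.corr()
    # Zero the diagonal so a feature's perfect self-correlation is ignored.
    off_diag = corr.abs().to_numpy().copy()
    np.fill_diagonal(off_diag, 0.0)
    keep = corr.columns[(off_diag >= threshold).any(axis=0)]
    return corr.loc[keep, keep]


def plot_strong_correlations(df, threshold=0.8):
    """Plot the reduced matrix, hiding cells below the threshold."""
    # Imported here so the helper above works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sub = strongly_correlated_submatrix(df, threshold)
    plt.figure(figsize=(12, 8))
    # seaborn does not draw cells where mask is True, so mask the weak ones.
    sns.heatmap(sub, cmap="RdYlGn_r", mask=sub.abs() < threshold,
                vmin=-1, vmax=1)
    plt.show()
```

Calling `plot_strong_correlations(newdf)` would then drop every row and column that has no strong partner, which should remove the fully white lines. Blank cells can still appear inside the plot where two kept features are only weakly correlated with each other.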
Source: https://stackoverflow.com/questions/64019509/how-can-we-show-only-features-that-are-correlated-over-a-certain-threshold-in-a