Plotting three dimensions of categorical data in Python

Question

My data has three categorical variables I'm trying to visualize:

City (one of five)
Occupation (one of four)
Blood type (one of four)

So far, I've succeeded in grouping the data in a way that I think will be easy to work with:

import numpy as np, pandas as pd

# Make data
cities = ['Tijuana','Las Vegas','Los Angeles','Anaheim','Atlantis']
occupations = ['Doctor','Lawyer','Engineer','Drone security officer']
bloodtypes = ['A','B','AB','O']
df = pd.DataFrame({'City': np.random.choice(cities,500),
                   'Occupation': np.random.choice(occupations,500),
                   'Blood Type':np.random.choice(bloodtypes,500)})

# You need to make a dummy column, otherwise the groupby returns an empty df
df['Dummy'] = np.ones(500)

# This is now what I'd like to plot
df.groupby(by=['City','Occupation','Blood Type']).count().unstack(level=1)

Returns:

                       Dummy
Occupation             Doctor Drone security officer Engineer Lawyer
City        Blood Type
Anaheim     A               7                      7        7      7
            AB              6                     10        8      5
            B               2                     10        4      2
            O               4                      3        3      6
Atlantis    A               6                      5        5      7
            AB             12                      7        7     10
            B               7                      4        7      3
            O               7                      4        6      4
Las Vegas   A               8                      4        8      5
            AB              5                      6        8      9
            B               6                     10        6      6
            O               6                      9        5      9
Los Angeles A               7                      4        8      8
            AB              9                      8        8      8
            B               3                      6        4      1
            O               9                     11       11      9
Tijuana     A               3                      4        5      3
            AB              9                      5        5      7
            B               3                      6        4      9
            O               3                      5        5      8
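
As a side note, the dummy column is probably avoidable. The sketch below assumes the same synthetic data as above and counts rows per group with size(), which does not need a spare column to count. The fixed seed, the variable name counts, and the fill_value=0 argument are additions for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd

# Same synthetic data as above; the seed is an addition for reproducibility
rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})

# size() counts the rows in each group, so no dummy column is required.
# fill_value=0 shows combinations that never occur as 0 instead of NaN.
counts = (df.groupby(['City', 'Occupation', 'Blood Type'])
            .size()
            .unstack(level='Occupation', fill_value=0))
print(counts)
```

The result has the same layout as the table above (City and Blood Type in the index, Occupation across the columns), just without the extra 'Dummy' column level.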

My goal is to create something like the Seaborn swarmplot shown below, which comes from the Seaborn documentation. Seaborn applies jitter to the quantitative data so that you can see the individual data points and their hues:

With my data, I'd like to plot City on the x-axis and Occupation on the y-axis, applying jitter to each, and then hue by Blood type. However, sns.swarmplot requires one of the axes to be quantitative:

sns.swarmplot(data=df,x='City',y='Occupation',hue='Blood Type')

returns an error.
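
One possible workaround, sketched here under stated assumptions, is to do the jitter by hand: map each category to an integer position, add a small random offset on both axes, and draw an ordinary matplotlib scatter plot colored by blood type. This is a plain jittered scatter, not a true swarm (points can still overlap), and the jitter width of 0.3, the marker size, and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same synthetic data as in the question; the seed is an addition
rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})

# Integer position of each category plus uniform jitter in [-0.3, 0.3]
x = pd.Categorical(df['City'], categories=cities).codes \
    + rng.uniform(-0.3, 0.3, len(df))
y = pd.Categorical(df['Occupation'], categories=occupations).codes \
    + rng.uniform(-0.3, 0.3, len(df))

fig, ax = plt.subplots(figsize=(10, 6))

# One scatter call per blood type so each gets its own color and legend entry
for bt in bloodtypes:
    mask = (df['Blood Type'] == bt).to_numpy()
    ax.scatter(x[mask], y[mask], s=15, label=bt)

# Put the category names back on the integer tick positions
ax.set_xticks(range(len(cities)))
ax.set_xticklabels(cities)
ax.set_yticks(range(len(occupations)))
ax.set_yticklabels(occupations)
ax.set_xlabel('City')
ax.set_ylabel('Occupation')
ax.legend(title='Blood type', bbox_to_anchor=(1.02, 1), loc='upper left')
fig.tight_layout()
```

Each City/Occupation intersection then shows up as a small cloud of points, and the mix of colors in a cloud gives a rough impression of the blood type distribution there.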

An acceptable alternative might be to create 20 categorical bar plots, one for each intersection of City and Occupation, which I would do by running a for loop over each category, but I can't imagine how I'd feed that to matplotlib subplots to get them in a 4x5 grid.
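
For the grid of bar plots, Seaborn's catplot may save the explicit loop: with kind='count' it draws one count plot per facet and lays the facets out by row and column. The following is a minimal sketch, assuming the same synthetic data as above; the height and aspect values, the title template, and the seed are illustrative choices only. Note that it puts cities on rows and occupations on columns, so the grid is 5 rows by 4 columns.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Same synthetic data as above; the seed is an addition
rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})

# One count plot of blood types per (City, Occupation) facet
g = sns.catplot(data=df, x='Blood Type', kind='count',
                row='City', col='Occupation',
                order=bloodtypes, row_order=cities, col_order=occupations,
                height=2, aspect=1.2)
g.set_titles('{row_name} | {col_name}')
```

catplot returns a FacetGrid, so the individual subplots remain reachable through g.axes if further per-panel tweaking is needed.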

The most similar question I could find was in R, and the asker only wanted to indicate the most common value for the third variable, so I didn't get any good ideas from there.

Thanks for any help you can provide.

Answer 1:

Alright, I got to work on the "acceptable alternative" today and I have found a solution using basically pure matplotlib (but I stuck the Seaborn styling on top of it, just because).

import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import get_cmap  # matplotlib.cm.get_cmap was removed in Matplotlib 3.9
from matplotlib.patches import Patch
import seaborn as sns

# Make data
cities = ['Tijuana','Las Vegas','Los Angeles','Anaheim','Atlantis']
occupations = ['Doctor','Lawyer','Engineer','Drone security officer']
bloodtypes = ['A','B','AB','O']
df = pd.DataFrame({'City': np.random.choice(cities,500),
                   'Occupation': np.random.choice(occupations,500),
                   'Blood Type':np.random.choice(bloodtypes,500)})

# Make a dummy column, otherwise the groupby returns an empty df
df['Dummy'] = np.ones(500)

# This is now what I'd like to plot
grouped = df.groupby(by=['City','Occupation','Blood Type']).count().unstack()

# List of blood types, to use later as categories in subplots
kinds = grouped.columns.levels[1]

# colors for bar graph
colors = [get_cmap('viridis')(v) for v in np.linspace(0,1,len(kinds))]

sns.set(context="talk")
nxplots = len(grouped.index.levels[0])
nyplots = len(grouped.index.levels[1])
fig, axes = plt.subplots(nxplots,
                         nyplots,
                         sharey=True,
                         sharex=True,
                         figsize=(10,12))

fig.suptitle('City, occupation, and blood type')

# plot the data
for a, b in enumerate(grouped.index.levels[0]):
    for i, j in enumerate(grouped.index.levels[1]):
        axes[a,i].bar(kinds,grouped.loc[b,j],color=colors)
        axes[a,i].xaxis.set_ticks([])

axeslabels = fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor='none', top=False, bottom=False, left=False, right=False)
plt.grid(False)
axeslabels.set_ylabel('City',rotation='horizontal',y=1,weight="bold")
axeslabels.set_xlabel('Occupation',weight="bold")

# x- and y-axis labels: occupations under the bottom row, cities beside the first column
for i, j in enumerate(grouped.index.levels[1]):
    axes[-1,i].set_xlabel(j)
for i, j in enumerate(grouped.index.levels[0]):
    axes[i,0].set_ylabel(j)

# Tune this manually to make room for the legend
fig.subplots_adjust(right=0.82)

fig.legend([Patch(facecolor = i) for i in colors],
           kinds,
           title="Blood type",
           loc="center right")

Returns this:

I'd appreciate any feedback, and I'd still love it if someone could provide the preferred solution.

Source: https://stackoverflow.com/questions/58303175/plotting-three-dimensions-of-categorical-data-in-python

Tags

python

pandas

seaborn