Question

How can I easily compare the distributions of multiple cohorts?

Usually, seaborn.distplot (https://seaborn.pydata.org/generated/seaborn.distplot.html) would be a great tool for comparing distributions visually. However, because of the size of my dataset, I had to compress it and keep only the counts.

The compressed data was created as:

SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender

where compress_distributionUDF simply takes a list of tuples and returns the counts per group.

This leaves me with a list of

Row(distribution_value=60.0, count=314251, target_y_n=0)

nested inside a pandas.Series, with one such list per cohort.

Basically, it is similar to:

pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})

and I wonder how to compare the distributions:

- within a cohort (target_y_n 0 vs. 1)
- across multiple cohorts

in a way that is still visually understandable rather than a cluttered mess.

Edit

For a single cohort, "Plotting pre aggregated data in python" could be the answer. But how can multiple cohorts be compared without simply looping over them, since a loop produces too many plots to compare?

Answer 1:

I am still not sure I fully understand the setup, but we can start from here and see where it goes. From your example, I am focusing on baz, since it is not clear to me what foo and bar are (I assume they identify the cohorts).
So let's focus on baz and plot the distributions split by target_y_n.
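The plotting calls in this answer index df by value, count and target_y_n, so they expect a flat frame rather than the nested baz column from the question. A minimal sketch of one way to get there, assuming each baz entry is a single dict as in the question's example (the names nested and expanded are only illustrative):

```python
import pandas as pd

# The example frame from the question: one dict per row in 'baz'.
nested = pd.DataFrame({
    'foo': [1, 2],
    'bar': ['first', 'second'],
    'baz': [
        {'target_y_n': 0, 'value': 0.5, 'count': 1000},
        {'target_y_n': 1, 'value': 1, 'count': 10000},
    ],
})

# Expand each dict in 'baz' into its own columns.
expanded = pd.json_normalize(nested['baz'].tolist())

# Put the cohort keys (foo, bar) back next to the expanded columns.
df = pd.concat([nested[['foo', 'bar']].reset_index(drop=True), expanded], axis=1)
```

If each cohort holds a list of such dicts instead, calling nested.explode('baz') before the normalization step should give one row per entry.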

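import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be flat, with columns value, count and target_y_n.
# ci=None turns off error bars (newer seaborn spells this errorbar=None).
sns.catplot(x='value', y='count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)

sns.catplot(x='value', y='count', data=df, kind='box', hue='target_y_n', dodge=False)

# Plain matplotlib: overlay the two target groups as bars.
plt.bar(df[df['target_y_n'] == 0]['value'], df[df['target_y_n'] == 0]['count'], width=1)
plt.bar(df[df['target_y_n'] == 1]['value'], df[df['target_y_n'] == 1]['count'], width=1)
plt.legend(['Target=0', 'Target=1'])

sns.barplot(x='value', y='count', data=df, hue='target_y_n', dodge=False, ci=None)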

Finally, have a look at seaborn's FacetGrid class to extend the comparison across groups.

g = sns.FacetGrid(df, col='target_y_n', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', ci=None)

In your case you would have something like:

g = sns.FacetGrid(df, col='target_y_n', row='cohort', hue='target_y_n')
g = g.map(sns.barplot, 'value', 'count', ci=None)
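With pre-aggregated counts, cohorts of very different size end up on very different scales in such a grid. One option, which is not part of the original answer, is to convert the counts into within-group shares before plotting. The sketch below uses a small made-up frame; the cohort labels, the numbers and the share column name are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical flat frame: one row per (cohort, target_y_n, value) with its count.
df = pd.DataFrame({
    'cohort': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'target_y_n': [0, 0, 1, 1, 0, 0, 1, 1],
    'value': [0.5, 1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 1.0],
    'count': [900, 100, 30, 70, 50, 50, 2, 8],
})

# Total count of the (cohort, target_y_n) group each row belongs to.
group_total = df.groupby(['cohort', 'target_y_n'])['count'].transform('sum')

# Share of each value within its group; the shares sum to 1 per group.
df['share'] = df['count'] / group_total
```

Passing 'share' instead of 'count' to the mapped plotting function then puts every panel on the same 0 to 1 scale, which tends to make shapes easier to compare than raw counts.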

And a qqplot option:

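import matplotlib.pyplot as plt
from scipy import stats

def qqplot(x, y, **kwargs):
    # With fit=False, probplot returns (theoretical quantiles, ordered sample values).
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)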

g = sns.FacetGrid(df, col='cohort', hue='target_y_n')
g = g.map(qqplot, 'value', 'count')
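Because the data here is stored as (value, count) pairs rather than raw observations, a quantile comparison of target 0 against target 1 needs quantiles that take the counts into account. The following is a rough sketch of that idea and not the answer's method: the helper name quantiles_from_counts and the sample numbers are hypothetical, and the quantile is taken as the first value whose cumulative share reaches the requested probability.

```python
import numpy as np

def quantiles_from_counts(values, counts, probs):
    """Approximate quantiles of a distribution given as (value, count) pairs."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # Sort by value so the cumulative counts form an empirical CDF.
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    cdf = np.cumsum(counts) / counts.sum()

    # For each probability, pick the first value whose cumulative share reaches it.
    idx = np.searchsorted(cdf, probs, side='left')
    return values[np.clip(idx, 0, len(values) - 1)]

# Made-up compressed distributions for target 0 and target 1 of one cohort.
probs = np.linspace(0.05, 0.95, 19)
q0 = quantiles_from_counts([10, 20, 30], [5, 3, 2], probs)
q1 = quantiles_from_counts([10, 20, 30], [1, 3, 6], probs)
```

Scattering q0 against q1 (for example with plt.scatter) gives one QQ panel per cohort; points far from the diagonal indicate where the two target groups differ.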

Source: https://stackoverflow.com/questions/55581469/comapring-compressed-distribution-per-cohort

Tags

pandas