问题
How can I easily compare the distributions of multiple cohorts?
Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.
It was created as:
SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender
where compress_distributionUDF
simply takes a list of tuples and returns the counts per group.
This leaves me with a list of
Row(distribution_value=60.0, count=314251, target_y_n=0)
nested inside a pandas.Series
, but one per each chohort.
Basically, it is similar to:
pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})
and I wonder how to compare distributions:
- within a cohort
0
vs.1
oftarget_y_n
- over multiple cohorts
in a way which is visually still understandable and not only a mess.
edit
For a single cohort Plotting pre aggregated data in python could be the answer, but how can multiple cohorts be compared (not just in a loop) as this leads to too many plots to compare?
回答1:
I am still quite confused but we can start from this and see where it goes. From your example, I am focusing on baz
as it is not clear to me what foo
and bar
are (I assume cohorts).
So let focus on baz
and plot the different distributions according to target_y_n
.
sns.catplot('value','count',data=df, kind='bar',hue='target_y_n',dodge=False,ci=None)
sns.catplot('value','count',data=df, kind='box',hue='target_y_n',dodge=False)
plt.bar(df[df['target_y_n']==0]['value'],df[df['target_y_n']==0]['count'],width=1)
plt.bar(df[df['target_y_n']==1]['value'],df[df['target_y_n']==1]['count'],width=1)
plt.legend(['Target=0','Target=1'])
sns.barplot('value','count',data=df, hue = 'target_y_n',dodge=False,ci=None)
Finally try to have a look at the FacetGrid
class to extend your comparison (see here).
g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
In your case you would have something like:
g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
And a qqplot option:
from scipy import stats
def qqplot(x, y, **kwargs):
_, xr = stats.probplot(x, fit=False)
_, yr = stats.probplot(y, fit=False)
plt.scatter(xr, yr, **kwargs)
g=sns.FacetGrid(df,col='cohort',hue = 'target_y_n')
g=g.map(qqplot,'value','count')
来源:https://stackoverflow.com/questions/55581469/comapring-compressed-distribution-per-cohort