Comparing compressed distributions per cohort

让人想犯罪 __ submitted on 2020-01-06 07:59:07

Question


How can I easily compare the distributions of multiple cohorts?

Usually, seaborn.distplot (https://seaborn.pydata.org/generated/seaborn.distplot.html) would be a great tool for comparing distributions visually. However, because of the size of my dataset, I had to compress it and keep only the counts.

The compressed data was created with:

SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender

where compress_distributionUDF simply takes a list of tuples and returns the counts per group.

This leaves me with a list of

Row(distribution_value=60.0, count=314251, target_y_n=0)

nested inside a pandas.Series, with one such list per cohort.

Basically, it is similar to:

pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})

and I wonder how to compare distributions:

  • within a cohort (target_y_n 0 vs. 1)
  • across multiple cohorts

in a way that is still visually understandable rather than a mess.

Edit

For a single cohort, the approach from "Plotting pre aggregated data in python" could be the answer. But how can multiple cohorts be compared without simply looping over them, which produces too many plots to compare?


Answer 1:


I am still somewhat confused, but we can start from here and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume they identify the cohorts).
So let's focus on baz and plot the different distributions according to target_y_n.
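The snippets that follow use a flat df with one row per (cohort, target_y_n, value) and the columns value, count and target_y_n. A minimal sketch of how such a frame might be built from the nested one in the question, assuming each baz cell holds a list of dicts (the sample data, the helper name flatten_cohorts and the derived cohort label are made up for illustration):

```python
import pandas as pd

# Hypothetical nested frame: one row per cohort, baz holds the compressed counts.
nested = pd.DataFrame({
    "foo": [1, 2],
    "bar": ["first", "second"],
    "baz": [
        [{"target_y_n": 0, "value": 0.5, "count": 1000},
         {"target_y_n": 1, "value": 0.5, "count": 200}],
        [{"target_y_n": 0, "value": 1.0, "count": 10000},
         {"target_y_n": 1, "value": 1.0, "count": 500}],
    ],
})


def flatten_cohorts(frame, nested_col="baz"):
    """Turn one row per cohort into one row per (cohort, bin)."""
    # One row per list element; the cohort columns are repeated.
    exploded = frame.explode(nested_col).reset_index(drop=True)
    # Expand each dict into its own columns.
    details = pd.json_normalize(exploded[nested_col].tolist())
    flat = pd.concat([exploded.drop(columns=nested_col), details], axis=1)
    # Assumption: foo and bar together identify a cohort.
    flat["cohort"] = flat["foo"].astype(str) + "/" + flat["bar"].astype(str)
    return flat


df = flatten_cohorts(nested)
```

If the cells hold Spark Row objects rather than dicts, converting each with row.asDict() first should give the same shape.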

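import matplotlib.pyplot as plt
import seaborn as sns

# Bars per value, coloured by target. x and y are passed as keywords,
# which newer seaborn releases expect.
sns.catplot(x='value', y='count', data=df, kind='bar', hue='target_y_n', dodge=False, ci=None)

sns.catplot(x='value', y='count', data=df, kind='box', hue='target_y_n', dodge=False)

# The same overlay in plain matplotlib
target0 = df[df['target_y_n'] == 0]
target1 = df[df['target_y_n'] == 1]
plt.bar(target0['value'], target0['count'], width=1)
plt.bar(target1['value'], target1['count'], width=1)
plt.legend(['Target=0', 'Target=1'])

sns.barplot(x='value', y='count', data=df, hue='target_y_n', dodge=False, ci=None)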

Finally, have a look at the FacetGrid class (see the seaborn documentation) to extend your comparison.

g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)

In your case you would have something like:

g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
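With raw counts, large cohorts can visually swamp small ones in a grid like this. One option, not part of the original question, is to convert the counts into within-group shares first. A minimal sketch, assuming the flat column names used above (the helper name add_group_share and the sample numbers are hypothetical):

```python
import pandas as pd


def add_group_share(flat, group_cols=("cohort", "target_y_n"), count_col="count"):
    """Add a 'share' column: each count divided by its group's total count."""
    out = flat.copy()
    totals = out.groupby(list(group_cols))[count_col].transform("sum")
    out["share"] = out[count_col] / totals
    return out


# Made-up pre-aggregated data: cohort b is far larger than cohort a.
example = pd.DataFrame({
    "cohort": ["a", "a", "a", "a", "b", "b"],
    "target_y_n": [0, 0, 1, 1, 0, 0],
    "value": [50.0, 60.0, 50.0, 60.0, 50.0, 60.0],
    "count": [100, 300, 10, 30, 5000, 5000],
})
shares = add_group_share(example)
```

The share column can then be used in place of count as the y variable in the FacetGrid calls above, so that every panel sums to 1 and the shapes become comparable.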

A Q-Q plot is another option:

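from scipy import stats
import matplotlib.pyplot as plt

def qqplot(x, y, **kwargs):
    # probplot returns (theoretical quantiles, ordered values); keep the ordered values
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)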

g=sns.FacetGrid(df,col='cohort',hue = 'target_y_n')
g=g.map(qqplot,'value','count')
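Note that probplot treats every row as a single observation, whereas here each row stands for count observations. For pre-aggregated data, a count-weighted empirical CDF may be closer to what is wanted. A rough sketch with NumPy only (the function names weighted_ecdf and weighted_quantile are hypothetical, and the quantile is the simple "first value whose cumulative share reaches q" variant, without interpolation):

```python
import numpy as np


def weighted_ecdf(values, counts):
    """Return sorted values and the cumulative share of observations up to each."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(values)
    sorted_values = values[order]
    cumulative = np.cumsum(counts[order]) / counts.sum()
    return sorted_values, cumulative


def weighted_quantile(values, counts, q):
    """Smallest value whose cumulative share is at least q."""
    sorted_values, cumulative = weighted_ecdf(values, counts)
    idx = np.searchsorted(cumulative, q, side="left")
    # Guard against rounding pushing the index past the last element.
    return sorted_values[min(idx, len(sorted_values) - 1)]


# Made-up bins for one cohort/target combination.
x, F = weighted_ecdf([60.0, 50.0, 70.0], [300, 100, 600])
median = weighted_quantile([60.0, 50.0, 70.0], [300, 100, 600], 0.5)
```

Drawing one step curve per cohort, for example with plt.step(x, F, where="post"), puts many cohorts on a single set of axes, and plotting the weighted quantiles of target 0 against those of target 1 gives a Q-Q comparison that respects the counts.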



Source: https://stackoverflow.com/questions/55581469/comapring-compressed-distribution-per-cohort
