Using describe() with weighted data — mean, standard deviation, median, quantiles

Submitted by 烈酒焚心 on 2020-01-20 06:00:05

Question


I'm fairly new to Python and pandas (SAS has been my workhorse analytical platform), so I apologize in advance if this has already been asked and answered. I've searched the documentation and this site and haven't found an answer yet.

I've got a dataframe (called resp) containing respondent level survey data. I want to perform some basic descriptive statistics on one of the fields (called anninc [short for annual income]).

resp["anninc"].describe()

Which gives me the basic stats:

count     76310.000000
mean      43455.874862
std       33154.848314
min           0.000000
25%       20140.000000
50%       34980.000000
75%       56710.000000
max      152884.330000
dtype: float64

But there's a catch. Because of how the sample was built, the respondent data has to be weight-adjusted so that not every respondent counts equally in the analysis. Another column in the dataframe (called tufnwgrp) holds the weight that should be applied to each record.

In my prior SAS life, most procs had an option to process data with weights like this. For example, a standard proc univariate producing the same unweighted results would look something like this:

proc univariate data=resp;
  var anninc;
  output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;

And the same analysis using weighted data would look something like this:

proc univariate data=resp;
  var anninc;
  weight tufnwgrp;
  output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;

Is there a similar weighting option available in pandas for methods like describe()?


Answer 1:


There is a statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer here on a similar question.

import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

# Toy data: the values 1..100, each weighted by its own value
df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})

# ddof=1 requests the sample (n - 1) form of the standard deviation
wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)

print(wdf.mean)
print(wdf.std)
print(wdf.quantile([0.25, 0.50, 0.75]))

67.0
23.6877840059
p
0.25    50
0.50    71
0.75    87

I don't use SAS, but this gives the same answer as the Stata command:

sum x [fw=wt], detail

Stata actually has a few weight options, and in this case it gives a slightly different answer if you specify aw (analytic weights) instead of fw (frequency weights). Also, Stata requires fw to be an integer, whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think. This is starting to get into the weeds, but there is a great discussion of weighting issues for calculating the standard deviation here.
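To make the fw versus aw difference concrete, here is a minimal NumPy sketch of the two standard deviation conventions on the same toy data. The frequency-weight form divides by (sum of weights - 1), which is what ddof=1 does above. The analytic-weight form shown here rescales the weights to sum to the number of rows and divides by (n - 1); I'm assuming that is the convention Stata's aw follows, so treat that half as illustrative rather than authoritative. The names std_fw and std_aw are just for this example.

```python
import numpy as np

# Same toy data as above: values 1..100, each weighted by its own value
x = np.arange(1, 101, dtype=float)
w = np.arange(1, 101, dtype=float)

mean = np.average(x, weights=w)
# Weighted sum of squared deviations, shared by both conventions
ss = np.sum(w * (x - mean) ** 2)

# Frequency weights: each weight is a count of identical observations,
# so the effective sample size is sum(w) and the denominator is sum(w) - 1
std_fw = np.sqrt(ss / (w.sum() - 1))

# Analytic-style weights (assumed convention): rescale the weights to sum
# to the number of rows n, then use the usual n - 1 denominator
n = len(x)
std_aw = np.sqrt(ss / w.sum() * n / (n - 1))

print(std_fw, std_aw)
```

The two results differ only in the denominator, so the gap shrinks as the sum of the weights gets close to the number of rows.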

Also note that DescrStatsW does not appear to include functions for min and max. As long as all of your weights are non-zero this is not a problem, because weights don't affect the min and max. If you do have some zero weights, you may want the min and max taken only over rows that actually carry weight, which is easy to calculate in pandas:

df.x[ df.wt > 0 ].min()
df.x[ df.wt > 0 ].max()
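Putting the pieces together, something like the following could stand in for a weighted describe(). This is only a sketch using NumPy and pandas, not a pandas or statsmodels API: weighted_describe is a hypothetical helper name, it assumes the weights are frequency weights, and its quantile rule (the smallest value whose cumulative weight reaches p times the total weight) is one simple definition among several.

```python
import numpy as np
import pandas as pd


def weighted_describe(values, weights):
    """describe()-like summary, treating weights as frequency weights (assumption)."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Rows with zero weight contribute nothing, so drop them up front;
    # this also makes min and max ignore them
    keep = w > 0
    x, w = x[keep], w[keep]

    total = w.sum()
    mean = np.sum(w * x) / total
    # Frequency-weight standard deviation: denominator is sum(w) - 1
    std = np.sqrt(np.sum(w * (x - mean) ** 2) / (total - 1))

    # Weighted quantile: smallest value whose cumulative weight
    # reaches p * total (no interpolation between values)
    order = np.argsort(x)
    xs, cw = x[order], np.cumsum(w[order])
    probs = np.array([0.25, 0.50, 0.75])
    idx = np.minimum(np.searchsorted(cw, probs * total), len(xs) - 1)
    q25, q50, q75 = xs[idx]

    return pd.Series({
        'count': total,  # sum of weights, not number of rows
        'mean': mean,
        'std': std,
        'min': xs[0],
        '25%': q25,
        '50%': q50,
        '75%': q75,
        'max': xs[-1],
    })


df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})
summary = weighted_describe(df['x'], df['wt'])
print(summary)
```

On the toy data this reproduces the mean, standard deviation, and quartiles printed by DescrStatsW above. For the data in the question, the call would be weighted_describe(resp['anninc'], resp['tufnwgrp']).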


Source: https://stackoverflow.com/questions/17689099/using-describe-with-weighted-data-mean-standard-deviation-median-quantil
