We are working to get off of SAS and onto Python/Pandas. However, one thing we are having trouble with is creating a replacement for PROC SUMMARY
(AKA PR
Well, here's a quickie that does get at two issues (but still requires a different function for weighted mean). Mostly it uses the trick here (credit to @DSM) to get around your empty group by doing groupby(lambda x: True). It would be great if there were a kwarg for 'weights' on stuff like means, but there is not, to my knowledge. Apparently there is a package for weighted quantiles mentioned here, based on numpy, but I don't know anything about it. Great project btw!
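To make the single-group trick concrete, here is a minimal sketch on a made-up DataFrame (the column names and values are purely illustrative, not from the question). The function passed to groupby is called on each index label, and since it always returns True, every row lands in one group:

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 2.0]})

# The lambda receives each index label and always returns True,
# so all rows share one group key.
grouped = df.groupby(lambda idx: True)

# Aggregations now behave like "no grouping variables" in a summary.
total = grouped["x"].sum()
print(total)
```

The result is still indexed by the group key (a single entry, True), so it can be handled with the same code path as a real grouping.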
(Note that the names are mostly the same as yours: I just added a '2' to wmean_grouped and my_summary. Otherwise you can use the same calling interface.)
import numpy as np
import pandas as pd


def wmean_grouped2(group, var_name_in, var_name_weight):
    # Weighted mean of one column within a single group.
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()


# Maps a function name to the aggregation applied to each group.
FUNCS = {"mean": np.mean,
         "sum": np.sum,
         "count": np.count_nonzero}  # note: counts nonzero values, not rows


def my_summary2(
        data,
        var_names_in,
        var_names_out,
        var_functions,
        var_name_weight=None,
        var_names_group=None):
    result = pd.DataFrame()

    if var_names_group is None:
        # No grouping variables: put every row into one group.
        grouped = data.groupby(lambda x: True)
    else:
        grouped = data.groupby(var_names_group)

    for var_name_in, var_name_out, var_function in \
            zip(var_names_in, var_names_out, var_functions):
        if var_function == "wmean":
            # Weighted mean needs both columns, so apply to the whole group.
            func = lambda x: wmean_grouped2(x, var_name_in, var_name_weight)
            result[var_name_out] = pd.Series(grouped.apply(func))
        else:
            func = FUNCS[var_function]
            result[var_name_out] = grouped[var_name_in].apply(func)

    return result
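On the missing 'weights' kwarg: np.mean does not take weights, but np.average does, so it is one possible substitute for the hand-written weighted mean. The sketch below is only an illustration under assumed data. The DataFrame, its column names, and the helper wmean are all hypothetical and not part of the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are made up for illustration.
df = pd.DataFrame({
    "region": ["a", "a", "b", "b"],
    "income": [10.0, 20.0, 30.0, 50.0],
    "weight": [1.0, 3.0, 1.0, 1.0],
})


def wmean(group, value_col, weight_col):
    # np.average accepts a weights argument, unlike np.mean.
    return np.average(group[value_col], weights=group[weight_col])


# Weighted mean per region, iterating over the (key, sub-frame) pairs.
by_region = pd.Series({
    key: wmean(group, "income", "weight")
    for key, group in df.groupby("region")
})

# Weighted mean over the whole frame, using the single-group trick.
overall = pd.Series({
    key: wmean(group, "income", "weight")
    for key, group in df.groupby(lambda idx: True)
})

print(by_region)
print(overall)
```

Iterating over the groups instead of calling apply is just a design choice for the sketch: it keeps the weighted mean explicit and works the same way for the grouped and ungrouped cases.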