Creating a partial SAS PROC SUMMARY replacement in Python/Pandas

生来不讨喜 2021-01-04 20:32

We are working to get off of SAS and onto Python/Pandas. However, one thing we are having trouble with is creating a replacement for PROC SUMMARY (AKA PR

1 Answer
  • 2021-01-04 20:46

    Well, here's a quickie that gets at two of the issues (but still requires a separate function for the weighted mean). Mostly it uses a trick (credit to @DSM) to get around your empty group: groupby(lambda x: True), which puts every row into a single group. It would be great if there were a 'weights' kwarg on things like mean, but to my knowledge there is not. Apparently there is a numpy-based package for weighted quantiles, but I don't know anything about it. Great project, btw!
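
To make the single-group trick concrete, here is a minimal sketch on made-up data (the DataFrame and its column names `x` and `w` are hypothetical, not from the question). When groupby is given a function, it calls that function on each index label, so a lambda that always returns True sends every row to one group.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 2.0]})

# The lambda receives each index label and always returns True,
# so all rows land in one group whose key is True.
whole = df.groupby(lambda idx: True)

# Aggregating now gives a one-row result, much like a summary with no grouping.
totals = whole["x"].sum()
print(totals)
```

The result has the same shape as a grouped result (a Series indexed by group key), which is why the grouped and ungrouped cases can share one code path.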

    (Note that the names are mostly the same as yours; I just added a '2' to wmean_grouped and my_summary, so otherwise you can use the same calling interface.)

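    import numpy as np
    import pandas as pd
    
    def wmean_grouped2(group, var_name_in, var_name_weight):
        # Weighted mean of one column within a single group.
        d = group[var_name_in]
        w = group[var_name_weight]
        return (d * w).sum() / w.sum()
    
    # Maps a statistic name to the function applied to each group's column.
    FUNCS = {"mean": np.mean,
             "sum": np.sum,
             # pd.Series.count counts non-missing values; np.count_nonzero
             # would skip zeros and count NaN.
             "count": pd.Series.count}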
    
    def my_summary2(
            data,
            var_names_in,
            var_names_out,
            var_functions,
            var_name_weight=None,
            var_names_group=None):
    
        result = pd.DataFrame()
    
        if var_names_group is None:
            # No grouping variables: put every row into one group keyed True.
            grouped = data.groupby(lambda x: True)
        else:
            grouped = data.groupby(var_names_group)
    
        # One output column per (input column, output name, statistic) triple.
        for var_name_in, var_name_out, var_function in zip(
                var_names_in, var_names_out, var_functions):
            if var_function == "wmean":
                # The weighted mean needs two columns, so apply to whole groups.
                func = lambda x: wmean_grouped2(x, var_name_in, var_name_weight)
                result[var_name_out] = pd.Series(grouped.apply(func))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = grouped[var_name_in].apply(func)
    
        return result
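
As a possible alternative to the hand-written weighted mean: numpy.average does accept a `weights` argument, so a per-group weighted mean could be sketched as below. This is an illustration under assumptions, not part of the answer's code; the toy data, the column names `g`, `x`, `w`, and the helper name `wmean` are all hypothetical. Note that numpy.average does not skip missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "g": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 10.0, 20.0],
    "w": [1.0, 3.0, 1.0, 1.0],
})

def wmean(group, value_col, weight_col):
    # numpy.average computes sum(x * w) / sum(w) when weights are given.
    return np.average(group[value_col], weights=group[weight_col])

# Select only the value and weight columns so that each group passed to
# wmean contains just the data it needs; extra positional arguments to
# apply are forwarded to wmean.
wmeans = df.groupby("g")[["x", "w"]].apply(wmean, "x", "w")
print(wmeans)
```

For group "a" this is (1*1 + 3*3) / (1 + 3) = 2.5, and for group "b" it is (10 + 20) / 2 = 15.0.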
    