quantile normalization on pandas dataframe

余生分开走 2020-12-14 21:44

Simply put, how can I apply quantile normalization to a large Pandas dataframe (probably 2,000,000 rows) in Python?

PS. I know that there is a package named rpy2 w

7 answers
  •  醉梦人生
    2020-12-14 22:03

    It may be more robust to take the median of each row of the sorted columns rather than the mean (based on code from Shawn. L):

    import numpy as np
    import pandas as pd

    def quantileNormalize(df_input):
        df = df_input.copy()
        # Sort each column independently (NaNs first) so that row i holds
        # the i-th smallest value of every column.
        dic = {}
        for col in df:
            dic[col] = df[col].sort_values(na_position='first').values
        sorted_df = pd.DataFrame(dic)
        # Reference distribution: median across columns at each rank position
        # (the original version used the mean).
        # rank = sorted_df.mean(axis=1).tolist()
        rank = sorted_df.median(axis=1).tolist()
        for col in df:
            # Percentile rank in (0, 1] of each value within its column; NaN stays NaN.
            t = df[col].rank(pct=True, method='max').values
            # Replace each value with the matching percentile of the reference
            # distribution, keeping missing values missing.
            df[col] = [np.nanpercentile(rank, i * 100) if not np.isnan(i) else np.nan for i in t]
        return df
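
The function above calls `np.nanpercentile` once per cell in a Python loop, which may be slow on a frame of around 2,000,000 rows. The following is a minimal vectorized sketch of the same idea, not a drop-in replacement: it assumes the frame is fully numeric with no missing values, it breaks ties by position instead of using `method='max'`, and it maps each value directly to the reference value at its rank rather than interpolating percentiles. The name `quantile_normalize_sketch` and the `reducer` parameter are hypothetical and introduced here only for illustration.

```python
import numpy as np
import pandas as pd


def quantile_normalize_sketch(df, reducer=np.median):
    """Rank-based quantile normalization (assumes numeric data, no NaNs)."""
    values = df.to_numpy(dtype=float)

    # Sort each column independently: row i now holds the i-th smallest
    # value of every column.
    sorted_vals = np.sort(values, axis=0)

    # Reference distribution: one value per rank position, computed across
    # columns (median by default, pass np.mean for the classic variant).
    reference = reducer(sorted_vals, axis=1)

    # 0-based rank of every value within its own column.
    # A stable sort means tied values are ranked in order of appearance.
    order = np.argsort(values, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0)

    # Replace each value with the reference value at its rank.
    normalized = reference[ranks]
    return pd.DataFrame(normalized, index=df.index, columns=df.columns)
```

After this transformation every column contains exactly the same set of values (the reference distribution), arranged according to the original ordering within that column. Passing `reducer=np.mean` switches back to the mean-based reference.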
    
