Quantile normalization on a pandas DataFrame

余生分开走 2020-12-14 21:44

Simply put, how can I apply quantile normalization to a large pandas DataFrame (roughly 2,000,000 rows) in Python?

PS. I know that there is a package named rpy2 w

7 answers
  • 2020-12-14 21:58

    The code below gives the same result as preprocessCore::normalize.quantiles.use.target, and I find it simpler and clearer than the other solutions here. Performance should also remain good for very large arrays.

    import numpy as np
    
    def quantile_normalize_using_target(x, target):
        """
        Both `x` and `target` are numpy arrays of equal lengths.
        """
    
        target_sorted = np.sort(target)
    
        return target_sorted[x.argsort().argsort()]
    

    Once you have a pandas.DataFrame, it is easy to apply:

    quantile_normalize_using_target(df[0].to_numpy(),
                                    df[1].to_numpy())
    

    (In the example above, the first column is normalized using the second column as the reference distribution.)
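
    The double `argsort` is what turns values into ranks: the first call gives the order that would sort `x`, and the second gives each element's position in that order. A small sketch with made-up arrays (the values of `x` and `target` below are illustrative, not from the answer):

```python
import numpy as np


def quantile_normalize_using_target(x, target):
    """Map each value of `x` onto the same-rank value of `target` (equal lengths)."""
    target_sorted = np.sort(target)
    ranks = x.argsort().argsort()  # 0-based rank of each element of x
    return target_sorted[ranks]


# Illustrative data (an assumption, not taken from the thread).
x = np.array([5.0, 2.0, 3.0, 4.0])
target = np.array([10.0, 40.0, 20.0, 30.0])

ranks = x.argsort().argsort()
result = quantile_normalize_using_target(x, target)

print(ranks)   # [3 0 1 2]
print(result)  # [40. 10. 20. 30.]
```

    Tied values in `x` receive distinct consecutive ranks here, so they map to different target values.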

  • 2020-12-14 22:02

    Using the example dataset from the Wikipedia article:

    import pandas as pd
    
    df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                       'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                       'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
    
    df
    Out: 
       C1  C2  C3
    A   5   4   3
    B   2   1   4
    C   3   4   6
    D   4   2   8
    

    For each rank, the mean value can be calculated with the following:

    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    
    rank_mean
    Out: 
    1    2.000000
    2    3.000000
    3    4.666667
    4    5.666667
    dtype: float64
    

    Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:

    df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
    Out: 
             C1        C2        C3
    A  5.666667  4.666667  2.000000
    B  2.000000  2.000000  3.000000
    C  3.000000  4.666667  4.666667
    D  4.666667  3.000000  5.666667
    
  • 2020-12-14 22:02

    One thing worth noting is that both ayhan's and shawn's code use the smaller rank mean for ties, whereas normalize.quantiles() from the R package preprocessCore uses the mean of the rank means for ties.

    Using the above example:

    > df
    
       C1  C2  C3
    A   5   4   3
    B   2   1   4
    C   3   4   6
    D   4   2   8
    
    > normalize.quantiles(as.matrix(df))
    
             C1        C2        C3
    A  5.666667  5.166667  2.000000
    B  2.000000  2.000000  3.000000
    C  3.000000  5.166667  4.666667
    D  4.666667  3.000000  5.666667
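
    The tie rule described above can be sketched in pandas as follows. This is an illustration rather than a port of preprocessCore: the function name `quantile_normalize_mean_ties` is hypothetical, the input is assumed to be numeric with no missing values, and each tied value receives the mean of the rank means over the rank positions its tie group occupies.

```python
import numpy as np
import pandas as pd


def quantile_normalize_mean_ties(df):
    """Quantile-normalize columns; tied values get the mean of their rank means.

    Hypothetical helper. Assumes a numeric DataFrame with no missing values.
    """
    # Rank means: sort each column independently, then average across columns.
    rank_mean = np.sort(df.to_numpy(dtype=float), axis=0).mean(axis=1)
    # Prefix sums make it cheap to average rank_mean over a run of ranks.
    prefix = np.concatenate(([0.0], np.cumsum(rank_mean)))

    normalized = {}
    for col in df.columns:
        lo = df[col].rank(method="min").to_numpy().astype(int)  # first rank in tie group (1-based)
        hi = df[col].rank(method="max").to_numpy().astype(int)  # last rank in tie group (1-based)
        normalized[col] = (prefix[hi] - prefix[lo - 1]) / (hi - lo + 1)
    return pd.DataFrame(normalized, index=df.index)


df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

result = quantile_normalize_mean_ties(df)
print(result)
```

    On the example data this reproduces the matrix shown above. Agreement with R for tie groups of three or more values has not been checked here.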
    
  • 2020-12-14 22:03

    It is possibly more robust to use the median of each row rather than the mean (based on code from Shawn. L):

    import numpy as np
    import pandas as pd
    
    def quantileNormalize(df_input):
        df = df_input.copy()
        # Sort each column independently (NaNs first) to line up values by rank.
        dic = {}
        for col in df:
            dic[col] = df[col].sort_values(na_position='first').values
        sorted_df = pd.DataFrame(dic)
        # Reference distribution: median across columns at each rank
        # (use sorted_df.mean(axis=1) for the mean-based version).
        rank = sorted_df.median(axis=1).tolist()
        for col in df:
            # Percentile rank in (0, 1] for each value in the column.
            t = df[col].rank(pct=True, method='max').values
            # Replace each value with the matching percentile of the reference
            # distribution; missing values stay NaN.
            df[col] = [np.nanpercentile(rank, i * 100) if not np.isnan(i) else np.nan for i in t]
        return df
    
  • 2020-12-14 22:18

    As pointed out by @msg, none of the solutions here take ties into account. I made a Python package called qnorm, which handles ties and correctly recreates the Wikipedia quantile normalization example:

    import pandas as pd
    import qnorm
    
    df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                       'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                       'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
    
    print(qnorm.quantile_normalize(df))
             C1        C2        C3
    A  5.666667  5.166667  2.000000
    B  2.000000  2.000000  3.000000
    C  3.000000  5.166667  4.666667
    D  4.666667  3.000000  5.666667
    

    It can be installed with either pip or conda:

    pip install qnorm
    

    or

    conda config --add channels conda-forge
    conda install qnorm
    
  • 2020-12-14 22:22

    OK, I implemented the method myself, with relatively high efficiency.

    In hindsight the logic seems fairly simple, but I decided to post it here anyway for anyone who is as confused as I was when I couldn't find existing code by googling.

    The code is on GitHub: Quantile Normalize
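
    The linked repository is not reproduced here. As a rough sketch of the usual sort, average, and rank-back procedure, assuming a numeric frame without missing values, a fully vectorized NumPy version might look like this (the name `quantile_normalize_fast` is hypothetical, and ties are broken by row order rather than averaged):

```python
import numpy as np
import pandas as pd


def quantile_normalize_fast(df):
    """Vectorized quantile normalization (hypothetical helper, ties not averaged)."""
    values = df.to_numpy(dtype=float)
    # For each column, the row indices that would sort it.
    order = np.argsort(values, axis=0, kind="stable")
    sorted_values = np.take_along_axis(values, order, axis=0)
    # Mean across columns at each rank position.
    rank_mean = sorted_values.mean(axis=1)
    # Write the rank means back to the original row positions.
    out = np.empty_like(values)
    np.put_along_axis(out, order, rank_mean[:, None], axis=0)
    return pd.DataFrame(out, index=df.index, columns=df.columns)


df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

normalized = quantile_normalize_fast(df)
print(normalized)
```

    After normalization every column contains the same set of values, namely the rank means, which is the defining property of the method when ties are not averaged.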
