quantile normalization on pandas dataframe

余生分开走 2020-12-14 21:44

Simply speaking, how to apply quantile normalization on a large Pandas dataframe (probably 2,000,000 rows) in Python?

PS. I know that there is a package named rpy2 w

7 answers
  • 2020-12-14 22:22

    I am new to pandas and late to the question, but I think this answer might also be of use. It builds on the great answer from @ayhan:

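    import pandas as pd

    def quantile_normalize(dataframe, cols, pandas=pd):

        # copy the dataframe, keeping only the numeric columns listed in cols
        df = dataframe.copy().filter(items=cols)

        # columns of the original dataframe that are not listed in cols
        non_numeric = dataframe.filter(items=[col for col in dataframe if col not in cols])

        # mean of the values at each rank position across columns
        # (method='first' breaks ties by order of appearance)
        rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

        # replace each value with the mean for its rank
        # (method='min' gives tied values the same, lowest rank)
        norm = df.rank(method='min').stack().astype(int).map(rank_mean).unstack()

        # re-attach the columns that were not normalized
        result = pandas.concat([norm, non_numeric], axis=1)
        return result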
    

    The main difference is that this version is closer to some real-world applications. Often you just have a matrix of numerical data, in which case the original answer is sufficient.
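    For that purely numeric case, a minimal sketch could look like the block below. It reuses the same two rank/map steps as the function above; the helper name quantile_normalize_numeric is hypothetical, and the sketch assumes every column is numeric and contains no NaNs.

```python
import pandas as pd

def quantile_normalize_numeric(df):
    # hypothetical helper: assumes all columns are numeric and free of NaNs
    # mean of the values found at each rank position across all columns
    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    # map every value's rank (tied values share the lowest rank) to that mean
    return df.rank(method='min').stack().astype(int).map(rank_mean).unstack()

numeric = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
}, index=['a', 'b', 'c', 'd'])

normalized = quantile_normalize_numeric(numeric)
print(normalized)
```

    No column list is needed here because nothing has to be set aside and merged back.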

    Sometimes you have text-based data in there as well. This function lets you pass the names of your numerical columns as cols and runs quantile normalization on those columns only. At the end it merges the non-numeric (or otherwise not-to-be-normalized) columns from your original dataframe back in.

    For example, if you add some metadata (the char column) to the wiki example:

    df = pd.DataFrame({
        'rep1': [5, 2, 3, 4],
        'rep2': [4, 1, 4, 2],
        'rep3': [3, 4, 6, 8],
        'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
    }, index = ['a', 'b', 'c', 'd'])
    

    you can then call

    quantile_normalize(df, ['rep1', 'rep2', 'rep3'])
    

    to get

        rep1        rep2        rep3        char
    a   5.666667    4.666667    2.000000    gene_a
    b   2.000000    2.000000    3.000000    gene_b
    c   3.000000    4.666667    4.666667    gene_c
    d   4.666667    3.000000    5.666667    gene_d
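    To see where these numbers come from, the intermediate steps can be inspected on the same data. This is only an illustrative sketch; the variable names first_ranks and min_ranks are not part of the function above.

```python
import pandas as pd

df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
}, index=['a', 'b', 'c', 'd'])

# ranks with ties broken by order of appearance:
# every column becomes a permutation of 1..4
first_ranks = df.rank(method='first').astype(int)

# average the values sitting at each rank position across the three columns
rank_mean = df.stack().groupby(first_ranks.stack()).mean()
print(rank_mean)  # rank 1 -> 2.0, 2 -> 3.0, 3 -> 4.666667, 4 -> 5.666667

# ranks where tied values share the lowest rank:
# rep2 contains two 4s, and both receive rank 3
min_ranks = df.rank(method='min').astype(int)
print(min_ranks['rep2'].tolist())  # [3, 1, 3, 2]
```

    Because both 4s in rep2 get rank 3, they both map to 4.666667 in the result, and no value in rep2 receives the rank-4 mean of 5.666667.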
    