Simply speaking, how do I apply quantile normalization to a large Pandas dataframe (probably 2,000,000 rows) in Python?
PS. I know that there is a package named rpy2 which could run R in a subprocess and apply quantile normalization in R, but I am looking for a Python/pandas solution.
I am new to pandas and late to the question, but I think this answer might also be of use. It builds on the great answer from @ayhan:
import pandas as pd

def quantile_normalize(dataframe, cols):
    # copy the dataframe and keep only the columns with numerical values
    df = dataframe.copy().filter(items=cols)
    # columns from the original dataframe not listed in cols
    non_numeric = dataframe.filter(items=[col for col in dataframe if col not in cols])
    # mean of the values at each rank; 'first' gives ties distinct ranks
    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    # map each value's minimum rank back to the corresponding rank mean
    norm = df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
    # reattach the non-numeric columns to the normalized result
    return pd.concat([norm, non_numeric], axis=1)
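As an aside, the two rank calls are where all the work happens: method='first' gives every value a distinct rank so the per-rank means are well defined, while method='min' maps tied values back to the same rank mean. Here is a minimal sketch for inspecting the intermediate rank_mean Series (the toy data is only for illustration):

import pandas as pd

small = pd.DataFrame({'rep1': [5, 2, 3, 4], 'rep2': [4, 1, 4, 2]})
# distinct integer ranks per column; ties broken by order of appearance
ranks = small.rank(method='first').stack().astype(int)
# average the values that share each rank across both columns
rank_mean = small.stack().groupby(ranks).mean()
print(rank_mean)
# 1    1.5
# 2    2.5
# 3    4.0
# 4    4.5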
The main difference is that this version is closer to some real-world applications. Often you just have matrices of numerical data, in which case the original answer is sufficient. Sometimes, though, you have text-based columns in there as well. This version lets you specify the columns cols that hold your numerical data, runs quantile normalization on those columns only, and at the end merges back the non-numeric (or not-to-be-normalized) columns from your original dataframe.
E.g. if you added some 'meta-data' (a char column) to the wiki example:
df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
    'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
}, index=['a', 'b', 'c', 'd'])
you can then call
quantile_normalize(df, ['rep1', 'rep2', 'rep3'])
to get
rep1 rep2 rep3 char
a 5.666667 4.666667 2.000000 gene_a
b 2.000000 2.000000 3.000000 gene_b
c 3.000000 4.666667 4.666667 gene_c
d 4.666667 3.000000 5.666667 gene_d
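Coming back to the scale in the question (around 2,000,000 rows): a mostly-NumPy variant of the same idea can be faster on frames that size, since it skips the stacked intermediate Series. This is a hedged sketch, assuming every column is numeric and that breaking ties by position (i.e. method='first' behavior on both the averaging and the mapping side) is acceptable, so tied values can come out slightly differently than with the pandas version above:

import numpy as np
import pandas as pd

def quantile_normalize_np(df):
    values = df.to_numpy()
    # sort each column independently, then average across columns per rank
    rank_means = np.sort(values, axis=0).mean(axis=1)
    # argsort of argsort yields each value's 0-based rank within its column
    ranks = values.argsort(axis=0).argsort(axis=0)
    return pd.DataFrame(rank_means[ranks], index=df.index, columns=df.columns)

On the wiki example above, quantile_normalize_np(df[['rep1', 'rep2', 'rep3']]) reproduces the same values everywhere except where ties occur (the repeated 4 in rep2), which is the usual caveat with position-based tie breaking.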