Simply speaking, how do I apply quantile normalization to a large Pandas dataframe (probably 2,000,000 rows) in Python?
PS. I know that there is a package named rpy2 which could run R in a subprocess and apply quantile normalization in R, but I am looking for a Python/pandas solution.
I am new to pandas and late to the question, but I think this answer might also be of use. It builds on the great answer from @ayhan:
import pandas as pd

def quantile_normalize(dataframe, cols):
    # copy the dataframe and keep only the columns with numerical values
    df = dataframe.copy().filter(items=cols)
    # columns from the original dataframe not listed in cols
    non_numeric = dataframe.filter(items=[col for col in dataframe if col not in cols])
    # mean of the values at each rank; 'first' gives ties distinct ranks
    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    # map each value's minimum rank back to the corresponding rank mean
    norm = df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
    # reattach the non-numeric columns to the normalized result
    return pd.concat([norm, non_numeric], axis=1)
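As an aside, the two rank calls are where all the work happens: method='first' gives every value a distinct rank so the per-rank means are well defined, while method='min' maps tied values back to the same rank mean. Here is a minimal sketch for inspecting the intermediate rank_mean Series (the toy data is only for illustration):

import pandas as pd

small = pd.DataFrame({'rep1': [5, 2, 3, 4], 'rep2': [4, 1, 4, 2]})
# distinct integer ranks per column; ties broken by order of appearance
ranks = small.rank(method='first').stack().astype(int)
# average the values that share each rank across both columns
rank_mean = small.stack().groupby(ranks).mean()
print(rank_mean)
# 1    1.5
# 2    2.5
# 3    4.0
# 4    4.5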
The main difference is that this version is closer to some real-world applications. Often you just have matrices of numerical data, in which case the original answer is sufficient. Sometimes, though, you have text-based columns in there as well. This version lets you specify the columns cols that hold your numerical data, runs quantile normalization on those columns only, and at the end merges back the non-numeric (or not-to-be-normalized) columns from your original dataframe.
E.g. if you added some 'meta-data' (a char column) to the wiki example:
df = pd.DataFrame({
    'rep1': [5, 2, 3, 4],
    'rep2': [4, 1, 4, 2],
    'rep3': [3, 4, 6, 8],
    'char': ['gene_a', 'gene_b', 'gene_c', 'gene_d']
}, index=['a', 'b', 'c', 'd'])
you can then call
quantile_normalize(df, ['rep1', 'rep2', 'rep3'])
to get
rep1 rep2 rep3 char
a 5.666667 4.666667 2.000000 gene_a
b 2.000000 2.000000 3.000000 gene_b
c 3.000000 4.666667 4.666667 gene_c
d 4.666667 3.000000 5.666667 gene_d
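Coming back to the scale in the question (around 2,000,000 rows): a mostly-NumPy variant of the same idea can be faster on frames that size, since it skips the stacked intermediate Series. This is a hedged sketch, assuming every column is numeric and that breaking ties by position (i.e. method='first' behavior on both the averaging and the mapping side) is acceptable, so tied values can come out slightly differently than with the pandas version above:

import numpy as np
import pandas as pd

def quantile_normalize_np(df):
    values = df.to_numpy()
    # sort each column independently, then average across columns per rank
    rank_means = np.sort(values, axis=0).mean(axis=1)
    # argsort of argsort yields each value's 0-based rank within its column
    ranks = values.argsort(axis=0).argsort(axis=0)
    return pd.DataFrame(rank_means[ranks], index=df.index, columns=df.columns)

On the wiki example above, quantile_normalize_np(df[['rep1', 'rep2', 'rep3']]) reproduces the same values everywhere except where ties occur (the repeated 4 in rep2), which is the usual caveat with position-based tie breaking.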