Simply speaking, how do I apply quantile normalization to a large pandas DataFrame (probably 2,000,000 rows) in Python?
PS. I know that there is a package named rpy2 that could be used to call R's quantile normalization from Python, but I am looking for a pure-Python solution.
The code below gives results identical to preprocessCore::normalize.quantiles.use.target, and I find it simpler and clearer than the solutions above. Performance should also be good, even for very long arrays.
import numpy as np

def quantile_normalize_using_target(x, target):
    """
    Both `x` and `target` are numpy arrays of equal lengths.
    """
    target_sorted = np.sort(target)
    return target_sorted[x.argsort().argsort()]
Once you have a pandas.DataFrame, this is easy to do:

quantile_normalize_using_target(df[0].to_numpy(),
                                df[1].to_numpy())
(Normalizing the first column to the second one as a reference distribution in the example above.)
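For instance, here is a small self-contained check (the function definition is repeated so the snippet runs on its own; the input arrays are made up for illustration):

```python
import numpy as np

def quantile_normalize_using_target(x, target):
    """Map the values of `x` onto the distribution of `target` by rank."""
    target_sorted = np.sort(target)
    return target_sorted[x.argsort().argsort()]

x = np.array([5, 2, 3, 4])
target = np.array([3, 4, 6, 8])
# Each value of x is replaced by the target value of the same rank:
# 5 is the largest in x, so it becomes 8, the largest in target, etc.
print(quantile_normalize_using_target(x, target))  # [8 3 4 6]
```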
Using the example dataset from the Wikipedia article:
df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
df
Out:
C1 C2 C3
A 5 4 3
B 2 1 4
C 3 4 6
D 4 2 8
For each rank, the mean value can be calculated with the following:
rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
rank_mean
Out:
1 2.000000
2 3.000000
3 4.666667
4 5.666667
dtype: float64
Then the resulting Series, rank_mean, can be used as a mapping from the ranks to get the normalized results:
df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out:
C1 C2 C3
A 5.666667 4.666667 2.000000
B 2.000000 2.000000 3.000000
C 3.000000 4.666667 4.666667
D 4.666667 3.000000 5.666667
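The two steps above can be combined into a single helper (a sketch of my own; the function name is mine, and ties take the smaller rank's mean, as in the output above):

```python
import pandas as pd

def quantile_normalize(df):
    # Step 1: mean value for each rank, breaking ties by position
    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    # Step 2: map each cell's (minimum) rank to the corresponding rank mean
    return df.rank(method='min').stack().astype(int).map(rank_mean).unstack()

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
print(quantile_normalize(df))
```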
One thing worth noticing is that both ayhan's and Shawn's code use the smaller rank's mean for ties, but if you use the R package preprocessCore's normalize.quantiles(), it uses the mean of the tied rank means.
Using the above example:
> df
C1 C2 C3
A 5 4 3
B 2 1 4
C 3 4 6
D 4 2 8
> normalize.quantiles(as.matrix(df))
C1 C2 C3
A 5.666667 5.166667 2.000000
B 2.000000 2.000000 3.000000
C 3.000000 5.166667 4.666667
D 4.666667 3.000000 5.666667
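That tie handling can be reproduced in pandas with the following sketch (my own code, not preprocessCore's; the function name is mine). Each entry gets the mean of the rank means over all ranks its ties span, computed with prefix sums:

```python
import numpy as np
import pandas as pd

def quantile_normalize_avg_ties(df):
    # Mean value at each rank, breaking ties by position
    rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
    means = rank_mean.sort_index().to_numpy()         # rank r -> means[r - 1]
    csum = np.concatenate(([0.0], np.cumsum(means)))  # prefix sums of rank means
    lo = df.rank(method='min').astype(int).to_numpy()
    hi = df.rank(method='max').astype(int).to_numpy()
    # Mean of means[lo-1 .. hi-1], i.e. the average over the tied rank span
    vals = (csum[hi] - csum[lo - 1]) / (hi - lo + 1)
    return pd.DataFrame(vals, index=df.index, columns=df.columns)
```

On the example DataFrame this gives 5.166667 for the tied values A and C in column C2, matching the normalize.quantiles() output above.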
It is possibly more robust to use the median of each row rather than the mean (based on code from Shawn. L):
import numpy as np
import pandas as pd

def quantileNormalize(df_input):
    df = df_input.copy()
    # compute rank: sort each column, then take the row-wise median
    dic = {}
    for col in df:
        dic[col] = df[col].sort_values(na_position='first').values
    sorted_df = pd.DataFrame(dic)
    # rank = sorted_df.mean(axis=1).tolist()
    rank = sorted_df.median(axis=1).tolist()
    # sort
    for col in df:
        # compute percentile rank [0, 1] for each score in the column
        t = df[col].rank(pct=True, method='max').values
        # replace each value with the quantile-normalized score at its
        # percentile, leaving NaNs in place
        df[col] = [np.nanpercentile(rank, i * 100) if not np.isnan(i) else np.nan for i in t]
    return df
As pointed out by @msg, none of the solutions here take ties into account. I made a python package called qnorm which handles ties, and correctly recreates the Wikipedia quantile normalization example:
import pandas as pd
import qnorm
df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
print(qnorm.quantile_normalize(df))
C1 C2 C3
A 5.666667 5.166667 2.000000
B 2.000000 2.000000 3.000000
C 3.000000 5.166667 4.666667
D 4.666667 3.000000 5.666667
Installation can be done with either pip or conda:
pip install qnorm
or
conda config --add channels conda-forge
conda install qnorm
OK, I implemented the method myself with relatively high efficiency.
In hindsight the logic is fairly simple, but I decided to post it here anyway for anyone who feels as confused as I was when I couldn't find working code by googling.
The code is on GitHub: Quantile Normalize
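For reference, the core idea can be sketched in a few lines of NumPy (this is my own sketch, not the code from the linked repository; ties are broken by position rather than averaged):

```python
import numpy as np

def quantile_normalize_np(arr):
    # Rank each column (0-based, ties broken by position, like rank method='first')
    ranks = arr.argsort(axis=0).argsort(axis=0)
    # Average the sorted columns row-wise to get the reference distribution
    row_means = np.sort(arr, axis=0).mean(axis=1)
    # Place each row mean back at the original rank positions
    return row_means[ranks]
```

This is fully vectorized, so it should scale well to millions of rows.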