问题

I work with significantly sized (48K rows, up to tens of columns) DataFrames. At a certain point in their manipulation, I need to do pair-wise subtractions of column values and I was wondering if there is a more efficient way to do so rather than the one I'm doing (see below).

My current code:

 # Matrix is the pandas DataFrame containing all the data
 comparison_df = pandas.DataFrame(index=matrix.index)
 combinations = itertools.product(group1, group2)

 for observed, reference in combinations:

     observed_data = matrix[observed]
     reference_data = matrix[reference]

     comparison = observed_data - reference_data
     name = observed + "_" + reference
     comparison_df[name] = comparison

Since the data can be large (I'm using this piece of code also during a permutation test), I'm interested in knowing if it can be optimized a bit.

EDIT: As requested, here's a sample of a typical data set

ID                    A1      A2       A3       B1       B2       B3
Ku8QhfS0n_hIOABXuE    6.343   6.304    6.410    6.287    6.403    6.279
fqPEquJRRlSVSfL.8A    6.752   6.681    6.680    6.677    6.525    6.739
ckiehnugOno9d7vf1Q    6.297   6.248    6.524    6.382    6.316    6.453
x57Vw5B5Fbt5JUnQkI    6.268   6.451    6.379    6.371    6.458    6.333

And a typical result would be, if the "A" group is group1 and "B" group2, for each ID row, to have for each column a pair (e.g., A1_B1, A2_B1, A3_B1...) corresponding to the pairings generated above, containing the subtraction for each row ID.

回答1:

Using itertools.combinations() on `DataFrame` columns

You can create combinations of columns with itertools.combinations() and evaluate subtractions along with new names based on these pairs:

import pandas as pd
from cStringIO import StringIO
import itertools as iter

matrix = pd.read_csv(StringIO('''ID,A1,A2,A3,B1,B2,B3
Ku8QhfS0n_hIOABXuE,6.343,6.304,6.410,6.287,6.403,6.279
fqPEquJRRlSVSfL.8A,6.752,6.681,6.680,6.677,6.525,6.739
ckiehnugOno9d7vf1Q,6.297,6.248,6.524,6.382,6.316,6.453
x57Vw5B5Fbt5JUnQkI,6.268,6.451,6.379,6.371,6.458,6.333''')).set_index('ID')

print 'Original DataFrame:'
print matrix
print

# Create DataFrame to fill with combinations
comparison_df = pd.DataFrame(index=matrix.index)
# Create combinations of columns
for a, b in iter.combinations(matrix.columns, 2):
    # Subtract column combinations
    comparison_df['{}_{}'.format(a, b)] = matrix[a] - matrix[b]

print 'Combination DataFrame:'
print comparison_df

Original DataFrame:
                       A1     A2     A3     B1     B2     B3
ID                                                          
Ku8QhfS0n_hIOABXuE  6.343  6.304  6.410  6.287  6.403  6.279
fqPEquJRRlSVSfL.8A  6.752  6.681  6.680  6.677  6.525  6.739
ckiehnugOno9d7vf1Q  6.297  6.248  6.524  6.382  6.316  6.453
x57Vw5B5Fbt5JUnQkI  6.268  6.451  6.379  6.371  6.458  6.333

Combination DataFrame:
                    A1_A2  A1_A3  A1_B1  A1_B2  A1_B3  A2_A3  A2_B1  A2_B2  A2_B3  A3_B1  A3_B2  A3_B3  B1_B2  B1_B3  B2_B3
ID                                                                                                                         
Ku8QhfS0n_hIOABXuE  0.039 -0.067  0.056 -0.060  0.064 -0.106  0.017 -0.099  0.025  0.123  0.007  0.131 -0.116  0.008  0.124
fqPEquJRRlSVSfL.8A  0.071  0.072  0.075  0.227  0.013  0.001  0.004  0.156 -0.058  0.003  0.155 -0.059  0.152 -0.062 -0.214
ckiehnugOno9d7vf1Q  0.049 -0.227 -0.085 -0.019 -0.156 -0.276 -0.134 -0.068 -0.205  0.142  0.208  0.071  0.066 -0.071 -0.137
x57Vw5B5Fbt5JUnQkI -0.183 -0.111 -0.103 -0.190 -0.065  0.072  0.080 -0.007  0.118  0.008 -0.079  0.046 -0.087  0.038  0.125

来源：https://stackoverflow.com/questions/13143050/calculating-subtractions-of-pairs-of-columns-in-pandas-dataframe

标签

python

pandas

data-analysis

Calculating subtractions of pairs of columns in pandas DataFrame

问题

回答1:

Using itertools.combinations() on DataFrame columns

Using itertools.combinations() on `DataFrame` columns