Pandas multiply dataframes with multiindex and overlapping index levels

情书的邮戳 2021-01-01 16:28

I'm struggling with a task that should be simple, but it is not working as I thought it would. I have two numeric DataFrames, A and B, with a MultiIndex and the columns below:

3 answers
  • 2021-01-01 17:09

    I'd simply use DataFrame.reindex on the smaller DataFrame so that its index matches the larger one, forward-filling the values it already holds, and then do the multiplication.

    B.multiply(A.reindex(B.index, method='ffill'))             # Or method='pad'
    

    Demo:

    Prepare some data:

    import numpy as np
    import pandas as pd

    np.random.seed(42)
    midx1 = pd.MultiIndex.from_product([['X', 'Y'], [1,2,3]])
    midx2 = pd.MultiIndex.from_product([['X', 'Y'], [1,2,3], ['a','b','c']])
    A = pd.DataFrame(np.random.randint(0,2,(6,4)), midx1, list('ABCD'))
    B = pd.DataFrame(np.random.randint(2,4,(18,4)), midx2, list('ABCD'))
    

    Small DF:

    >>> A
    
         A  B  C  D
    X 1  0  1  0  0
      2  0  1  0  0
      3  0  1  0  0
    Y 1  0  0  1  0
      2  1  1  1  0
      3  1  0  1  1
    

    Big DF:

    >>> B 
    
          A  B  C  D
    X 1 a  3  3  3  3
        b  3  3  2  2
        c  3  3  3  2
      2 a  3  2  2  2
        b  2  2  3  3
        c  3  3  3  2
      3 a  3  3  2  3
        b  2  3  2  3
        c  3  2  2  2
    Y 1 a  2  2  2  2
        b  2  3  3  2
        c  3  3  3  3
      2 a  2  3  2  3
        b  3  3  2  3
        c  2  3  2  3
      3 a  2  2  3  2
        b  3  3  3  3
        c  3  3  3  3
    

    Multiplying them after making sure both share a common index axis across all levels:

    >>> B.multiply(A.reindex(B.index, method='ffill'))
    
           A  B  C  D
    X 1 a  0  3  0  0
        b  0  3  0  0
        c  0  3  0  0
      2 a  0  2  0  0
        b  0  2  0  0
        c  0  3  0  0
      3 a  0  3  0  0
        b  0  3  0  0
        c  0  2  0  0
    Y 1 a  0  0  2  0
        b  0  0  3  0
        c  0  0  3  0
      2 a  2  3  2  0
        b  3  3  2  0
        c  2  3  2  0
      3 a  2  0  3  2
        b  3  0  3  3
        c  3  0  3  3
    

    Now you can even supply the level parameter to DataFrame.multiply so that broadcasting occurs on the matching index levels.

  • 2021-01-01 17:27

    Proposed approach

    Since this is a broadcasting problem, I would like to bring in NumPy's broadcasting here.

    The solution code would look something like this -

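    def numpy_broadcasting(df0, df1):
        # Assumes df1's index is the full, sorted product of its three levels
        # and df0's index is the product of the first two.
        m, n, r = map(len, df1.index.levels)
        a0 = df0.values.reshape(m, n, -1)      # shape (m, n, cols)
        a1 = df1.values.reshape(m, n, r, -1)   # shape (m, n, r, cols)
        # Insert a length-1 axis into a0 so it broadcasts along r
        out = (a1 * a0[..., None, :]).reshape(-1, a1.shape[-1])
        return pd.DataFrame(out, index=df1.index, columns=df1.columns)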
    

    Basic idea :

    1) View each dataframe's values as a multidimensional array whose axes follow the level structure of its MultiIndex. Counting the columns as one axis, the first dataframe has three axes and the second has four, so a0 and a1 (built from df0 and df1) have 3 and 4 dimensions respectively.

    2) Now comes the broadcasting part. We extend a0 to 4 dimensions by introducing a new axis at the third position. This new axis lines up with the third axis of a1, which allows element-wise multiplication.

    3) Finally, to get the output MultiIndex dataframe, we reshape the product back to two dimensions.
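
    The three steps can be traced shape by shape. The following is a minimal sketch on made-up data (the seed, the index labels and the names a0_expanded and product_4d are assumptions for illustration); it relies on both indexes being full, sorted products of their levels, just as the function above does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 2 x 3 outer index, 3 inner labels, 4 columns.
rng = np.random.default_rng(0)
idx0 = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]])
idx1 = pd.MultiIndex.from_product([[0, 1], [0, 1, 2], [0, 1, 2]])
df0 = pd.DataFrame(rng.integers(0, 9, (6, 4)), index=idx0, columns=list("ABCD"))
df1 = pd.DataFrame(rng.integers(0, 9, (18, 4)), index=idx1, columns=list("ABCD"))

m, n, r = (len(level) for level in df1.index.levels)

a0 = df0.to_numpy().reshape(m, n, -1)       # step 1: shape (2, 3, 4)
a1 = df1.to_numpy().reshape(m, n, r, -1)    # step 1: shape (2, 3, 3, 4)
a0_expanded = a0[..., None, :]              # step 2: shape (2, 3, 1, 4)
product_4d = a1 * a0_expanded               # broadcasts to (2, 3, 3, 4)
out = product_4d.reshape(-1, a1.shape[-1])  # step 3: shape (18, 4)

df_out = pd.DataFrame(out, index=df1.index, columns=df1.columns)

print(a0.shape, a0_expanded.shape, a1.shape, out.shape)
```

    The length-1 axis in a0_expanded is what lets each row of df0 be reused for every innermost label of df1 without copying the data first.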

    Sample run :

    1) Input dataframes -

    In [369]: df0
    Out[369]: 
         A  B  C  D
    0 0  3  2  2  3
      1  6  8  1  0
      2  3  5  1  5
    1 0  7  0  3  1
      1  7  0  4  6
      2  2  0  5  0
    
    In [370]: df1
    Out[370]: 
           A  B  C  D
    0 0 0  4  6  1  2
        1  3  3  4  5
        2  8  1  7  4
      1 0  7  2  5  4
        1  8  6  7  5
        2  0  4  7  1
      2 0  1  4  2  2
        1  2  3  8  1
        2  0  0  5  7
    1 0 0  8  6  1  7
        1  0  6  1  4
        2  5  4  7  4
      1 0  4  7  0  1
        1  4  2  6  8
        2  3  1  0  6
      2 0  8  4  7  4
        1  0  6  2  0
        2  7  8  6  1
    

    2) Output dataframe -

    In [371]: df_out
    Out[371]: 
            A   B   C   D
    0 0 0  12  12   2   6
        1   9   6   8  15
        2  24   2  14  12
      1 0  42  16   5   0
        1  48  48   7   0
        2   0  32   7   0
      2 0   3  20   2  10
        1   6  15   8   5
        2   0   0   5  35
    1 0 0  56   0   3   7
        1   0   0   3   4
        2  35   0  21   4
      1 0  28   0   0   6
        1  28   0  24  48
        2  21   0   0  36
      2 0  16   0  35   0
        1   0   0  10   0
        2  14   0  30   0
    

    Benchmarking

    In [31]: # Setup input dataframes of the same shape as stated in the question
        ...: from itertools import product
        ...: import numpy as np
        ...: import pandas as pd
        ...: individuals = list(range(2))
        ...: time = (0, 1, 2)
        ...: index = pd.MultiIndex.from_tuples(list(product(individuals, time)))
        ...: A = pd.DataFrame(data={'A': np.random.randint(0,9,6), \
        ...:                          'B': np.random.randint(0,9,6), \
        ...:                          'C': np.random.randint(0,9,6), \
        ...:                          'D': np.random.randint(0,9,6)
        ...:                          }, index=index)
        ...: 
        ...: 
        ...: individuals = list(range(2))
        ...: time = (0, 1, 2)
        ...: P = (0,1,2)
        ...: index = pd.MultiIndex.from_tuples(list(product(individuals, time, P)))
        ...: B = pd.DataFrame(data={'A': np.random.randint(0,9,18), \
        ...:                          'B': np.random.randint(0,9,18), \
        ...:                          'C': np.random.randint(0,9,18), \
        ...:                          'D': np.random.randint(0,9,18)}, index=index)
        ...: 
    
    # @DSM's solution
    In [32]: %timeit B * A.loc[B.index.droplevel(2)].set_index(B.index)
    1 loops, best of 3: 8.75 ms per loop
    
    # @Nickil Maveli's solution
    In [33]: %timeit B.multiply(A.reindex(B.index, method='ffill'))
    1000 loops, best of 3: 625 µs per loop
    
    # @root's solution
    In [34]: %timeit B * np.repeat(A.values, 3, axis=0)
    1000 loops, best of 3: 487 µs per loop
    
    In [35]: %timeit numpy_broadcasting(A, B)
    1000 loops, best of 3: 191 µs per loop
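
    Timings are only comparable if the expressions return the same values. Below is a small sketch, on hypothetical random data of the same shape, that checks three of the timed expressions against each other; the seed and the names results, reference and all_equal are assumptions for illustration.

```python
from itertools import product

import numpy as np
import pandas as pd


def numpy_broadcasting(df0, df1):
    # Same function as above, repeated so the sketch is self-contained.
    m, n, r = map(len, df1.index.levels)
    a0 = df0.values.reshape(m, n, -1)
    a1 = df1.values.reshape(m, n, r, -1)
    out = (a1 * a0[..., None, :]).reshape(-1, a1.shape[-1])
    return pd.DataFrame(out, index=df1.index, columns=df1.columns)


rng = np.random.default_rng(0)  # hypothetical seed
cols = list("ABCD")
idx_a = pd.MultiIndex.from_tuples(list(product(range(2), range(3))))
idx_b = pd.MultiIndex.from_tuples(list(product(range(2), range(3), range(3))))
A = pd.DataFrame(rng.integers(0, 9, (6, 4)), index=idx_a, columns=cols)
B = pd.DataFrame(rng.integers(0, 9, (18, 4)), index=idx_b, columns=cols)

results = {
    "loc + set_index": B * A.loc[B.index.droplevel(2)].set_index(B.index),
    "np.repeat": B * np.repeat(A.values, 3, axis=0),
    "numpy_broadcasting": numpy_broadcasting(A, B),
}

# Every result should carry B's index and hold identical values.
reference = results["np.repeat"].to_numpy()
all_equal = all(
    res.index.equals(B.index) and np.array_equal(res.to_numpy(), reference)
    for res in results.values()
)
print(all_equal)
```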
    
  • 2021-01-01 17:32

    Note that I am not claiming this is the right way to do this operation, only that it's one way to do it. I've had issues figuring out the right broadcast pattern in the past myself. :-/

    The short version is that I wind up doing the broadcasting manually, and creating an appropriately-aligned intermediate object:

    In [145]: R = B * A.loc[B.index.droplevel(2)].set_index(B.index)
    
    In [146]: A.loc[("X", 2), "C"]
    Out[146]: 0.5294149302910357
    
    In [147]: A.loc[("X", 2), "C"] * B.loc[("X", 2, "c"), "C"]
    Out[147]: 0.054262618238601339
    
    In [148]: R.loc[("X", 2, "c"), "C"]
    Out[148]: 0.054262618238601339
    

    This works by indexing into A using the matching parts of B, and then setting the index to match. If I were more clever I'd be able to figure out a native way to get this to work but I haven't yet. :-(
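
    Broken into separate steps, the one-liner reads as follows. This is a sketch on made-up float data with the same index layout; the seed and the intermediate names keys, A_repeated and A_aligned are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical float data with the same index layout as in the answers.
rng = np.random.default_rng(1)
midx_a = pd.MultiIndex.from_product([["X", "Y"], [1, 2, 3]])
midx_b = pd.MultiIndex.from_product([["X", "Y"], [1, 2, 3], ["a", "b", "c"]])
A = pd.DataFrame(rng.random((6, 4)), index=midx_a, columns=list("ABCD"))
B = pd.DataFrame(rng.random((18, 4)), index=midx_b, columns=list("ABCD"))

# 1. Drop B's innermost level: one (outer, middle) key per row of B.
keys = B.index.droplevel(2)

# 2. Look those keys up in A; each row of A is repeated once per inner label.
A_repeated = A.loc[keys]

# 3. Relabel the repeated rows with B's full index so both frames align.
A_aligned = A_repeated.set_index(B.index)

# 4. Ordinary element-wise multiplication now lines up row for row.
R = B * A_aligned

# Spot check one cell against the scalar product.
expected = A.loc[("X", 2), "C"] * B.loc[("X", 2, "c"), "C"]
print(np.isclose(R.loc[("X", 2, "c"), "C"], expected))
```

    Unlike the reshape-based approach, the lookup in step 2 goes by label, so it does not depend on B's rows being in any particular order.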
