Performant cartesian product (CROSS JOIN) with pandas

后端 未结 3 1217
一向
一向 2020-11-22 04:23

The contents of this post were originally meant to be a part of Pandas Merging 101, but due to the nature and size of the content required to fully do j

3条回答
  •  失恋的感觉
    2020-11-22 04:56

    Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

    def cartesian_product_basic(left, right):
        return (
           left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))
    
    cartesian_product_basic(left, right)
    
      col1_x  col2_x col1_y  col2_y
    0      A       1      X      20
    1      A       1      Y      30
    2      A       1      Z      50
    3      B       2      X      20
    4      B       2      Y      30
    5      B       2      Z      50
    6      C       3      X      20
    7      C       3      Y      30
    8      C       3      Z      50
    

    How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

    While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

    A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

    def cartesian_product(*arrays):
        la = len(arrays)
        dtype = np.result_type(*arrays)
        arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
        for i, a in enumerate(np.ix_(*arrays)):
            arr[...,i] = a
        return arr.reshape(-1, la)  
    

    Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

    Disclaimer
    These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!

    This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and

    def cartesian_product_generalized(left, right):
        la, lb = len(left), len(right)
        idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
        return pd.DataFrame(
            np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))
    
    cartesian_product_generalized(left, right)
    
       0  1  2   3
    0  A  1  X  20
    1  A  1  Y  30
    2  A  1  Z  50
    3  B  2  X  20
    4  B  2  Y  30
    5  B  2  Z  50
    6  C  3  X  20
    7  C  3  Y  30
    8  C  3  Z  50
    
    np.array_equal(cartesian_product_generalized(left, right),
                   cartesian_product_basic(left, right))
    True
    

    And, along similar lines,

    left2 = left.copy()
    left2.index = ['s1', 's2', 's1']
    
    right2 = right.copy()
    right2.index = ['x', 'y', 'y']
    
    
    left2
       col1  col2
    s1    A     1
    s2    B     2
    s1    C     3
    
    right2
      col1  col2
    x    X    20
    y    Y    30
    y    Z    50
    
    np.array_equal(cartesian_product_generalized(left, right),
                   cartesian_product_basic(left2, right2))
    True
    

    This solution can generalise to multiple DataFrames. For example,

    def cartesian_product_multi(*dfs):
        idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
        return pd.DataFrame(
            np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))
    
    cartesian_product_multi(*[left, right, left]).head()
    
       0  1  2   3  4  5
    0  A  1  X  20  A  1
    1  A  1  X  20  B  2
    2  A  1  X  20  C  3
    3  A  1  X  20  D  4
    4  A  1  Y  30  A  1
    

    Further Simplification

    A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

    def cartesian_product_simplified(left, right):
        la, lb = len(left), len(right)
        ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    
        return pd.DataFrame(
            np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
    
    np.array_equal(cartesian_product_simplified(left, right),
                   cartesian_product_basic(left2, right2))
    True
    

    Performance Comparison

    Benchmarking these solutions on some contrived DataFrames with unique indices, we have

    Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

    Performance Benchmarking Code
    This is the timing script. All functions called here are defined above.

    from timeit import timeit
    import pandas as pd
    import matplotlib.pyplot as plt
    
    res = pd.DataFrame(
           index=['cartesian_product_basic', 'cartesian_product_generalized', 
                  'cartesian_product_multi', 'cartesian_product_simplified'],
           columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
           dtype=float
    )
    
    for f in res.index: 
        for c in res.columns:
            # print(f,c)
            left2 = pd.concat([left] * c, ignore_index=True)
            right2 = pd.concat([right] * c, ignore_index=True)
            stmt = '{}(left2, right2)'.format(f)
            setp = 'from __main__ import left2, right2, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=5)
    
    ax = res.div(res.min()).T.plot(loglog=True) 
    ax.set_xlabel("N"); 
    ax.set_ylabel("time (relative)");
    
    plt.show()
    

提交回复
热议问题