What is the Big O Complexity of Reversing the Order of Columns in Pandas DataFrame?

后端 未结 3 922
北恋
北恋 2020-12-03 17:30

So lets say I have a DataFrame in pandas with a m rows and n columns. Let\'s also say that I wanted to reverse the order of the columns, which can be done with the following

相关标签:
3条回答
  • 2020-12-03 17:46

    The Big O complexity (as of Pandas 0.24) is m*n where m is the number of columns and n is the number of rows. Note, this is when using the DataFrame.__getitem__ method (aka []) with an Index (see relevant code, with other types that would trigger a copy).

    Here is a helpful stack trace:

     <ipython-input-4-3162cae03863>(2)<module>()
          1 columns = df.columns[::-1]
    ----> 2 df_reversed = df[columns]
    
      pandas/core/frame.py(2682)__getitem__()
       2681             # either boolean or fancy integer index
    -> 2682             return self._getitem_array(key)
       2683         elif isinstance(key, DataFrame):
    
      pandas/core/frame.py(2727)_getitem_array()
       2726             indexer = self.loc._convert_to_indexer(key, axis=1)
    -> 2727             return self._take(indexer, axis=1)
       2728 
    
      pandas/core/generic.py(2789)_take()
       2788                                    axis=self._get_block_manager_axis(axis),
    -> 2789                                    verify=True)
       2790         result = self._constructor(new_data).__finalize__(self)
    
      pandas/core/internals.py(4539)take()
       4538         return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
    -> 4539                                     axis=axis, allow_dups=True)
       4540 
    
      pandas/core/internals.py(4421)reindex_indexer()
       4420             new_blocks = self._slice_take_blocks_ax0(indexer,
    -> 4421                                                      fill_tuple=(fill_value,))
       4422         else:
    
      pandas/core/internals.py(1254)take_nd()
       1253             new_values = algos.take_nd(values, indexer, axis=axis,
    -> 1254                                        allow_fill=False)
       1255         else:
    
    > pandas/core/algorithms.py(1658)take_nd()
       1657     import ipdb; ipdb.set_trace()
    -> 1658     func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
       1659                                  mask_info=mask_info)
       1660     func(arr, indexer, out, fill_value)
    

    The func call on L1660 in pandas/core/algorithms ultimately calls a cython function with O(m * n) complexity. This is where data from the the original data is copied into out. out contains a copy of the original data in reversed order.

        inner_take_2d_axis0_template = """\
        cdef:
            Py_ssize_t i, j, k, n, idx
            %(c_type_out)s fv
    
        n = len(indexer)
        k = values.shape[1]
    
        fv = fill_value
    
        IF %(can_copy)s:
            cdef:
                %(c_type_out)s *v
                %(c_type_out)s *o
    
            #GH3130
            if (values.strides[1] == out.strides[1] and
                values.strides[1] == sizeof(%(c_type_out)s) and
                sizeof(%(c_type_out)s) * n >= 256):
    
                for i from 0 <= i < n:
                    idx = indexer[i]
                    if idx == -1:
                        for j from 0 <= j < k:
                            out[i, j] = fv
                    else:
                        v = &values[idx, 0]
                        o = &out[i, 0]
                        memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
                return
    
        for i from 0 <= i < n:
            idx = indexer[i]
            if idx == -1:
                for j from 0 <= j < k:
                    out[i, j] = fv
            else:
                for j from 0 <= j < k:
                    out[i, j] = %(preval)svalues[idx, j]%(postval)s
    """
    

    Note that in the above template function, there is a path that uses memmove (which is the path taken in this case because we are mapping from int64 to int64 and the dimension of the output is identical as we are just switching the indexes). Note that memmove is still O(n), being proportional to the number of bytes it has to copy, although likely faster than writing to the indexes directly.

    0 讨论(0)
  • 2020-12-03 17:53

    I don't know how Pandas implements this, but I did test it empirically. I ran the following code (in a Jupyter notebook) to test the speed of the operation:

    def get_dummy_df(n):
        return pd.DataFrame({'a': [1,2]*n, 'b': [4,5]*n, 'c': [7,8]*n})
    
    df = get_dummy_df(100)
    print df.shape
    %timeit df_r = df[df.columns[::-1]]
    
    df = get_dummy_df(1000)
    print df.shape
    %timeit df_r = df[df.columns[::-1]]
    
    df = get_dummy_df(10000)
    print df.shape
    %timeit df_r = df[df.columns[::-1]]
    
    df = get_dummy_df(100000)
    print df.shape
    %timeit df_r = df[df.columns[::-1]]
    
    df = get_dummy_df(1000000)
    print df.shape
    %timeit df_r = df[df.columns[::-1]]
    
    df = get_dummy_df(10000000)
    print df.shape
    %timeit df_r = df[df.columns[::-1]]
    

    The output was:

    (200, 3)
    1000 loops, best of 3: 419 µs per loop
    (2000, 3)
    1000 loops, best of 3: 425 µs per loop
    (20000, 3)
    1000 loops, best of 3: 498 µs per loop
    (200000, 3)
    100 loops, best of 3: 2.66 ms per loop
    (2000000, 3)
    10 loops, best of 3: 25.2 ms per loop
    (20000000, 3)
    1 loop, best of 3: 207 ms per loop
    

    As you can see, in the first 3 cases, the overhead of the operation is what takes most of the time (400-500µs), but from the 4th case, the time it takes starts to be proportional to the amount of data, increasing in an order of magnitude each time.

    So, assuming there must also be a proportion to n, it seems that we are dealing with O(m*n)

    0 讨论(0)
  • 2020-12-03 18:04

    I ran an empirical test using big_O fitting library here

    Note: All tests have been conducted on independent variable sweeping 6 orders of magnitude (i.e.

    • rows from 10 to 10^6 vs. constant column size of 3,
    • columns from 10 to 10^6 vs. constant row size of 10

    The result shows that the columns reverse operation .columns[::-1] complexity in the DataFrame is

    1. Cubical: O(n^3) where n is the number of rows
    2. Cubical: O(n^3) where n is the number of columns

    Prerequisites: You will need to install big_o() using terminal command pip install big_o

    Code

    import big_o
    import pandas as pd
    import numpy as np
    
    SWEAP_LOG10 = 6
    COLUMNS = 3
    ROWS = 10
    
    def build_df(rows, columns):
        # To isolated the creation of the DataFrame from the inversion operation.
        narray = np.zeros(rows*columns).reshape(rows, columns)
        df = pd.DataFrame(narray)
        return df
    
    def flip_columns(df):
        return df[df.columns[::-1]]
    
    def get_row_df(n, m=COLUMNS):
        return build_df(1*10**n, m)
    
    def get_column_df(n, m=ROWS):
        return build_df(m, 1*10**n)
    
    
    # infer the big_o on columns[::-1] operation vs. rows
    best, others = big_o.big_o(flip_columns, get_row_df, min_n=1, max_n=SWEAP_LOG10,n_measures=SWEAP_LOG10, n_repeats=10)
    
    # print results
    print('Measuring .columns[::-1] complexity against rapid increase in # rows')
    print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)
    
    for class_, residual in others.items():
        print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
    
    print('-'*80)
    
    # infer the big_o on columns[::-1] operation vs. columns
    best, others = big_o.big_o(flip_columns, get_column_df, min_n=1, max_n=SWEAP_LOG10,n_measures=SWEAP_LOG10, n_repeats=10)
    
    # print results
    print()
    print('Measuring .columns[::-1] complexity against rapid increase in # columns')
    print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)
    
    for class_, residual in others.items():
        print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
        
    print('-'*80)
    

    Results

    Measuring .columns[::-1] complexity against rapid increase in # rows
    --------------------------------------------------------------------------------
    Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
    --------------------------------------------------------------------------------
    Constant: time = 0.032                                        (res: 0.021)
    Linear: time = -0.051 + 0.024*n                               (res: 0.011)
    Quadratic: time = -0.026 + 0.0038*n^2                         (res: 0.0077)
    Cubic: time = -0.017 + 0.00067*n^3                            (res: 0.0052)
    Polynomial: time = -6.3 * x^1.5                               (res: 6)
    Logarithmic: time = -0.026 + 0.053*log(n)                     (res: 0.015)
    Linearithmic: time = -0.024 + 0.012*n*log(n)                  (res: 0.0094)
    Exponential: time = -7 * 0.66^n                               (res: 3.6)
    --------------------------------------------------------------------------------
    
    
    Measuring .columns[::-1] complexity against rapid increase in # columns
    --------------------------------------------------------------------------------
    Big O() fits: Cubic: time = -0.28 + 0.009*n^3
    --------------------------------------------------------------------------------
    Constant: time = 0.38                                         (res: 3.9)
    Linear: time = -0.73 + 0.32*n                                 (res: 2.1)
    Quadratic: time = -0.4 + 0.052*n^2                            (res: 1.5)
    Cubic: time = -0.28 + 0.009*n^3                               (res: 1.1)
    Polynomial: time = -6 * x^2.2                                 (res: 16)
    Logarithmic: time = -0.39 + 0.71*log(n)                       (res: 2.8)
    Linearithmic: time = -0.38 + 0.16*n*log(n)                    (res: 1.8)
    Exponential: time = -7 * 1^n                                  (res: 9.7)
    --------------------------------------------------------------------------------
    
    0 讨论(0)
提交回复
热议问题