What is the Big O Complexity of Reversing the Order of Columns in Pandas DataFrame?

北恋 2020-12-03 17:30

So let's say I have a DataFrame in pandas with m rows and n columns. Let's also say that I want to reverse the order of the columns, which can be done with the following:
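
    # (reversal snippet as used in the answers below)
    df_reversed = df[df.columns[::-1]]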

3 Answers
  • 2020-12-03 17:46

    The Big O complexity (as of Pandas 0.24) is O(m * n), where m is the number of rows and n is the number of columns, i.e. proportional to the total number of elements. Note that this is when using the DataFrame.__getitem__ method (aka []) with an Index (see the relevant code; other key types would trigger a copy as well).
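
    A quick way to see that this really produces a full copy rather than a view is to mutate the original frame and check the reversed one. This is only a minimal sanity-check sketch; exact copy/view semantics can differ a bit between pandas versions and copy-on-write settings.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('abcd'))
    df_reversed = df[df.columns[::-1]]

    # Mutate the original; the reversed frame keeps the old value,
    # consistent with all m * n values having been copied.
    df.iloc[0, 0] = 999
    print(df.loc[0, 'a'])           # 999
    print(df_reversed.loc[0, 'a'])  # still 0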

    Here is a helpful stack trace:

     <ipython-input-4-3162cae03863>(2)<module>()
          1 columns = df.columns[::-1]
    ----> 2 df_reversed = df[columns]
    
      pandas/core/frame.py(2682)__getitem__()
       2681             # either boolean or fancy integer index
    -> 2682             return self._getitem_array(key)
       2683         elif isinstance(key, DataFrame):
    
      pandas/core/frame.py(2727)_getitem_array()
       2726             indexer = self.loc._convert_to_indexer(key, axis=1)
    -> 2727             return self._take(indexer, axis=1)
       2728 
    
      pandas/core/generic.py(2789)_take()
       2788                                    axis=self._get_block_manager_axis(axis),
    -> 2789                                    verify=True)
       2790         result = self._constructor(new_data).__finalize__(self)
    
      pandas/core/internals.py(4539)take()
       4538         return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
    -> 4539                                     axis=axis, allow_dups=True)
       4540 
    
      pandas/core/internals.py(4421)reindex_indexer()
       4420             new_blocks = self._slice_take_blocks_ax0(indexer,
    -> 4421                                                      fill_tuple=(fill_value,))
       4422         else:
    
      pandas/core/internals.py(1254)take_nd()
       1253             new_values = algos.take_nd(values, indexer, axis=axis,
    -> 1254                                        allow_fill=False)
       1255         else:
    
    > pandas/core/algorithms.py(1658)take_nd()
       1657     import ipdb; ipdb.set_trace()
    -> 1658     func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
       1659                                  mask_info=mask_info)
       1660     func(arr, indexer, out, fill_value)
    

    The func call on L1660 in pandas/core/algorithms ultimately calls a cython function with O(m * n) complexity. This is where the data from the original DataFrame is copied into out, which ends up holding a copy of the original data with the columns in reversed order.

        inner_take_2d_axis0_template = """\
        cdef:
            Py_ssize_t i, j, k, n, idx
            %(c_type_out)s fv
    
        n = len(indexer)
        k = values.shape[1]
    
        fv = fill_value
    
        IF %(can_copy)s:
            cdef:
                %(c_type_out)s *v
                %(c_type_out)s *o
    
            #GH3130
            if (values.strides[1] == out.strides[1] and
                values.strides[1] == sizeof(%(c_type_out)s) and
                sizeof(%(c_type_out)s) * n >= 256):
    
                for i from 0 <= i < n:
                    idx = indexer[i]
                    if idx == -1:
                        for j from 0 <= j < k:
                            out[i, j] = fv
                    else:
                        v = &values[idx, 0]
                        o = &out[i, 0]
                        memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
                return
    
        for i from 0 <= i < n:
            idx = indexer[i]
            if idx == -1:
                for j from 0 <= j < k:
                    out[i, j] = fv
            else:
                for j from 0 <= j < k:
                    out[i, j] = %(preval)svalues[idx, j]%(postval)s
    """
    

    Note that in the above template function there is a path that uses memmove (this is the path taken here, because we are mapping from int64 to int64 and the output has the same dimensions as the input; we are only permuting the indexes). memmove is still linear in the number of bytes it has to copy, so the overall cost remains O(m * n), although it is likely faster than writing out each element individually.
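
    To make the copying explicit outside of Cython, here is a rough pure-NumPy sketch of what the take boils down to for this case (an illustration only, not the actual pandas code path): every one of the m * n values is read once and written once into the new array.

    import numpy as np

    def take_columns(values, indexer):
        # Copy every element into a fresh array, with the columns
        # permuted according to `indexer` -- O(m * n) work overall.
        out = np.empty_like(values)
        for i, idx in enumerate(indexer):
            out[:, i] = values[:, idx]  # copy all m rows of one column
        return out

    values = np.arange(12).reshape(3, 4)
    print(take_columns(values, np.arange(values.shape[1])[::-1]))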

  • 2020-12-03 17:53

    I don't know how Pandas implements this, but I did test it empirically. I ran the following code (in a Jupyter notebook) to test the speed of the operation:

    import pandas as pd

    def get_dummy_df(n):
        # 2*n rows and 3 columns (hence the (200, 3), (2000, 3), ... shapes below)
        return pd.DataFrame({'a': [1, 2] * n, 'b': [4, 5] * n, 'c': [7, 8] * n})

    df = get_dummy_df(100)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(1000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(10000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(100000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(1000000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(10000000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]
    

    The output was:

    (200, 3)
    1000 loops, best of 3: 419 µs per loop
    (2000, 3)
    1000 loops, best of 3: 425 µs per loop
    (20000, 3)
    1000 loops, best of 3: 498 µs per loop
    (200000, 3)
    100 loops, best of 3: 2.66 ms per loop
    (2000000, 3)
    10 loops, best of 3: 25.2 ms per loop
    (20000000, 3)
    1 loop, best of 3: 207 ms per loop
    

    As you can see, in the first 3 cases the overhead of the operation is what takes most of the time (400-500 µs), but from the 4th case onward the time starts to be proportional to the amount of data, increasing by roughly an order of magnitude each time the row count does.

    So, assuming the cost is also proportional to the number of columns, it seems that we are dealing with O(m*n).
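
    To check that assumption, the same kind of sweep can be run over the number of columns with the row count held fixed. This is an untested sketch using the same pattern (with timeit instead of the %timeit magic):

    import timeit
    import numpy as np
    import pandas as pd

    def get_wide_df(n_cols, n_rows=100):
        # n_rows rows of zeros spread across n_cols columns
        return pd.DataFrame(np.zeros((n_rows, n_cols)))

    for n_cols in (10, 100, 1000, 10000):
        df = get_wide_df(n_cols)
        t = timeit.timeit(lambda: df[df.columns[::-1]], number=10) / 10
        print(df.shape, '%.3f ms per loop' % (t * 1000))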

  • 2020-12-03 18:04

    I ran an empirical test using the big_O fitting library.

    Note: all tests sweep the independent variable across 6 orders of magnitude, i.e.

    • rows from 10 to 10^6, with the column count held constant at 3
    • columns from 10 to 10^6, with the row count held constant at 10

    The result suggests that the complexity of the column-reverse operation df[df.columns[::-1]] on the DataFrame is

    1. Cubic: O(n^3), where n is the number of rows
    2. Cubic: O(n^3), where n is the number of columns

    Prerequisites: you will need to install big_o, e.g. with the terminal command pip install big_o

    Code

    import big_o
    import pandas as pd
    import numpy as np
    
    SWEAP_LOG10 = 6
    COLUMNS = 3
    ROWS = 10
    
    def build_df(rows, columns):
        # Isolate the creation of the DataFrame from the inversion operation.
        narray = np.zeros(rows * columns).reshape(rows, columns)
        df = pd.DataFrame(narray)
        return df
    
    def flip_columns(df):
        return df[df.columns[::-1]]
    
    def get_row_df(n, m=COLUMNS):
        return build_df(1*10**n, m)
    
    def get_column_df(n, m=ROWS):
        return build_df(m, 1*10**n)
    
    
    # infer the big_o on columns[::-1] operation vs. rows
    best, others = big_o.big_o(flip_columns, get_row_df, min_n=1, max_n=SWEAP_LOG10,
                               n_measures=SWEAP_LOG10, n_repeats=10)
    
    # print results
    print('Measuring .columns[::-1] complexity against rapid increase in # rows')
    print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)
    
    for class_, residual in others.items():
        print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
    
    print('-'*80)
    
    # infer the big_o on columns[::-1] operation vs. columns
    best, others = big_o.big_o(flip_columns, get_column_df, min_n=1, max_n=SWEAP_LOG10,
                               n_measures=SWEAP_LOG10, n_repeats=10)
    
    # print results
    print()
    print('Measuring .columns[::-1] complexity against rapid increase in # columns')
    print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)
    
    for class_, residual in others.items():
        print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
        
    print('-'*80)
    

    Results

    Measuring .columns[::-1] complexity against rapid increase in # rows
    --------------------------------------------------------------------------------
    Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
    --------------------------------------------------------------------------------
    Constant: time = 0.032                                        (res: 0.021)
    Linear: time = -0.051 + 0.024*n                               (res: 0.011)
    Quadratic: time = -0.026 + 0.0038*n^2                         (res: 0.0077)
    Cubic: time = -0.017 + 0.00067*n^3                            (res: 0.0052)
    Polynomial: time = -6.3 * x^1.5                               (res: 6)
    Logarithmic: time = -0.026 + 0.053*log(n)                     (res: 0.015)
    Linearithmic: time = -0.024 + 0.012*n*log(n)                  (res: 0.0094)
    Exponential: time = -7 * 0.66^n                               (res: 3.6)
    --------------------------------------------------------------------------------
    
    
    Measuring .columns[::-1] complexity against rapid increase in # columns
    --------------------------------------------------------------------------------
    Big O() fits: Cubic: time = -0.28 + 0.009*n^3
    --------------------------------------------------------------------------------
    Constant: time = 0.38                                         (res: 3.9)
    Linear: time = -0.73 + 0.32*n                                 (res: 2.1)
    Quadratic: time = -0.4 + 0.052*n^2                            (res: 1.5)
    Cubic: time = -0.28 + 0.009*n^3                               (res: 1.1)
    Polynomial: time = -6 * x^2.2                                 (res: 16)
    Logarithmic: time = -0.39 + 0.71*log(n)                       (res: 2.8)
    Linearithmic: time = -0.38 + 0.16*n*log(n)                    (res: 1.8)
    Exponential: time = -7 * 1^n                                  (res: 9.7)
    --------------------------------------------------------------------------------
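
    One caveat when reading these fits: big_o passes the swept value n straight to the data generator, and get_row_df / get_column_df turn that n into 10**n rows or columns, so the fitted n above is the sweep exponent rather than the actual size. If you want the fit expressed in terms of the actual row count, a variant like the following (untested sketch, get_row_df_linear is a hypothetical helper) sweeps the row count directly:

    import big_o
    import numpy as np
    import pandas as pd

    COLUMNS = 3

    def flip_columns(df):
        return df[df.columns[::-1]]

    def get_row_df_linear(n, m=COLUMNS):
        # here n is the actual row count, not an exponent
        return pd.DataFrame(np.zeros((n, m)))

    best, others = big_o.big_o(flip_columns, get_row_df_linear,
                               min_n=10, max_n=10**6, n_measures=10, n_repeats=3)
    print(best)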
    