How to create lazy_evaluated dataframe columns in Pandas

逝去的感伤 2021-02-05 11:16

I often have a large dataframe df that holds the basic data, and I need to create many more columns holding derived data computed from the basic columns.

2 Answers
  • 2021-02-05 11:49

    You could subclass DataFrame, and add the column as a property. For example,

    import pandas as pd
    
    class LazyFrame(pd.DataFrame):
        @property
        def derivative_col1(self):
            self['derivative_col1'] = result = self['basic_col1'] + self['basic_col2']
            return result
    
    x = LazyFrame({'basic_col1':[1,2,3],
                   'basic_col2':[4,5,6]})
    print(x)
    #    basic_col1  basic_col2
    # 0           1           4
    # 1           2           5
    # 2           3           6
    

    Accessing the property (via x.derivative_col1, below) calls the derivative_col1 function defined in LazyFrame. This function computes the result and adds the derived column to the LazyFrame instance:

    print(x.derivative_col1)
    # 0    5
    # 1    7
    # 2    9
    
    print(x)
    #    basic_col1  basic_col2  derivative_col1
    # 0           1           4                5
    # 1           2           5                7
    # 2           3           6                9
    

    Note that if you modify a basic column:

    x['basic_col1'] *= 10
    

    the derived column is not automatically updated:

    print(x['derivative_col1'])
    # 0    5
    # 1    7
    # 2    9
    

    But if you access the property, the values are recomputed:

    print(x.derivative_col1)
    # 0    14
    # 1    25
    # 2    36
    
    print(x)
    #    basic_col1  basic_col2  derivative_col1
    # 0          10           4               14
    # 1          20           5               25
    # 2          30           6               36
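
    When there are many derived columns, writing one property per column by hand gets repetitive. The following is a minimal sketch, not part of the original answer, of a small factory that builds such properties. The names lazy_column, total, and product are hypothetical and chosen only for illustration.

```python
import pandas as pd

def lazy_column(name, func):
    """Build a property that computes func(frame), stores it as column `name`, and returns it."""
    def getter(self):
        # Same pattern as derivative_col1 above: assign the column, then return it.
        self[name] = result = func(self)
        return result
    return property(getter)

class LazyFrame(pd.DataFrame):
    # Hypothetical derived columns; each is computed only when the attribute is accessed.
    total = lazy_column('total', lambda f: f['basic_col1'] + f['basic_col2'])
    product = lazy_column('product', lambda f: f['basic_col1'] * f['basic_col2'])

x = LazyFrame({'basic_col1': [1, 2, 3],
               'basic_col2': [4, 5, 6]})

columns_before = list(x.columns)   # only the two basic columns exist so far
total = x.total                    # computes and stores the 'total' column
columns_after = list(x.columns)    # 'product' has still not been computed
```

    As with the hand-written property, x['total'] returns the stored (possibly stale) values, while x.total recomputes them.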
    
  • 2021-02-05 11:59

    Starting in pandas 0.13 (to be released very soon), you can do something like this. It uses a generator to evaluate a dynamic formula. In-line assignment via eval will be an additional feature in 0.13, see here

    # Assumes `import numpy as np` and `import pandas as pd`.
    In [19]: df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
    
    In [20]: df
    Out[20]: 
              a         b
    0 -1.949107 -0.763762
    1 -0.382173 -0.970349
    2  0.202116  0.094344
    3 -1.225579 -0.447545
    4  1.739508 -0.400829
    
    In [21]: formulas = [ ('c','a+b'), ('d', 'a*c')]
    

    Create a generator that evaluates each formula with eval, assigns the result to a new column, and then yields the updated frame.

    In [22]: def lazy(x, formulas):
       ....:     for col, f in formulas:
       ....:         x[col] = x.eval(f)
       ....:         yield x
       ....:         
    

    In action:

    In [23]: gen = lazy(df,formulas)
    
    In [24]: next(gen)
    Out[24]: 
              a         b         c
    0 -1.949107 -0.763762 -2.712869
    1 -0.382173 -0.970349 -1.352522
    2  0.202116  0.094344  0.296459
    3 -1.225579 -0.447545 -1.673123
    4  1.739508 -0.400829  1.338679
    
    In [25]: next(gen)
    Out[25]: 
              a         b         c         d
    0 -1.949107 -0.763762 -2.712869  5.287670
    1 -0.382173 -0.970349 -1.352522  0.516897
    2  0.202116  0.094344  0.296459  0.059919
    3 -1.225579 -0.447545 -1.673123  2.050545
    4  1.739508 -0.400829  1.338679  2.328644
    

    So the order of evaluation is determined by the user (it is not on-demand). In theory numba is going to support this, so pandas could possibly support it as a backend for eval (which currently uses numexpr for immediate evaluation).
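
    For reference, here is a self-contained sketch of the same generator approach written as a plain Python 3 script. It uses fixed integer data instead of randn so the values are easy to check, and the variable names are illustrative.

```python
import pandas as pd

def lazy(frame, formulas):
    """Add one formula column per step, yielding the frame after each step."""
    for col, expr in formulas:
        frame[col] = frame.eval(expr)
        yield frame

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
formulas = [('c', 'a + b'), ('d', 'a * c')]

gen = lazy(df, formulas)
cols_start = list(df.columns)          # creating the generator evaluates nothing
cols_step1 = list(next(gen).columns)   # first next(): column 'c' is computed
cols_step2 = list(next(gen).columns)   # second next(): column 'd' is computed
```

    Each call to next(gen) advances exactly one formula, in the order given by the formulas list.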

    my 2c.

    Lazy evaluation is nice, but it can easily be achieved using Python's own continuation/generator features, so building it into pandas, while possible, is quite tricky and would need a really nice use case to be generally useful.
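
    As an illustration of doing this in plain Python, the sketch below computes a column on demand and first computes any formula columns it depends on. It is an assumption-laden example, not something from pandas itself: ensure is a hypothetical helper, it uses recursion and the standard ast module rather than a generator, and it does not guard against circular formulas.

```python
import ast
import pandas as pd

def ensure(frame, col, formulas):
    """Return frame[col], computing it (and the formula columns it needs) only if missing."""
    if col in frame.columns:
        return frame[col]
    expr = formulas[col]
    # Collect the bare names used in the expression, e.g. {'a', 'c'} for 'a * c'.
    names = {node.id for node in ast.walk(ast.parse(expr, mode='eval'))
             if isinstance(node, ast.Name)}
    for name in names:
        # Resolve dependencies that are themselves formulas and not yet computed.
        if name in formulas and name not in frame.columns:
            ensure(frame, name, formulas)
    frame[col] = frame.eval(expr)
    return frame[col]

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
formulas = {'c': 'a + b', 'd': 'a * c'}

d = ensure(df, 'd', formulas)   # asking for 'd' also computes 'c' along the way
```

    Unlike the generator version, the caller does not choose the order: requesting d triggers c because the formula for d refers to it.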
