When to use DataFrame.eval() versus pandas.eval() or Python eval()

孤独总比滥情好 2021-01-02 01:19

I have a few dozen conditions (e.g., foo > bar) that I need to evaluate on ~1MM rows of a DataFrame, and the most concise way of writing this is

1 answer
  • 2021-01-02 02:02

    So is the benefit of DataFrame.eval() merely in simplifying the input, or can we identify circumstances where using this method is actually faster?

    The source code for DataFrame.eval() shows that it actually just creates arguments to pass to pd.eval():

    def eval(self, expr, inplace=None, **kwargs):
    
        inplace = validate_bool_kwarg(inplace, 'inplace')
        resolvers = kwargs.pop('resolvers', None)
        kwargs['level'] = kwargs.pop('level', 0) + 1
        if resolvers is None:
            index_resolvers = self._get_index_resolvers()
            resolvers = dict(self.iteritems()), index_resolvers
        if 'target' not in kwargs:
            kwargs['target'] = self
        kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
        return _eval(expr, inplace=inplace, **kwargs)
    

    Where _eval() is just an alias for pd.eval() which is imported at the beginning of the module:

    from pandas.core.computation.eval import eval as _eval
    

    So anything that you can do with df.eval(), you could do with pd.eval() + a few extra lines to set things up. As things currently stand, df.eval() is never strictly faster than pd.eval(). But that doesn't mean there can't be cases where df.eval() is just as good as pd.eval(), yet more convenient to write.
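
    To make the "few extra lines" concrete, here is a minimal sketch of the same condition written three ways. The column names foo and bar come from the question; the random data, the frame size, and the variable names are made up for illustration. Passing the columns through `resolvers` only approximates what df.eval() sets up, since it leaves out the index resolvers.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns named after the question's example.
rng = np.random.default_rng(0)
df = pd.DataFrame({"foo": rng.random(1000), "bar": rng.random(1000)})

# DataFrame.eval resolves bare column names against the frame itself.
via_df_eval = df.eval("foo > bar")

# pd.eval has to be told where names come from. Option 1: hand it the
# frame under the name "df" and use attribute access in the expression.
via_pd_eval_attr = pd.eval("df.foo > df.bar", local_dict={"df": df})

# Option 2: pass the columns as a resolver mapping, which is roughly what
# DataFrame.eval builds internally (minus the index resolvers).
via_pd_eval_resolver = pd.eval("foo > bar", resolvers=[dict(df.items())])

# Plain vectorized comparison, for reference.
plain = df["foo"] > df["bar"]

# All three eval variants should agree with the plain comparison.
results = [via_df_eval, via_pd_eval_attr, via_pd_eval_resolver]
same = all(np.array_equal(r.to_numpy(), plain.to_numpy()) for r in results)
print(same)
```

    The df.eval() form is the shortest to write, which is the convenience argument above; the other two forms are what you would fall back to if the setup cost turned out to matter.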

    However, after playing around with the %prun magic, it appears that the call df.eval() makes to df._get_index_resolvers() adds a fair bit of time. Ultimately, _get_index_resolvers() ends up calling the .copy() method of numpy.ndarray, and that copy is what slows things down. pd.eval() also calls numpy.ndarray.copy() at some point, but there it takes a negligible amount of time (on my machine at least).
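
    If you are not in IPython, something similar to %prun can be done with the standard library's cProfile. This is only a sketch: the data is made up, and which internal calls dominate (including whether _get_index_resolvers() shows up at all) will likely depend on your pandas version.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Hypothetical data, sized like the ~1MM rows mentioned in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({"foo": rng.random(1_000_000), "bar": rng.random(1_000_000)})

# Profile a single df.eval() call.
profiler = cProfile.Profile()
profiler.enable()
df.eval("foo > bar")
profiler.disable()

# Collect the ten entries with the largest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

    Swapping the profiled line for the equivalent pd.eval() call gives a second report to compare against.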

    Long story short, it appears that df.eval() tends to be slower than pd.eval() because under the hood it's just pd.eval() with extra steps, and these steps are non-trivial.
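
    To check this on your own setup, a rough timing comparison could look like the sketch below. The helper name compare_eval_timings and the data are my own inventions, not anything from pandas, and the numbers will vary with pandas version, whether numexpr is installed, and the machine, so treat the result as a measurement to run rather than a fixed answer.

```python
import timeit

import numpy as np
import pandas as pd


def compare_eval_timings(n_rows=1_000_000, repeats=5):
    """Return best-of-`repeats` wall times in seconds for both eval styles."""
    # Hypothetical data: two random numeric columns.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"foo": rng.random(n_rows), "bar": rng.random(n_rows)})
    scope = {"df": df}

    # Each timed call evaluates the same condition once.
    t_df = min(timeit.repeat(lambda: df.eval("foo > bar"),
                             number=1, repeat=repeats))
    t_pd = min(timeit.repeat(lambda: pd.eval("df.foo > df.bar", local_dict=scope),
                             number=1, repeat=repeats))
    return {"df.eval": t_df, "pd.eval": t_pd}


timings = compare_eval_timings()
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

    Taking the minimum over several repeats reduces noise from other processes; with a few dozen conditions, multiply the per-call difference accordingly to see whether it matters for your workload.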
