Interpolating values from a DataFrame based on a column value

青春惊慌失措 2021-01-20 16:37

Assume I have the following problem:

import pandas as pd
import numpy as np

xp = [0.0, 0.5, 1.0]

np.random.seed(100)
df = pd.DataFrame(np.random.rand(1         


        
1 answer
  •  滥情空心
    2021-01-20 17:14

    One good solution for making this faster is pandas.DataFrame.eval():

    TL;DR

    Seconds for 10 runs (timeit with number=10), by number of rows
    Rows:     100   1000  10000    1E5    1E6    1E7
    apply:  0.076  0.734  7.812
    eval:   0.056  0.053  0.058  0.087  0.338  2.887
    

    As these timings show, eval() carries a noticeable fixed overhead: up to 10,000 rows its run time is essentially constant. At 10,000 rows, though, it is already about two orders of magnitude faster than apply (0.058 s versus 7.812 s), so the overhead is certainly worth paying for large data sets. The apply version was not timed beyond 10,000 rows.

    What is it?

    From the docs:

    pandas.eval(expr, parser='pandas', engine=None, truediv=True, 
                local_dict=None, global_dict=None, resolvers=(),
                level=0, target=None, inplace=None)
    

    Evaluate a Python expression as a string using various backends.

    The following arithmetic operations are supported: +, -, *, /, ** , %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
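    As a small illustration of the operations listed above, here is a minimal sketch using DataFrame.eval() on a toy frame. The column names a and b and the variable names are made up for this example.

```python
import pandas as pd

# Toy frame; column names are illustrative only
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Arithmetic on columns, written as a string expression
total = df.eval("a + b * 2")

# Boolean operations with the bitwise operators ...
mask = df.eval("(a > 1) & (b < 6)")

# ... or, with the default 'pandas' parser, with the words and/or/not
mask_words = df.eval("a > 1 and b < 6")

# An assignment inside the expression returns a new frame with the extra column
df2 = df.eval("c = a + b")

print(total.tolist())
print(mask.tolist())
print(df2)
```

    Note that without inplace=True the assignment form leaves the original frame untouched and returns a copy.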

    Tricks used for this question:

    The code below exploits the fact that the interpolation here always has exactly two segments. It computes the interpolant for both segments and then discards the unused one by multiplying each by a boolean test (i.e., 0 or 1).

    The actual expression passed to eval is:

    ((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)
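    To see why the masked sum is equivalent to a two-segment interpolation, the same arithmetic can be written in plain NumPy and compared against np.interp. This is only a sketch for checking the idea; the random test data and the variable names seg1, seg2, masked, and reference are assumptions made for the example.

```python
import numpy as np

xp = [0.0, 0.5, 1.0]

# Random test data: four arrays of 1000 values in [0, 1)
rng = np.random.default_rng(0)
x0, y1, y2, y3 = rng.random((4, 1000))

# Linear interpolant on each segment, evaluated for every row
seg1 = (y2 - y1) / 0.5 * (x0 - 0.0) + y1
seg2 = (y3 - y2) / 0.5 * (x0 - 0.5) + y2

# The boolean masks act as 0/1 weights, so only the matching segment survives
masked = seg1 * (x0 < 0.5) + seg2 * (x0 >= 0.5)

# Row-by-row reference using np.interp
reference = np.array(
    [np.interp(x, xp, [a, b, c]) for x, a, b, c in zip(x0, y1, y2, y3)]
)

print(np.allclose(masked, reference))
```

    One caveat of masking by multiplication: a NaN or infinity in the unused segment is not discarded, since NaN * 0 is still NaN.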
    

    Code:

    import pandas as pd
    import numpy as np
    
    xp = [0.0, 0.5, 1.0]
    
    np.random.seed(100)
    
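    def method1():
        # Row-wise baseline: call np.interp once per row (slow for large frames)
        df['interp'] = df.apply(
            lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)
    
    def method2():
        # Template for one segment: linear interpolant times a boolean mask
        exp = '((y%d-y%d) / %s * (x0-%s) + y%d) * (x0 %s 0.5)'
        # Segment 1 applies where x0 < 0.5, segment 2 where x0 >= 0.5
        exp_1 = exp % (2, 1, xp[1] - xp[0], xp[0], 1, '<')
        exp_2 = exp % (3, 2, xp[2] - xp[1], xp[1], 2, '>=')
    
        # Sum the two masked segments in a single vectorized eval() call
        df['interp2'] = df.eval(exp_1 + '+' + exp_2)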
    
    from timeit import timeit
    
    def runit(stmt):
        print("%s: %.3f" % (
            stmt, timeit(stmt + '()', number=10,
                         setup='from __main__ import ' + stmt)))
    
    def runit_size(size):
        global df
        df = pd.DataFrame(
            np.random.rand(size, 4), columns=['x0', 'y1', 'y2', 'y3'])
    
        print('Rows: %d' % size)
        if size <= 10000:
            runit('method1')
        runit('method2')
    
    for i in (100, 1000, 10000, 100000, 1000000, 10000000):
        runit_size(i)
    
    print(df.head())
    

    Results:

             x0        y1        y2        y3    interp   interp2
    0  0.060670  0.949837  0.608659  0.672003  0.908439  0.908439
    1  0.462774  0.704273  0.181067  0.647582  0.220021  0.220021
    2  0.568109  0.954138  0.796690  0.585310  0.767897  0.767897
    3  0.455355  0.738452  0.812236  0.927291  0.805648  0.805648
    4  0.826376  0.029957  0.772803  0.521777  0.608946  0.608946
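
    The same masking trick could in principle be extended to more than two segments by generating one masked term per segment. The sketch below assumes strictly increasing breakpoints and x values inside [xp[0], xp[-1]]; build_interp_expr is a hypothetical helper written for this example, not part of pandas, and the four-point xp and the y4 column are made up.

```python
import numpy as np
import pandas as pd


def build_interp_expr(xp, x_col, y_cols):
    """Hypothetical helper: build an eval() string with one masked term per segment."""
    terms = []
    n_seg = len(xp) - 1
    for i in range(n_seg):
        lo, hi = float(xp[i]), float(xp[i + 1])
        width = hi - lo
        # Close the last segment on the right so x == xp[-1] is covered
        upper = "<=" if i == n_seg - 1 else "<"
        line = (f"(({y_cols[i + 1]} - {y_cols[i]}) / {width!r}"
                f" * ({x_col} - {lo!r}) + {y_cols[i]})")
        mask = f"(({x_col} >= {lo!r}) & ({x_col} {upper} {hi!r}))"
        terms.append(f"{line} * {mask}")
    return " + ".join(terms)


xp = [0.0, 0.25, 0.5, 1.0]
y_cols = ["y1", "y2", "y3", "y4"]

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 5)), columns=["x0"] + y_cols)

# One vectorized eval() call over all segments
df["interp"] = df.eval(build_interp_expr(xp, "x0", y_cols))

# Row-by-row reference for comparison
expected = [
    np.interp(row.x0, xp, [row.y1, row.y2, row.y3, row.y4])
    for row in df.itertuples()
]
print(np.allclose(df["interp"], expected))
```

    Unlike np.interp, which clamps to the end values, this sketch yields 0 for any x outside the breakpoint range, because every mask is then false.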
    
