Vectorizing an iterative function on Pandas DataFrame

一世执手 submitted on 2021-02-08 10:23:08

Question


I have a dataframe where the first row is the initial condition.

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan]* 3})

and a function f(x,r) = r*x*(1-x), where r = 2 is a constant and 0 <= x <= 1.

I want to produce the following dataframe by applying the function to column Pop row-by-row iteratively. I.e., df.Pop[i] = f(df.Pop[i-1], r=2)

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4, 0.48, 0.4992, 0.49999872]})

Question: Is it possible to do this in a vectorized way?

I can achieve the desired result by using a loop to build lists for the x and y values, but this is not vectorized.
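For reference, a loop-based baseline along those lines might look like the sketch below. The helper name f and the list name pops are illustrative assumptions; the original loop is not shown in the question.

```python
import numpy as np
import pandas as pd


def f(x, r=2):
    """One step of the map f(x, r) = r*x*(1-x) from the question."""
    return r * x * (1 - x)


df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})

# Start from the initial condition in row 0 and apply f repeatedly.
pops = [df.at[0, "Pop"]]
for _ in range(len(df) - 1):
    pops.append(f(pops[-1]))

df["Pop"] = pops
print(df)
```

This gives the desired values, but each row is computed in a Python-level loop.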

I have also tried the following, but every NaN position is filled with 0.48.

R = 2  # the constant r from above
df.loc[1:, "Pop"] = R * df.Pop[:-1] * (1 - df.Pop[:-1])

Answer 1:


It is IMPOSSIBLE to do this in a vectorized way.

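Vectorization reduces execution time by applying one operation to many elements at once, which only works when the elements can be computed independently of each other. The values in your question cannot: each one depends on the previous result, so they must be computed in sequential order, not in parallel. See this answer for a detailed explanation. For the same reason, things like df.expanding(2).apply(f) and df.rolling(2).apply(f) won't work: they only see the original column values, not the results computed for earlier rows.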
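To make the sequential dependency concrete, here is a minimal sketch (the variable names pop, prev and step are my own). A single vectorized pass built on shift can only fill the one row whose predecessor is already known, so filling n rows needs n passes, which is a loop again.

```python
import numpy as np
import pandas as pd

R = 2
pop = pd.Series([0.4, np.nan, np.nan, np.nan])

# One vectorized pass: each row reads the value currently stored in the row above.
prev = pop.shift(1)
step = R * prev * (1 - prev)
print(step.tolist())  # only index 1 gets a number; indices 2 and 3 stay NaN

# Repeating the pass advances the known values by one row each time.
for _ in range(len(pop) - 1):
    prev = pop.shift(1)
    pop = pop.fillna(R * prev * (1 - prev))

print(pop.tolist())
```

Each pass touches the whole column but produces only one new value, so this is strictly worse than a plain loop; it is shown only to illustrate why the problem does not vectorize.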

That said, the loop itself can still be made more efficient. One option is to do the iteration with a generator, which is a very common construct for implementing iterative processes.

def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1-x)
        yield x

# execute: fill rows 1..n-1 starting from the initial value in row 0
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))

Result:

print(df[["Pop"]])
        Pop
0  0.400000
1  0.480000
2  0.499200
3  0.499999

For small data it is perfectly fine to stop here. If the function is going to be run many times, however, consider optimizing the generator with numba.

  • Run pip install numba or conda install numba in the console first.
  • Add import numba to your script.
  • Add the decorator @numba.njit in front of the generator.
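As a rough sketch of what a numba-compiled version could look like: note that this is a variant of my own, not the decorated generator described above. It fills a preallocated NumPy array inside a function named logistic_orbit (a hypothetical name), and it falls back to plain Python when numba is not installed so the snippet still runs.

```python
import numpy as np
import pandas as pd

try:
    import numba
    jit = numba.njit
except ImportError:
    # Assumption: without numba, run the same function uncompiled.
    def jit(func):
        return func


@jit
def logistic_orbit(x_init, n, r):
    """Return x_init followed by n iterates of x -> r*x*(1-x)."""
    out = np.empty(n + 1)
    out[0] = x_init
    for i in range(1, n + 1):
        out[i] = r * out[i - 1] * (1.0 - out[i - 1])
    return out


df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})
df["Pop"] = logistic_orbit(df.at[0, "Pop"], len(df) - 1, 2.0)
print(df)
```

The first call includes compilation time when numba is present, so call the function once before timing it.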

Change the number of np.nan entries to 10^6 and compare the execution times yourself. On my Core i5-8250U 64-bit laptop, the time dropped from 468 ms to 217 ms.
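If you want to measure this yourself, a simple timing harness could look like the sketch below. The names n_missing, big and elapsed are my own, and the number it prints depends entirely on your machine; this version times only the pure-Python generator.

```python
import time

import numpy as np
import pandas as pd


def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1 - x)
        yield x


n_missing = 10**6  # number of np.nan rows to fill
big = pd.DataFrame({"Year": np.arange(n_missing + 1),
                    "Pop": [0.4] + [np.nan] * n_missing})

start = time.perf_counter()
big.loc[1:, "Pop"] = list(gen(big.at[0, "Pop"], len(big) - 1))
elapsed = time.perf_counter() - start

print(f"pure-Python generator: {elapsed * 1000:.0f} ms")
```

To compare against a compiled version, time it the same way, after one warm-up call so that compilation is not included in the measurement.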



Source: https://stackoverflow.com/questions/64515499/vectorizing-an-iterative-function-on-pandas-dataframe
