Improve Pandas Merge performance

后端未结

关注

 2  867

I specifically dont have performace issue with Pands Merge, as other posts suggest, but I\'ve a class in which there are lot of methods, which does a lot of merge on dataset

相关标签:

2条回答

终归单人心

2020-12-03 03:49

I suggest that you set your merge columns as index, and use df1.join(df2) instead of merge, it's much faster.

Here's some example including profiling:

In [1]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

Here's a regular left merge on A and A2:

In [2]: %%timeit
        x = df1.merge(df2, how='left', left_on='A', right_on='A2')

1 loop, best of 3: 441 ms per loop

Here's the same, using join:

In [3]: %%timeit
        x = df1.set_index('A').join(df2.set_index('A2'), how='left')

1 loop, best of 3: 184 ms per loop

Now obviously if you can set the index before looping, the gain in terms of time will be much greater:

# Do this before looping
In [4]: %%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)

CPU times: user 9.78 ms, sys: 9.31 ms, total: 19.1 ms
Wall time: 16.8 ms

Then in the loop, you'll get something that in this case is 30 times faster:

In [5]: %%timeit
        x = df1.join(df2, how='left')
100 loops, best of 3: 14.3 ms per loop

0 讨论(0)

遥遥无期

2020-12-03 03:51

set_index on merging column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's Answer.

import pandas as pd
import numpy as np
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

%%timeit
    x = df1.merge(df2, how='left', left_on='A', right_on='A2')   
#1 loop, best of 3: 664 ms per loop

%%timeit  
    x = df1.set_index('A').join(df2.set_index('A2'), how='left') 
#1 loop, best of 3: 354 ms per loop

%%time 
    df1.set_index('A', inplace=True)
    df2.set_index('A2', inplace=True)
#Wall time: 16 ms

%%timeit
    x = df1.join(df2, how='left')  
#10 loops, best of 3: 80.4 ms per loop

When the column to be joined has integers not in the same order on both tables you can still expect a great speed up of 8 times.

0 讨论(0)