I have a list of 4 pandas dataframes containing a day of tick data that I want to merge into a single data frame. I cannot understand the behavior of concat on my timestamps
So what are you doing is with append and concat is almost equivalent. The difference is the empty DataFrame. For some reason this causes a big slowdown, not sure exactly why, will have to look at some point. Below is a recreation of basically what you did.
I almost always use concat (though in this case they are equivalent, except for the empty frame); if you don't use the empty frame they will be the same speed.
In [17]: df1 = pd.DataFrame(dict(A = range(10000)),index=pd.date_range('20130101',periods=10000,freq='s'))
In [18]: df1
Out[18]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2013-01-01 00:00:00 to 2013-01-01 02:46:39
Freq: S
Data columns (total 1 columns):
A 10000 non-null values
dtypes: int64(1)
In [19]: df4 = pd.DataFrame()
The concat
In [20]: %timeit pd.concat([df1,df2,df3])
1000 loops, best of 3: 270 us per loop
This is equavalent of your append
In [21]: %timeit pd.concat([df4,df1,df2,df3])
10 loops, best of
3: 56.8 ms per loop
I have implemented a tiny benchmark (please find the code on Gist) to evaluate the pandas' concat and append. I updated the code snippet and the results after the comment by ssk08 - thanks alot!
The benchmark ran on a Mac OS X 10.13 system with Python 3.6.2 and pandas 0.20.3.
+--------+---------------------------------+---------------------------------+ | | ignore_index=False | ignore_index=True | +--------+---------------------------------+---------------------------------+ | size | append | concat | append/concat | append | concat | append/concat | +--------+--------+--------+---------------+--------+--------+---------------+ | small | 0.4635 | 0.4891 | 94.77 % | 0.4056 | 0.3314 | 122.39 % | +--------+--------+--------+---------------+--------+--------+---------------+ | medium | 0.5532 | 0.6617 | 83.60 % | 0.3605 | 0.3521 | 102.37 % | +--------+--------+--------+---------------+--------+--------+---------------+ | large | 0.9558 | 0.9442 | 101.22 % | 0.6670 | 0.6749 | 98.84 % | +--------+--------+--------+---------------+--------+--------+---------------+
Using ignore_index=False
append
is slightly faster, with ignore_index=True
concat
is slightly faster.
tl;dr
No significant difference between concat
and append
.
Concat gives the flexibility to join based on the axis( all rows or all columns)
Append is the specific case(axis=0, join='outer') of concat
Join is based on the indexes (set by set_index) on how variable =['left','right','inner','couter']
Merge is based on any particular column each of the two dataframes, this columns are variables on like 'left_on', 'right_on', 'on'
One more thing you have to keep in mind that the APPEND() method in Pandas doesn't modify the original object. Instead it creates a new one with combined data. Because of involving creation and data buffer, its performance is not well. You'd better use CONCAT() function when doing multi-APPEND operations.