I have a large DataFrame (million+ records) I'm using to store the core of my data (like a database), and I then have a smaller DataFrame (1 to 2000 records) that I'm combining into it.
I've had to copy between large dataframes a fair bit. I'm using dataframes with real-time market data, which may not be what pandas is designed for, but this is my experience.
On my PC, copying a single datapoint with .at takes 15µs, with the df size making negligible difference. .loc takes a minimum of 550µs and increases as the df gets larger: 3100µs to copy a single point from one 100000x2 df to another. .ix seems to be just barely faster than .loc.
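If you want to reproduce these timings on your own machine, a minimal timeit sketch along these lines should do; the frame sizes and labels here are arbitrary stand-ins, not my actual market-data frames:

import timeit
import numpy as np
import pandas as pd

src = pd.DataFrame(np.random.randn(100000, 2))
dst = pd.DataFrame(np.random.randn(100000, 2))

def copy_at():
    # scalar copy via .at
    dst.at[500, 1] = src.at[500, 1]

def copy_loc():
    # scalar copy via .loc
    dst.loc[500, 1] = src.loc[500, 1]

n = 10000
print('.at :', timeit.timeit(copy_at, number=n) / n * 1e6, 'µs')
print('.loc:', timeit.timeit(copy_loc, number=n) / n * 1e6, 'µs')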
For a single datapoint, .at is very fast and is not impacted by the size of the dataframe, but it cannot handle ranges, so loops are required, and as such the time scaling is linear. .loc and .ix, on the other hand, are (relatively) very slow for single datapoints, but they can handle ranges and scale up better than linearly. However, unlike .at, they slow down significantly with dataframe size. Therefore, when I'm frequently copying small ranges between large dataframes, I tend to use .at with a for loop, and otherwise I use .ix with a range (both shown below).
for new_id in new_ids:
    for core_col in core_cols:
        # .at only handles scalars, so copy each datapoint individually
        df_large.at[new_id, core_col] = df_small.at[new_id, core_col]
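For comparison, the range version is a single aligned assignment. (Shown here with .loc, since .ix has since been deprecated and removed from pandas; new_ids and core_cols are assumed to be valid labels in both frames.)

df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]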
Of course, to do it properly I'd go with Jeff's solution below, but it's nice to have options.
Caveats of .at: it doesn't work with ranges, and it doesn't work if the dtype is datetime (and maybe others).
There is nothing inherently slow about using .loc to set with an alignable frame, though it does go through a bit of code to cover a lot of cases, so it's probably not ideal to have in a tight loop. FYI, this example is slightly different from the 2nd example.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pandas import DataFrame
In [4]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [5]: df
Out[5]:
   0  1  2
a  1  1  1
b  1  1  1
c  1  1  1
d  1  1  1
e  1  1  1
f  1  1  1
g  1  1  1
h  1  1  1
i  1  1  1
j  1  1  1

[10 rows x 3 columns]
In [6]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [7]: df2
Out[7]:
   1  2
a  0  0
f  0  0
g  0  0

[3 rows x 2 columns]
In [8]: df.loc[df2.index,df2.columns] = df2
In [9]: df
Out[9]:
   0  1  2
a  1  0  0
b  1  1  1
c  1  1  1
d  1  1  1
e  1  1  1
f  1  0  0
g  1  0  0
h  1  1  1
i  1  1  1
j  1  1  1

[10 rows x 3 columns]
Here's an alternative. It may or may not fit your data pattern. If the updates (your small frames) are pretty much independent, this will work (in other words, you are not updating the big frame, then picking out a new sub-frame, then updating, etc.; if that is your pattern, then using .loc is about right).
Instead of updating the big frame, update the small frame with the columns from the big frame, e.g.:
In [10]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [11]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [12]: needed_columns = df.columns.difference(df2.columns)
In [13]: df2[needed_columns] = df.reindex(index=df2.index,columns=needed_columns)
In [14]: df2
Out[14]:
   1  2  0
a  0  0  1
f  0  0  1
g  0  0  1

[3 rows x 3 columns]
In [15]: df3 = DataFrame(0,index=list('cji'),columns=[1,2])
In [16]: needed_columns = df.columns.difference(df3.columns)
In [17]: df3[needed_columns] = df.reindex(index=df3.index,columns=needed_columns)
In [18]: df3
Out[18]:
   1  2  0
c  0  0  1
j  0  0  1
i  0  0  1

[3 rows x 3 columns]
And concat everything together when you want (they are kept in a list in the meantime; or, as in my comments below, these sub-frames could be moved to external storage when created, then read back before this concatenating step).
In [19]: pd.concat([df.reindex(index=df.index.difference(df2.index).difference(df3.index)), df2, df3]).reindex_like(df)
Out[19]:
   0  1  2
a  1  0  0
b  1  1  1
c  1  0  0
d  1  1  1
e  1  1  1
f  1  0  0
g  1  0  0
h  1  1  1
i  1  0  0
j  1  0  0

[10 rows x 3 columns]
The beauty of this pattern is that it is easily extended to using an actual db (or, much better, an HDFStore) to actually store the 'database', then creating/updating sub-frames as needed, and writing out to a new store when finished.
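A rough sketch of that store-backed variant could look like this (it requires PyTables; the file name and key names here are invented for illustration):

import pandas as pd

df = pd.DataFrame(1., index=list('abcdefghij'), columns=[0, 1, 2])
df2 = pd.DataFrame(0, index=list('afg'), columns=[1, 2])

# augment the sub-frame from the 'database', as above
needed = df.columns.difference(df2.columns)
df2[needed] = df.reindex(index=df2.index, columns=needed)

with pd.HDFStore('database.h5') as store:
    store['core'] = df         # the 'database'
    store['sub/df2'] = df2     # one key per sub-frame as it is created

# later: read the sub-frames back and do the concat step
with pd.HDFStore('database.h5') as store:
    core = store['core']
    subs = [store[k] for k in store.keys() if k.startswith('/sub/')]

remaining = core.index
for s in subs:
    remaining = remaining.difference(s.index)
result = pd.concat([core.reindex(remaining)] + subs).reindex_like(core)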
I use this pattern all of the time, though with Panels actually.
I am able to use multiple processes to perform my computations AND write each individual Panel to a separate file, as they are all completely independent. The only dependent part is the concat.
This is essentially a map-reduce pattern.
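A bare-bones sketch of that shape, with DataFrames standing in for the Panels (the worker function, file names, and stand-in computation are all invented for illustration; the point is that the map steps are fully independent and only the concat needs every result):

import multiprocessing as mp
import pandas as pd

def work(job):
    # map step: compute one independent sub-frame and write it to its own file
    name, idx = job
    sub = pd.DataFrame(0, index=idx, columns=[1, 2])   # stand-in computation
    path = name + '.pkl'
    sub.to_pickle(path)
    return path

if __name__ == '__main__':
    jobs = [('sub_a', list('afg')), ('sub_b', list('cji'))]
    with mp.Pool(2) as pool:
        paths = pool.map(work, jobs)

    # reduce step: the concat is the only part that depends on all the workers
    result = pd.concat([pd.read_pickle(p) for p in paths])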