I currently have a massive set of datasets, one for each year in the 2000's. I take a combination of three years and run code on it to clean the data. The problem is that d
Use chunked pandas by importing Blaze. The following is taken from the Blaze documentation at http://blaze.readthedocs.org/en/latest/ooc.html:
Naive use of Blaze triggers out-of-core systems automatically when called on large files.
from blaze import Data
d = Data('my-small-file.csv')
d.my_column.count() # Uses Pandas
d = Data('my-large-file.csv')
d.my_column.count() # Uses Chunked Pandas
How does it work? Blaze breaks up the data resource into a sequence of chunks. It pulls one chunk into memory, operates on it, pulls in the next, and so on. After all chunks are processed, it often has to finalize the computation with another operation on the intermediate results.
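The same chunk-then-finalize pattern can be sketched in plain pandas, without Blaze. This is only an illustration of the idea, not Blaze's actual implementation: `chunked_count` is a hypothetical helper, and the in-memory CSV text stands in for a large file on disk. It relies on the `chunksize` argument of `pd.read_csv`, which yields one DataFrame per chunk.

```python
import io

import pandas as pd


def chunked_count(source, column, chunksize):
    """Count non-null values in `column`, reading `source` one chunk at a time."""
    partial_counts = []
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        partial_counts.append(chunk[column].count())
    # Finalize: combine the per-chunk intermediate results.
    return int(sum(partial_counts))


# Small stand-in for a large CSV file; the third row has a missing value.
csv_text = "my_column,other\n1,a\n2,b\n,c\n4,d\n5,e\n"
total = chunked_count(io.StringIO(csv_text), "my_column", chunksize=2)
print(total)
```

For a real file, a path such as 'my-large-file.csv' would be passed in place of the `io.StringIO` object. A count finalizes with a simple sum; other reductions (a mean, for example) need to carry more than one intermediate value per chunk.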
One way to achieve this is as follows:
import numpy as np
import pandas as pd

# generate a random DataFrame
num_rows = 100
locs = list('abcdefghijklmno')
df = pd.DataFrame(
    {'id': np.random.randint(1, 100, num_rows),
     'location': np.random.choice(locs, num_rows),
     'year': np.random.randint(2005, 2007, num_rows)})
df.sort_values('id', inplace=True)
print('**** sorted DF (first 10 rows) ****')
print(df.head(10))

# chop the DataFrame into chunks of `chunk_size` unique ids each,
# so that all rows sharing an id end up in the same chunk
chunk_size = 5
chunks = list(df.id.unique()[::chunk_size])
# add an upper bound so the rows after the last boundary are not dropped
chunks.append(df.id.max() + 1)
chunk_margins = [(chunks[i-1], chunks[i]) for i in range(1, len(chunks))]
# .loc replaces .ix, which has been removed from pandas
df_chunks = [df.loc[(df.id >= lo) & (df.id < hi)] for lo, hi in chunk_margins]
print('**** first chunk ****')
print(df_chunks[0])
Output:
**** sorted DF (first 10 rows) ****
    id location  year
31   2        c  2005
85   2        e  2006
89   2        l  2006
70   2        i  2005
60   4        n  2005
68   7        g  2005
22   7        e  2006
73  10        i  2005
23  10        j  2006
47  16        n  2005
**** first chunk ****
    id location  year
31   2        c  2005
85   2        e  2006
89   2        l  2006
70   2        i  2005
60   4        n  2005
68   7        g  2005
22   7        e  2006
73  10        i  2005
23  10        j  2006
47  16        n  2005
6   16        k  2006
82  16        g  2005
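The same split can also be sketched with `groupby` instead of explicit boundary pairs. This is an alternative formulation rather than the answer's own method: `chunk_by_id` and the `demo` frame are hypothetical names used for illustration. It numbers the distinct ids in sorted order with `pd.factorize`, integer-divides that number by the chunk size, and groups on the result, so every row is assigned to exactly one chunk.

```python
import pandas as pd


def chunk_by_id(df, ids_per_chunk, key="id"):
    """Split df into chunks holding at most `ids_per_chunk` distinct key values."""
    # codes[i] is the rank of row i's key among the sorted distinct keys.
    codes, _ = pd.factorize(df[key], sort=True)
    # Consecutive runs of `ids_per_chunk` distinct keys share a chunk number.
    chunk_no = codes // ids_per_chunk
    return [part for _, part in df.groupby(chunk_no, sort=True)]


# Small deterministic frame (illustrative data, not from the question).
demo = pd.DataFrame({
    "id": [7, 2, 2, 10, 4, 16, 21, 7],
    "year": [2005, 2006, 2005, 2006, 2005, 2005, 2006, 2006],
})
parts = chunk_by_id(demo, ids_per_chunk=2)
for part in parts:
    print(part)
```

Because rows are grouped by id rank, the input does not need to be sorted first, and all rows that share an id stay together in one chunk.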