Question
I'm fairly new to both Python and pandas, and I'm trying to figure out the fastest way to execute a mammoth left outer join between a left dataset with roughly 11 million rows and a right dataset with ~160K rows and four columns. It should be a many-to-one situation, but I'd like the join not to throw an error if there's a duplicate row on the right side. I'm using Canopy Express on a Windows 7 64-bit system with 8 GB of RAM, and I'm pretty much stuck with that.
Here's a model of the code I've put together so far:
import pandas as pd
leftcols = ['a','b','c','d','e','key']
leftdata = pd.read_csv("LEFT.csv", names=leftcols)
rightcols = ['x','y','z','key']
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)
mergedata = pd.merge(leftdata, rightdata, on='key', how='left')
mergedata.to_csv("FINAL.csv")
This works with small files but produces a MemoryError on my system with file sizes two orders of magnitude smaller than the size of the files I actually need to merge.
I've been browsing through related questions (one, two, three), but none of the answers really get at this basic problem - or if they do, it's not explained well enough for me to recognize the potential solution - and the accepted answers are no help. I'm already on a 64-bit system and using the most current stable version of Canopy (1.5.5 64-bit, with Python 2.7.10).
What is the fastest and/or most pythonic approach to avoiding this MemoryError issue?
Answer 1:
Why not just read your right file into pandas (or even into a simple dictionary), then loop through your left file using the csv module to read, extend, and write each row? Is processing time a significant constraint (versus your development time)?
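For example, something like the following rough sketch - assuming the file names from the question and the same column layouts, with the key in the last column of each file:
import csv

# Load the small right-hand file into one lookup table:
# key (last column) -> its three other columns.
with open("RIGHT.csv", "rb") as f:
    lookup = {row[3]: row[0:3] for row in csv.reader(f)}

# Stream the big left-hand file one row at a time, extend each row with
# the matching right-hand columns (or NaN placeholders if there is no
# match), and write it straight out, so only one row is held in memory.
with open("LEFT.csv", "rb") as fin, open("FINAL.csv", "wb") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        writer.writerow(row + lookup.get(row[5], ["NaN", "NaN", "NaN"]))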
Answer 2:
This approach ended up working. Here's a model of my code:
import csv

idata = open("KEY_ABC.csv", "rU")
odata = open("KEY_XYZ.csv", "rU")
leftdata = csv.reader(idata)
rightdata = csv.reader(odata)

# Yields the reader's rows in lists of up to `chunksize` rows;
# the same list object is reused (cleared in place) between chunks.
def gen_chunks(reader, chunksize=1000000):
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

count = 0

# One lookup dictionary per non-key column of the right file, keyed on
# the join key in the last column; rewind the file between passes.
d1 = dict([(rows[3], rows[0]) for rows in rightdata])
odata.seek(0)
d2 = dict([(rows[3], rows[1]) for rows in rightdata])
odata.seek(0)
d3 = dict([(rows[3], rows[2]) for rows in rightdata])

# Process the left file a chunk at a time, appending one looked-up
# column per pass, and write each finished chunk to its own output file.
for chunk in gen_chunks(leftdata):
    res = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6],
            d1.get(k[6], "NaN")] for k in chunk]
    res1 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7],
             d2.get(k[6], "NaN")] for k in res]
    res2 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7], k[8],
             d3.get(k[6], "NaN")] for k in res1]
    namestart = "FINAL_"
    nameend = ".csv"
    count = count + 1
    filename = namestart + str(count) + nameend
    with open(filename, "wb") as csvfile:
        output = csv.writer(csvfile)
        output.writerows(res2)
By splitting the left dataset into chunks, turning the right dataset into one dictionary per non-key column, and adding columns to the left dataset (filling them from those dictionaries on the key match), the script managed to do the whole left join in about four minutes with no memory issues.
Thanks also to user miku who provided the chunk generator code in a comment on this post.
That said: I highly doubt this is the most efficient way of doing this. If anyone has suggestions to improve this approach, fire away.
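For instance, it might be simpler to let pandas itself stream the left file in chunks via read_csv's chunksize and merge each chunk against the small right-hand table. A rough, untested sketch along those lines, reusing the file and column names from the question:
import pandas as pd

leftcols = ['a', 'b', 'c', 'd', 'e', 'key']
rightcols = ['x', 'y', 'z', 'key']

# The right table (~160K rows) is small enough to keep in memory whole.
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

# Stream the left table in 1M-row chunks, merge each chunk, and append
# it to a single output file so the full result never sits in memory.
first = True
for chunk in pd.read_csv("LEFT.csv", names=leftcols, chunksize=1000000):
    merged = chunk.merge(rightdata, on='key', how='left')
    merged.to_csv("FINAL.csv", mode='w' if first else 'a',
                  header=first, index=False)
    first = False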
Answer 3:
As suggested in another question ("Large data" work flows using pandas), dask (http://dask.pydata.org) could be an easy option.
Simple example
import dask.dataframe as dd
df1 = dd.read_csv('df1.csv')
df2 = dd.read_csv('df2.csv')
df_merge = dd.merge(df1, df2, how='left')
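Since the files in the question have no header row, the column names would need to be passed explicitly, and the merge is lazy, so the result has to be written out (or computed) to actually run. Roughly, and untested:
import dask.dataframe as dd

leftcols = ['a', 'b', 'c', 'd', 'e', 'key']
rightcols = ['x', 'y', 'z', 'key']

df1 = dd.read_csv('LEFT.csv', names=leftcols)
df2 = dd.read_csv('RIGHT.csv', names=rightcols)

# The merge is evaluated lazily; writing the output triggers the work.
# dask writes one CSV per partition, substituting '*' with its index.
df_merge = dd.merge(df1, df2, on='key', how='left')
df_merge.to_csv('FINAL-*.csv', index=False)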
Source: https://stackoverflow.com/questions/32635169/memoryerror-with-python-pandas-and-large-left-outer-joins