Question
I'm fairly new to both Python and pandas, and I'm trying to figure out the fastest way to execute a mammoth left outer join between a left dataset with roughly 11 million rows and a right dataset with ~160K rows and four columns. It should be a many-to-one situation, but I'd like the join not to throw an error if there's a duplicate row on the right side. I'm using Canopy Express on a Windows 7 64-bit system with 8 GB of RAM, and I'm pretty much stuck with that.
Here's a model of the code I've put together so far:
import pandas as pd
leftcols = ['a','b','c','d','e','key']
leftdata = pd.read_csv("LEFT.csv", names=leftcols)
rightcols = ['x','y','z','key']
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)
mergedata = pd.merge(leftdata, rightdata, on='key', how='left')
mergedata.to_csv("FINAL.csv")
This works with small files but produces a MemoryError on my system with file sizes two orders of magnitude smaller than the size of the files I actually need to merge.
I've been browsing through related questions (one, two, three), but none of the answers really get at this basic problem - or if they do, it's not explained well enough for me to recognize the potential solution - and the accepted answers are no help. I'm already on a 64-bit system and using the most current stable version of Canopy (1.5.5 64-bit, with Python 2.7.10).
What is the fastest and/or most pythonic approach to avoiding this MemoryError issue?
Answer 1:
Why not just read your right file into pandas (or even into a simple dictionary), then loop through your left file using the csv module to read, extend, and write each row? Is processing time a significant constraint (versus your development time)?
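For example, something like the following rough sketch - assuming the file names from the question and the same column layouts, with the key in the last column of each file:
import csv

# Load the small right-hand file into one lookup table:
# key (last column) -> its three other columns.
with open("RIGHT.csv", "rb") as f:
    lookup = {row[3]: row[0:3] for row in csv.reader(f)}

# Stream the big left-hand file one row at a time, extend each row with
# the matching right-hand columns (or NaN placeholders if there is no
# match), and write it straight out, so only one row is held in memory.
with open("LEFT.csv", "rb") as fin, open("FINAL.csv", "wb") as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        writer.writerow(row + lookup.get(row[5], ["NaN", "NaN", "NaN"]))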
Answer 2:
This approach ended up working. Here's a model of my code:
import csv

idata = open("KEY_ABC.csv", "rU")
odata = open("KEY_XYZ.csv", "rU")
leftdata = csv.reader(idata)
rightdata = csv.reader(odata)

# Yields the reader's rows in lists of up to `chunksize` rows;
# the same list object is reused (cleared in place) between chunks.
def gen_chunks(reader, chunksize=1000000):
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk

count = 0

# One lookup dictionary per non-key column of the right file, keyed on
# the join key in the last column; rewind the file between passes.
d1 = dict([(rows[3], rows[0]) for rows in rightdata])
odata.seek(0)
d2 = dict([(rows[3], rows[1]) for rows in rightdata])
odata.seek(0)
d3 = dict([(rows[3], rows[2]) for rows in rightdata])

# Process the left file a chunk at a time, appending one looked-up
# column per pass, and write each finished chunk to its own output file.
for chunk in gen_chunks(leftdata):
    res = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6],
            d1.get(k[6], "NaN")] for k in chunk]
    res1 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7],
             d2.get(k[6], "NaN")] for k in res]
    res2 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7], k[8],
             d3.get(k[6], "NaN")] for k in res1]
    namestart = "FINAL_"
    nameend = ".csv"
    count = count + 1
    filename = namestart + str(count) + nameend
    with open(filename, "wb") as csvfile:
        output = csv.writer(csvfile)
        output.writerows(res2)
By splitting the left dataset into chunks, turning the right dataset into one dictionary per non-key column, and adding columns to the left dataset (filling them from those dictionaries on the key match), the script managed to do the whole left join in about four minutes with no memory issues.
Thanks also to user miku who provided the chunk generator code in a comment on this post.
That said: I highly doubt this is the most efficient way of doing this. If anyone has suggestions to improve this approach, fire away.
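For instance, it might be simpler to let pandas itself stream the left file in chunks via read_csv's chunksize and merge each chunk against the small right-hand table. A rough, untested sketch along those lines, reusing the file and column names from the question:
import pandas as pd

leftcols = ['a', 'b', 'c', 'd', 'e', 'key']
rightcols = ['x', 'y', 'z', 'key']

# The right table (~160K rows) is small enough to keep in memory whole.
rightdata = pd.read_csv("RIGHT.csv", names=rightcols)

# Stream the left table in 1M-row chunks, merge each chunk, and append
# it to a single output file so the full result never sits in memory.
first = True
for chunk in pd.read_csv("LEFT.csv", names=leftcols, chunksize=1000000):
    merged = chunk.merge(rightdata, on='key', how='left')
    merged.to_csv("FINAL.csv", mode='w' if first else 'a',
                  header=first, index=False)
    first = False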
Answer 3:
As suggested in another question ("Large data" work flows using pandas), dask (http://dask.pydata.org) could be an easy option.
Simple example
import dask.dataframe as dd
df1 = dd.read_csv('df1.csv')
df2 = dd.read_csv('df2.csv')
df_merge = dd.merge(df1, df2, how='left')
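Since the files in the question have no header row, the column names would need to be passed explicitly, and the merge is lazy, so the result has to be written out (or computed) to actually run. Roughly, and untested:
import dask.dataframe as dd

leftcols = ['a', 'b', 'c', 'd', 'e', 'key']
rightcols = ['x', 'y', 'z', 'key']

df1 = dd.read_csv('LEFT.csv', names=leftcols)
df2 = dd.read_csv('RIGHT.csv', names=rightcols)

# The merge is evaluated lazily; writing the output triggers the work.
# dask writes one CSV per partition, substituting '*' with its index.
df_merge = dd.merge(df1, df2, on='key', how='left')
df_merge.to_csv('FINAL-*.csv', index=False)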
Source: https://stackoverflow.com/questions/32635169/memoryerror-with-python-pandas-and-large-left-outer-joins