Question
import csv
import numpy as np
from itertools import islice

f = open("data.csv")
f.seek(0)
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
The above is the code I am using to read a CSV file. The file is only about 800 MB, and I am using a 64-bit system with 8 GB of RAM. The file contains 100 million lines. However, even reading the first 10 million lines, let alone the entire file, gives me a "MemoryError:" (that really is the entire error message).
Could someone tell me why? Also, as a side question, could someone tell me how to start reading from, say, the 20 millionth row? I know I need to use f.seek(some number), but since my data is a CSV file I don't know which byte offset to pass to f.seek() so that it starts reading exactly at the 20 millionth row.
Thank you very much.
Answer 1:
could someone tell me how to start reading from, say, the 20 millionth row? I know I need to use f.seek(some number)
No, you can't (and mustn't) use f.seek() in this situation: CSV rows have variable byte lengths, so there is no way to compute the byte offset of row N without scanning everything before it. Rather, you must read each of the first 20 million rows somehow.
The Python documentation has this recipe:
import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
Using that, you would start after 20,000,000 rows like this:
#UNTESTED
f = open("data.csv")
f_reader = csv.reader(f)
consume(f_reader, 20000000)  # skip the first 20 million parsed rows
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
Or perhaps this might go faster, since it skips the unwanted rows as raw lines instead of paying to parse each one as CSV:
#UNTESTED
f = open("data.csv")
consume(f, 20000000)  # skip the first 20 million raw lines, no CSV parsing
f_reader = csv.reader(f)
raw_data = np.array(list(islice(f_reader, 0, 10000000)), dtype=int)
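As for the MemoryError itself: np.array(list(...)) first builds a list of 10 million Python lists of small Python strings, and the per-object overhead of all those strings easily dwarfs the 800 MB of raw text. Below is a minimal, untested sketch of a lower-overhead variant that streams values straight into a preallocated array with np.fromiter. It assumes every row holds the same number of integer columns; read_block and n_cols are hypothetical names, not from the original answer.

#UNTESTED sketch, not part of the original answer
import csv
import numpy as np
from itertools import chain, islice

def read_block(path, skip, count, n_cols):
    """Parse `count` CSV rows into an int array after skipping `skip` rows.

    np.fromiter with an explicit count preallocates the result, so the
    huge intermediate list of lists (the likely cause of the MemoryError)
    is never built.
    """
    with open(path) as f:
        reader = csv.reader(f)
        next(islice(reader, skip, skip), None)  # the consume() recipe, inlined
        values = map(int, chain.from_iterable(islice(reader, count)))
        arr = np.fromiter(values, dtype=int, count=count * n_cols)
    return arr.reshape(count, n_cols)

For example, read_block("data.csv", 20000000, 10000000, n_cols=5) would return rows 20,000,000 through 29,999,999 as a 10,000,000 x 5 array, assuming five columns per row.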
Source: https://stackoverflow.com/questions/30404256/64-bit-system-8gb-of-ram-a-bit-more-than-800mb-of-csv-and-reading-with-python