Question
I have a very big CSV file (10 GB) and I'd like to read it and create a list of dictionaries, where each dictionary represents a line in the CSV. Something like:
[{'value1': '20150302', 'value2': '20150225', 'value3': '5', 'IS_SHOP': '1', 'value4': '0', 'value5': 'GA321D01H-K12'},
{'value1': '20150302', 'value2': '20150225', 'value3': '1', 'value4': '0', 'value5': '1', 'value6': 'GA321D01H-K12'}]
I'm trying to achieve this with a generator in order to avoid memory issues; my current code is the following:
import csv

def csv_reader():
    with open('export.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield {key: value for key, value in row.items()}

generator = csv_reader()
list = []
for i in generator:
    list.append(i)
The problem is that it basically runs out of memory: the list becomes too big and the process is killed. Is there a way to achieve the same result (a list of dictionaries) in an efficient way? I'm very new to generators/yield, so I don't even know if I'm using them correctly.
I also tried using a virtual environment with PyPy, but it runs out of memory anyway (a little later, though).
Basically, the reason I want a list of dictionaries is that I want to try to convert the CSV into Avro format using fastavro, so any hints on how to use fastavro (https://pypi.python.org/pypi/fastavro) without creating a list of dictionaries would be appreciated.
Answer 1:
If the goal is to convert from CSV to Avro, there is no reason to store a complete list of the input values; that defeats the whole purpose of using the generator. After setting up a schema, fastavro's writer is designed to take an iterable and write it out one record at a time, so you can just pass it the generator directly. Your code would simply omit the step of creating the list (side note: naming a variable list is a bad idea, since it shadows the builtin name list) and write the generator's output directly:
import csv
from fastavro import writer

def csv_reader():
    with open('export.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

# If this is Python 3.3+, the generator body could simplify further to just:
#     with open('export.csv') as f:
#         yield from csv.DictReader(f)

# The schema could be built from the keys of the first row (written manually),
# or you can provide an explicit schema with documentation for each field.
schema = {...}

with open('export.avro', 'wb') as out:
    writer(out, schema, csv_reader())
The generator then produces one row at a time, and writer writes one row at a time. The input rows are discarded after writing, so memory usage remains minimal. If you need to modify the rows, you'd modify the row in the csv_reader generator before yield-ing it.
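For concreteness, here is a minimal sketch of an explicit fastavro-style record schema and a generator that transforms rows before yielding them. The field names and types are assumptions based on the sample rows above (not part of the original answer), and the sketch reads from an in-memory string so it runs with the standard library alone; the actual fastavro call is shown in a comment.

```python
import csv
import io

# Hypothetical fastavro-style schema; names/types guessed from the sample data.
schema = {
    "type": "record",
    "name": "Export",
    "fields": [
        {"name": "value1", "type": "string"},
        {"name": "value2", "type": "string"},
        {"name": "IS_SHOP", "type": "string"},
    ],
}

# Stand-in for export.csv so the sketch is self-contained.
SAMPLE = "value1,value2,IS_SHOP\n20150302,20150225,1\n20150302,20150225,0\n"

def csv_reader(f):
    """Yield one dict per CSV row, modifying each row before yielding it."""
    for row in csv.DictReader(f):
        # Example in-generator transformation: strip stray whitespace.
        row["IS_SHOP"] = row["IS_SHOP"].strip()
        yield row

rows = list(csv_reader(io.StringIO(SAMPLE)))

# With fastavro installed, the real conversion would stream the generator:
# from fastavro import writer
# with open('export.avro', 'wb') as out, open('export.csv') as f:
#     writer(out, schema, csv_reader(f))
```

Collecting the rows into a list here is only to show what the generator yields; in the real conversion the generator is passed straight to writer so no list is ever built.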
Source: https://stackoverflow.com/questions/33919669/creating-list-of-dictionaries-from-big-csv