I looked at other answers but I am still not sure of the right way to do this. I have a number of really large .csv files (could be a gigabyte each), and I want to first
As pointed out several other times, the first two methods do no actual string parsing, they just read a line at a time without extracting fields. I imagine the majority of the speed difference seen in CSV is due to that.
The CSV module is invaluable if you include any textual data that may include more of the 'standard' CSV syntax than just commas, especially if you're reading from an Excel format.
If you've just got lines like "1,2,3,4" you're probably fine with a simple split, but if you have lines like "1,2,'Hello, my name\'s fred'"
you're going to go crazy trying to parse that without errors.
CSV will also transparently handle things like newlines in the middle of a quoted string.
A simple for..in loop without CSV is going to have trouble with that.
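To make the difference concrete, here is a small sketch comparing a plain split with csv.reader. The sample data and the function names are made up for illustration; the second record has a quoted field that contains both a comma and a newline.

```python
import csv
import io

# Made-up sample: the second record has a quoted field with a comma and a newline in it.
SAMPLE = '1,2,3\n4,"Hello, my name is\nFred",6\n'

def parse_with_split(text):
    # Naive approach: one record per physical line, fields split on every comma.
    return [line.split(',') for line in text.splitlines()]

def parse_with_csv(text):
    # csv.reader keeps the quoted comma and the quoted newline inside one field.
    return list(csv.reader(io.StringIO(text, newline='')))

naive_rows = parse_with_split(SAMPLE)
csv_rows = parse_with_csv(SAMPLE)
```

The split version breaks the second record into two mangled rows, while csv.reader returns two rows of three fields each.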
The CSV module has always worked fine for me reading unicode strings if I use it like so:
f = csv.reader(codecs.open(filename, 'rU'))
It is plenty robust for importing multi-thousand-line files with unicode, quoted strings, newlines in the middle of quoted strings, lines with fields missing at the end, etc., all with reasonable read times.
I'd try using it first and only looking for optimizations on top of it if you really need the extra speed.
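On Python 3 the csv module works on text directly, so codecs should not be needed. A minimal sketch, assuming the file is UTF-8 encoded (the encoding default and the name read_rows are my assumptions, not part of the original snippet):

```python
import csv

def read_rows(filename, encoding='utf-8'):
    # newline='' lets the csv module handle line endings itself,
    # including newlines embedded in quoted fields.
    with open(filename, newline='', encoding=encoding) as f:
        return list(csv.reader(f))
```

This still builds the whole list in memory, which is fine for multi-thousand-line files but may not be for gigabyte-sized ones.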
How much do you care about sanitization?
The csv module is really good at understanding different csv file dialects and ensuring that escaping is happening properly, but it's definitely overkill and can often be way more trouble than it's worth (especially if you have unicode!).
A really naive implementation that properly handles escaped commas (\,) would be:

    import re

    def read_csv_naive(file_str):
        with open(file_str, 'r') as file_obj:
            # Split on commas that are not preceded by a backslash.
            return [re.split(r'(?<!\\),', x) for x in file_obj.read().splitlines()]
If your data is simple this will work great. If you have data that might need more escaping, the csv module is probably your most stable bet.
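To illustrate the escaping rule, here is a small sketch (the helper name split_unescaped and the sample strings are made up) that splits only on commas not preceded by a backslash, using a negative lookbehind. For comparison, it also shows that a pattern like [^\\], consumes the character in front of each comma.

```python
import re

def split_unescaped(line):
    # Split on commas that are not immediately preceded by a backslash.
    return re.split(r'(?<!\\),', line)

# The escaped comma stays inside the third field (the backslash is kept as-is).
fields = split_unescaped(r'1,2,Hello\, my name is fred')

# A character class in front of the comma is part of the match, so it is dropped.
lossy = re.split(r'[^\\],', '10,20,30')
```

Here fields has three entries, while lossy loses the trailing digit of the first two values.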
Your first 2 methods are NOT parsing each line into fields. The csv
way is parsing out rows (NOT the same as lines!) of fields.
Do you really need to build a list in memory of all the lines?
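If not, the rows can be processed one at a time as they are read. A rough sketch, assuming Python 3 and a UTF-8 file (iter_rows and count_rows are illustrative names):

```python
import csv

def iter_rows(filename, encoding='utf-8'):
    # Yield one parsed row at a time; only the current row is held in memory.
    with open(filename, newline='', encoding=encoding) as f:
        for row in csv.reader(f):
            yield row

def count_rows(filename):
    # Example consumer: counts rows without building a list of them.
    return sum(1 for _ in iter_rows(filename))
```

Note that this counts rows, not physical lines: a quoted field containing a newline still belongs to a single row.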
One approach to reading a large CSV file is to use a child process to read it in chunks. Open the file to get the file resource object. Create a child process with the resource as its argument. Read a set of lines as a chunk. Repeat these three steps until you reach the end of the file.
    from multiprocessing import Process

    def child_process(name):
        # Do the Read and Process stuff here.
        pass

    if __name__ == '__main__':
        # Get file object resource.
        .....
        p = Process(target=child_process, args=(resource,))
        p.start()
        p.join()
For the full code, see this link: http://besttechlab.wordpress.com/2013/12/14/read-csv-file-in-python/
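As a rough, self-contained sketch of the chunking idea (not the code from the link): this variant uses multiprocessing.Pool instead of one Process per chunk, and it passes parsed rows to the workers rather than the file object, since an open file object cannot be pickled. The names iter_chunks, process_chunk and process_file, the default chunk size, and the UTF-8 encoding are all assumptions.

```python
import csv
from itertools import islice
from multiprocessing import Pool

def iter_chunks(rows, chunk_size):
    # Group an iterable of rows into lists of at most chunk_size rows.
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

def process_chunk(chunk):
    # Placeholder work: count the fields in the chunk. Replace with real processing.
    return sum(len(row) for row in chunk)

def process_file(filename, chunk_size=10000, workers=2):
    # The parent parses the file and hands chunks of rows to the worker processes.
    with open(filename, newline='', encoding='utf-8') as f, Pool(workers) as pool:
        return sum(pool.imap(process_chunk, iter_chunks(csv.reader(f), chunk_size)))
```

Call process_file from under an if __name__ == '__main__': guard. Because the parsing itself still happens in the parent, this is only likely to help when the per-chunk processing is the expensive part.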