group_id, application_id, reading
and data could look like ...

Sed one-liner:
sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile
The only downside is that you need to put in n -e statements, one per group_id (represented by the ellipsis, which shouldn't appear in the final command), so this one-liner might be a pretty long line.
The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no Python is needed. Plus, it's a one-freaking-liner!
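If the set of group_ids isn't known up front, a short Python sketch can generate the command for you, at the cost of one extra pass over the file to collect the distinct ids. This assumes the OriginalFile name and the FileN naming from the one-liner above:

import csv

# One extra pass to discover the distinct group_ids in column one.
with open('OriginalFile') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    group_ids = sorted({row[0] for row in reader if row})

# Emit one -e statement per group_id, mirroring the one-liner above.
cmd = 'sed ' + ' '.join("-e '/^{0},/wFile{0}'".format(gid) for gid in group_ids) + ' OriginalFile'
print(cmd)

Paste the printed command into your shell, or pipe it to sh.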
If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).
Here's some food for thought for you:
import csv
from collections import namedtuple

# Pair each open file handle with its csv writer.
csvfile = namedtuple('csvfile', ('file', 'writer'))

class CSVFileCollections(object):
    """Lazily opens one writer per key and closes them all on exit."""

    def __init__(self, prefix, postfix):
        self.prefix = prefix
        self.postfix = postfix
        self.files = {}

    def __getitem__(self, item):
        # Open the output file for this key on first use.
        if item not in self.files:
            file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
            writer = csv.writer(file, delimiter=',', quotechar="'",
                                quoting=csv.QUOTE_MINIMAL)
            self.files[item] = csvfile(file, writer)
        return self.files[item].writer

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for entry in self.files.values():
            entry.file.close()

with open('huge.csv', newline='') as readFile, CSVFileCollections('output', '.csv') as output:
    reader = csv.reader(readFile, delimiter=',', quotechar="'")
    for row in reader:
        # row[0] is the group_id; route the row to that group's writer.
        writer = output[row[0]]
        writer.writerow(row)
If the file is already sorted by group_id, you can do something like:
import csv
from itertools import groupby

with open("foo.csv") as source:
    # Consecutive rows sharing row[0] (the group_id) form one group.
    for key, rows in groupby(csv.reader(source), lambda row: row[0]):
        with open("%s.txt" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")
awk is capable of this, too:
awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv
How about: split() each line on ',' to get the group_id, then write the line to that group's file?
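A minimal sketch of that idea, assuming the question's huge.csv as input and a hypothetical group_<id>.csv naming scheme: keep one open handle per group_id and append each raw line to the matching file.

handles = {}
try:
    with open('huge.csv') as source:
        for line in source:
            # split() each line on ',' to get the group_id
            group_id = line.split(',', 1)[0]
            if group_id not in handles:
                # Hypothetical naming scheme: one output file per group.
                handles[group_id] = open('group_%s.csv' % group_id, 'w')
            handles[group_id].write(line)
finally:
    # Close every per-group file even if the loop fails midway.
    for handle in handles.values():
        handle.close()

Like the awk one-liner, this makes a single pass and keeps each output file open for reuse; unlike the csv-based versions, it assumes the group_id field never contains quoted commas.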