How to split a huge CSV file based on the content of the first column?

一个人的身影 2020-12-01 17:01
  • I have a huge CSV file (250MB+) to upload.
  • The file format is group_id, application_id, reading, and the data could look like:


        
7 Answers
  • 2020-12-01 17:29

    Sed one-liner:

    sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile 
    

    The only downside is that you need n -e expressions, one per group id (represented here by the ellipsis, which shouldn't appear in the final command), so the one-liner can get pretty long.

    The upsides, though, are that it makes only one pass through the file, assumes no sorting, and needs no Python. Plus, it's a one-freaking-liner!
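
    If the distinct group ids aren't known up front, the -e expressions can be generated rather than typed by hand. A minimal Python sketch of that idea (the id list and file names below are illustrative assumptions, not from the question):

    import subprocess

    # Hypothetical ids; in practice, collect the distinct values of column one first.
    group_ids = ["1", "2", "3"]
    args = ["sed"]
    for gid in group_ids:
        # One expression per id, mirroring the hand-written one-liner above.
        args += ["-e", "/^%s,/wFile%s" % (gid, gid)]
    # sed also echoes every input line to stdout by default, so discard that copy.
    subprocess.run(args + ["OriginalFile"], check=True, stdout=subprocess.DEVNULL)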

  • 2020-12-01 17:32

    If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).
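
    A minimal sketch of that approach (the input filename is an assumption; the rows must already be sorted by group_id):

    import csv
    from itertools import groupby

    # groupby only merges *consecutive* rows that share a key, hence the
    # sorted-input requirement.
    with open("input.csv", newline="") as source:  # hypothetical filename
        reader = csv.reader(source)
        for group_id, rows in groupby(reader, key=lambda row: row[0]):
            with open("group_%s.csv" % group_id, "w", newline="") as out:
                csv.writer(out).writerows(rows)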

  • 2020-12-01 17:32

    Here's some food for thought for you:

    import csv
    from collections import namedtuple
    
    csvfile = namedtuple('csvfile', ('file', 'writer'))
    
    class CSVFileCollections(object):
    
        def __init__(self, prefix, postfix):
            self.prefix = prefix
            self.postfix = postfix  # needed by __getitem__ below
            self.files = {}
    
        def __getitem__(self, item):
            # Lazily open one output file (and csv writer) per distinct key.
            if item not in self.files:
                file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
                writer = csv.writer(file, delimiter=',', quotechar="'", quoting=csv.QUOTE_MINIMAL)
                self.files[item] = csvfile(file, writer)
            return self.files[item].writer
    
        def __enter__(self):
            # Return self so "with ... as output" binds the collection.
            return self
    
        def __exit__(self, exc_type, exc_value, traceback):
            for entry in self.files.values():
                entry.file.close()
    
    
    with open('huge.csv', newline='') as readFile, CSVFileCollections('output', '.csv') as output:
        reader = csv.reader(readFile, delimiter=",", quotechar="'")
        for row in reader:
            writer = output[row[0]]
            writer.writerow(row)
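
    Unlike the groupby-based answers, this doesn't require the input to be sorted: each group's file is opened once, the first time its key appears, and stays open until the with block exits.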
    
  • 2020-12-01 17:33

    If the file is already sorted by group_id, you can do something like:

    import csv
    from itertools import groupby
    
    with open("foo.csv", newline="") as source:
        for key, rows in groupby(csv.reader(source), lambda row: row[0]):
            with open("%s.txt" % key, "w") as output:
                for row in rows:
                    output.write(",".join(row) + "\n")
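
    Note that groupby only merges consecutive rows that share a key, so if the file isn't actually sorted, a group id that reappears later reopens its file in "w" mode and overwrites the earlier rows.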
    
  • 2020-12-01 17:40

    awk is capable:

     awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv
    
  • 2020-12-01 17:44

    How about:

    • Read the input file a line at a time
    • split() each line on , to get the group_id
    • For each new group_id you find, open an output file
      • add each group_id to a set/dict as you find it, so you can keep track of which files are already open
    • write the line to the appropriate file
    • Done! (see the sketch below)
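
    A minimal sketch of that loop (the filenames are assumptions; csv.reader stands in for a bare split(",") so that quoted fields survive):

    import csv

    outputs = {}  # group_id -> (file, writer): each output file is opened exactly once
    try:
        with open("input.csv", newline="") as source:  # hypothetical filename
            for row in csv.reader(source):
                group_id = row[0]
                if group_id not in outputs:
                    f = open("group_%s.csv" % group_id, "w", newline="")
                    outputs[group_id] = (f, csv.writer(f))
                outputs[group_id][1].writerow(row)
    finally:
        for f, _ in outputs.values():
            f.close()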