How to split a huge CSV file based on the content of the first column?

一个人的身影 2020-12-01 17:01
  • I have a huge CSV file (250MB+) to upload.
  • The file format is group_id, application_id, reading, and the data could look like:


        
7 Answers
  • 2020-12-01 17:29

    Sed one-liner:

    sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile 
    

    The only downside is that you need n -e expressions, one per group id (represented here by the ellipsis, which shouldn't appear in the final command), so the one-liner can get pretty long.

    The upsides, though, are that it makes only one pass through the file, assumes no sorting, and needs no Python. Plus, it's a one-freaking-liner!
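
    If the distinct group ids aren't known up front, the -e expressions can be generated rather than typed by hand. A minimal Python sketch of that idea (the id list and file names below are illustrative assumptions, not from the question):

    import subprocess

    # Hypothetical ids; in practice, collect the distinct values of column one first.
    group_ids = ["1", "2", "3"]
    args = ["sed"]
    for gid in group_ids:
        # One expression per id, mirroring the hand-written one-liner above.
        args += ["-e", "/^%s,/wFile%s" % (gid, gid)]
    # sed also echoes every input line to stdout by default, so discard that copy.
    subprocess.run(args + ["OriginalFile"], check=True, stdout=subprocess.DEVNULL)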

  • 2020-12-01 17:32

    If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).
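
    A minimal sketch of that approach (the input filename is an assumption; the rows must already be sorted by group_id):

    import csv
    from itertools import groupby

    # groupby only merges *consecutive* rows that share a key, hence the
    # sorted-input requirement.
    with open("input.csv", newline="") as source:  # hypothetical filename
        reader = csv.reader(source)
        for group_id, rows in groupby(reader, key=lambda row: row[0]):
            with open("group_%s.csv" % group_id, "w", newline="") as out:
                csv.writer(out).writerows(rows)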

  • 2020-12-01 17:32

    Here's some food for thought for you:

    import csv
    from collections import namedtuple
    
    csvfile = namedtuple('csvfile', ('file', 'writer'))
    
    class CSVFileCollections(object):
    
        def __init__(self, prefix, postfix):
            self.prefix = prefix
            self.postfix = postfix  # needed by __getitem__ below
            self.files = {}
    
        def __getitem__(self, item):
            # Lazily open one output file (and csv writer) per distinct key.
            if item not in self.files:
                file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
                writer = csv.writer(file, delimiter=',', quotechar="'", quoting=csv.QUOTE_MINIMAL)
                self.files[item] = csvfile(file, writer)
            return self.files[item].writer
    
        def __enter__(self):
            # Return self so "with ... as output" binds the collection.
            return self
    
        def __exit__(self, exc_type, exc_value, traceback):
            for entry in self.files.values():
                entry.file.close()
    
    
    with open('huge.csv', newline='') as readFile, CSVFileCollections('output', '.csv') as output:
        reader = csv.reader(readFile, delimiter=",", quotechar="'")
        for row in reader:
            writer = output[row[0]]
            writer.writerow(row)
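
    Unlike the groupby-based answers, this doesn't require the input to be sorted: each group's file is opened once, the first time its key appears, and stays open until the with block exits.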
    
  • 2020-12-01 17:33

    If the file is already sorted by group_id, you can do something like:

    import csv
    from itertools import groupby
    
    with open("foo.csv", newline="") as source:
        for key, rows in groupby(csv.reader(source), lambda row: row[0]):
            with open("%s.txt" % key, "w") as output:
                for row in rows:
                    output.write(",".join(row) + "\n")
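
    Note that groupby only merges consecutive rows that share a key, so if the file isn't actually sorted, a group id that reappears later reopens its file in "w" mode and overwrites the earlier rows.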
    
  • 2020-12-01 17:40

    awk is capable:

     awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv
    
  • 2020-12-01 17:44

    How about:

    • Read the input file a line at a time
    • split() each line on , to get the group_id
    • For each new group_id you find, open an output file
      • add each group_id to a set/dict as you find it, so you can keep track of which files are already open
    • write the line to the appropriate file
    • Done! (see the sketch below)
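
    A minimal sketch of that loop (the filenames are assumptions; csv.reader stands in for a bare split(",") so that quoted fields survive):

    import csv

    outputs = {}  # group_id -> (file, writer): each output file is opened exactly once
    try:
        with open("input.csv", newline="") as source:  # hypothetical filename
            for row in csv.reader(source):
                group_id = row[0]
                if group_id not in outputs:
                    f = open("group_%s.csv" % group_id, "w", newline="")
                    outputs[group_id] = (f, csv.writer(f))
                outputs[group_id][1].writerow(row)
    finally:
        for f, _ in outputs.values():
            f.close()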