How can I break down a large CSV file into small files based on common records in Python?


谎友^ 2021-01-29 14:18

What I want to do:

I have a big .csv file. I want to break it down into many small files based on the common records in the BB column.

2 answers

日久生厌 (OP)

2021-01-29 14:50

For the data you have provided, the following script will produce your requested output files. It will perform this operation on ALL CSV files found in the folder:

from itertools import groupby
import glob
import csv
import os

def remove_unwanted(rows):
    return [['' if col == 'NULL' else col for col in row[2:]] for row in rows]

output_folder = 'temp'  # make sure this folder exists

# Search for ALL CSV files in the current folder
for csv_filename in glob.glob('*.csv'):
    with open(csv_filename) as f_input:
        basename = os.path.splitext(os.path.basename(csv_filename))[0]      # e.g. bigfile

        csv_input = csv.reader(f_input)
        header = next(csv_input)
        # Create a list of entries with '0' in last column
        id_list = remove_unwanted(row for row in csv_input if row[7] == '0')
        f_input.seek(0)     # Go back to the start
        header = remove_unwanted([next(csv_input)])

        # groupby only merges adjacent rows, so the input must already be ordered by the second column
        for k, g in groupby(csv_input, key=lambda x: x[1]):
            if k == '':
                break

            # Format an output file name in the form 'bigfile_53.csv'
            file_name = os.path.join(output_folder, '{}_{}.csv'.format(basename, k))

            with open(file_name, 'wb') as f_output:
                csv_output = csv.writer(f_output)
                csv_output.writerows(header)
                csv_output.writerows(remove_unwanted(g))
                csv_output.writerows(id_list)

This creates the files bigfile_53.csv, bigfile_59.csv and bigfile_61.csv in an output folder called temp. Each file contains the header row, the rows for that group, and then the rows that have 0 in the last column.

Entries containing the string 'NULL' will be converted to an empty string, and the first two columns will be removed (as per OP's comment).

Tested in Python 2.7.9
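
The script above targets Python 2: it opens the output in 'wb' mode, which csv.writer rejects on Python 3. Here is a rough Python 3 sketch of the same idea, not a drop-in replacement. The names split_csv and clean, and the key_col and flag_col parameters, are made up for illustration. It assumes every data row has at least 8 columns and that the group values are safe to use in file names. Unlike the groupby version, it collects groups in a dictionary, so the input does not have to be sorted. It skips rows with an empty key instead of stopping at the first one.

```python
import csv
import os
from collections import defaultdict


def clean(row, drop=2):
    """Drop the first `drop` columns and turn 'NULL' cells into ''."""
    return ['' if col == 'NULL' else col for col in row[drop:]]


def split_csv(csv_filename, output_folder, key_col=1, flag_col=7):
    """Write one file per distinct value in key_col; return the paths written.

    key_col and flag_col are hypothetical parameters mirroring the
    hard-coded x[1] and row[7] in the original script.
    """
    basename = os.path.splitext(os.path.basename(csv_filename))[0]
    os.makedirs(output_folder, exist_ok=True)

    # newline='' is what the csv module expects on Python 3
    with open(csv_filename, newline='') as f_input:
        reader = csv.reader(f_input)
        header = next(reader)
        rows = [row for row in reader if row]  # ignore blank lines

    # Rows flagged with '0' are appended to every output file
    shared = [clean(row) for row in rows if row[flag_col] == '0']

    # Collect rows per key; rows with an empty key are skipped
    groups = defaultdict(list)
    for row in rows:
        if row[key_col] != '':
            groups[row[key_col]].append(clean(row))

    written = []
    for key, members in groups.items():
        # File name in the form 'bigfile_53.csv'
        file_name = os.path.join(output_folder, '{}_{}.csv'.format(basename, key))
        with open(file_name, 'w', newline='') as f_output:
            writer = csv.writer(f_output)
            writer.writerow(clean(header))
            writer.writerows(members)
            writer.writerows(shared)
        written.append(file_name)
    return written
```

Calling split_csv('bigfile.csv', 'temp') would then write one file per distinct value in the second column. Reading all rows into memory keeps the sketch short; for a file too large for that, sorting first and streaming with groupby may be the better trade-off.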
