Question
I have multiple CSV files with the same number of columns but a different column order in each. I want to merge them and remove duplicates. The other solutions here don't take column order into account, so the merged output is incorrect. How can I do this from the Windows command line (e.g. logparser) or in bash?
A Python script that achieves this would also work.
Answer 1:
The following script works properly if:
- the CSVs aren't too big (i.e. they fit in memory)
- the first row of each CSV contains the column names
You only have to fill in files and final_headers.
import csv

files = ['c1.csv', 'c2.csv', 'c3.csv']
final_headers = ['col1', 'col2', 'col3']

merged_rows = set()
for f in files:
    with open(f, 'r', newline='') as csv_in:
        csvreader = csv.reader(csv_in, delimiter=',')
        # Map each header name to its column index in this particular file
        headers = dict((h, i) for i, h in enumerate(next(csvreader)))
        for row in csvreader:
            # Reorder the row to match final_headers, then let the set deduplicate
            merged_rows.add(tuple(row[headers[x]] for x in final_headers))

with open('output.csv', 'w', newline='') as csv_out:
    csvwriter = csv.writer(csv_out, delimiter=',')
    csvwriter.writerow(final_headers)  # write the header row first
    csvwriter.writerows(merged_rows)
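For example, given two hypothetical inputs whose columns are ordered differently, the duplicate row collapses once every row is normalized to the final_headers order (the row order in the output is arbitrary, since the rows come from a set):

c1.csv:
col1,col2,col3
a,b,c
d,e,f

c2.csv:
col2,col3,col1
b,c,a
h,i,g

output.csv:
col1,col2,col3
a,b,c
d,e,f
g,h,i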
Answer 2:
csvkit's csvjoin can do that:
csvjoin -c "Column 1,Column 2" --outer file1.csv file2.csv
Here -c names the column to join on in each file (one per file, in the order the files are listed), and --outer performs a full outer join instead of the default inner join.
Answer 3:
Personally, I would separate the two tasks of merging files and removing duplicates. I would also recommend using a database instead of CSV files if that's an option, since managing columns in a database is easier.
Here is an example using Python, which has a csv library that is easy to use.
import csv

with open(srcPath, 'r', newline='') as srcCSV:
    csvReader = csv.reader(srcCSV, delimiter=',')
    with open(destPath, 'w', newline='') as destCSV:
        csvWriter = csv.writer(destCSV, delimiter=',')
        for record in csvReader:
            # writerow takes a single list; arrange the columns in whatever order you need
            csvWriter.writerow([record[1], record[3], record[2]])  # ..., record[n]
This allows you to rewrite the columns in any order you choose. The destination CSV could be an existing file that you extend, or a new one with a better format. Using the csv library helps prevent the quoting and parsing errors that hand-rolled string splitting can introduce.
Once the data is consolidated, you can use the same library to iterate over the single consolidated file and identify duplicate records, as sketched after the note below.
Note: this method reads and writes the files one line at a time, so it can process files of any size. I used it to consolidate 221 million records from files as large as 6 GB each.
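A minimal sketch of that deduplication pass (hypothetical file names; note that the seen set keeps one entry per distinct row, so memory grows with the number of unique records rather than with the file size):

import csv

seen = set()
with open('consolidated.csv', 'r', newline='') as src, \
        open('deduped.csv', 'w', newline='') as dest:
    csvReader = csv.reader(src, delimiter=',')
    csvWriter = csv.writer(dest, delimiter=',')
    for record in csvReader:
        key = tuple(record)   # lists aren't hashable, so convert each row to a tuple
        if key not in seen:   # write each distinct record only once
            seen.add(key)
            csvWriter.writerow(record)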
Source: https://stackoverflow.com/questions/23283363/merge-csv-files-with-different-column-order-remove-duplicates