Compare 2 separate CSV files and write the difference to a new CSV file - Python 2.7

Question

I am trying to compare two CSV files in Python 2.7 and save the difference to a third CSV file.

import csv

f1 = open ("olddata/file1.csv")
oldFile1 = csv.reader(f1)
oldList1 = []
for row in oldFile1:
    oldList1.append(row)

f2 = open ("newdata/file2.csv")
oldFile2 = csv.reader(f2)
oldList2 = []
for row in oldFile2:
    oldList2.append(row)

f1.close()
f2.close()

set1 = tuple(oldList1)
set2 = tuple(oldList2)

print oldList2.difference(oldList1)

I get the error message:

Traceback (most recent call last):
  File "compare.py", line 21, in <module>
    print oldList2.difference(oldList1)
AttributeError: 'list' object has no attribute 'difference'

I am new to Python, and to coding in general, and this code is not finished yet: I still have to store the differences in a variable and write them to a new CSV file. I have been trying to solve this all day without success. Your help would be greatly appreciated.

Answer 1:

What do you mean by difference? There are two possible interpretations.

If two rows count as the same only when all of their columns match, the following code prints the rows of the first file that do not appear in the second:

import csv

f1 = open ("olddata/file1.csv")
oldFile1 = csv.reader(f1)
oldList1 = []
for row in oldFile1:
    oldList1.append(row)

f2 = open ("newdata/file2.csv")
oldFile2 = csv.reader(f2)
oldList2 = []
for row in oldFile2:
    oldList2.append(row)

f1.close()
f2.close()

print [row for row in oldList1 if row not in oldList2]

However, if two rows count as the same whenever a certain key field (i.e. column) matches, the following code gives the answer:

import csv

f1 = open ("olddata/file1.csv")
oldFile1 = csv.reader(f1)
oldList1 = []
for row in oldFile1:
    oldList1.append(row)

f2 = open ("newdata/file2.csv")
oldFile2 = csv.reader(f2)
oldList2 = []
for row in oldFile2:
    oldList2.append(row)

f1.close()
f2.close()

keyfield = 0 # Change this for choosing the column number

oldList2keys = [row[keyfield] for row in oldList2]
print [row for row in oldList1 if row[keyfield] not in oldList2keys]
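
As a minimal sketch of the same key-field comparison, the keys of the second file can be collected into a set so that each lookup does not have to scan a list. The function name `rows_missing_by_key` and the sample rows are made up for illustration; the syntax runs on both Python 2.7 and Python 3.

```python
def rows_missing_by_key(old_rows, new_rows, keyfield=0):
    """Return the rows of old_rows whose key column value never occurs in new_rows."""
    # Build the set of keys once; membership tests on a set are hash lookups.
    new_keys = set(row[keyfield] for row in new_rows)
    return [row for row in old_rows if row[keyfield] not in new_keys]


# Hypothetical sample data standing in for oldList1 and oldList2.
old_rows = [["1", "alice"], ["2", "bob"], ["3", "carol"]]
new_rows = [["1", "alice smith"], ["3", "carol"]]

missing = rows_missing_by_key(old_rows, new_rows)
```

Here row "1" is not reported even though its second column changed, because only the key column is compared.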

Note: The list-based code above can be slow for very large files, because every "not in" test scans a whole list. To speed up the whole-row comparison through hashing, convert each row to a tuple and build sets from the two lists:

set1 = set(tuple(row) for row in oldList1)
set2 = set(tuple(row) for row in oldList2)

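After this, set1.difference(set2) gives the rows that are in the first file but not in the second. Note that a set keeps neither the original row order nor duplicate rows.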
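
A small sketch of the set-based approach, under the assumption that rows are lists of strings as returned by csv.reader. It also shows one possible way to keep the original row order by using the set only for lookups; `rows_only_in_first` and the sample rows are hypothetical names and data.

```python
def rows_only_in_first(first_rows, second_rows):
    """Return rows of first_rows that are absent from second_rows, in first_rows order."""
    second_set = set(tuple(row) for row in second_rows)
    return [row for row in first_rows if tuple(row) not in second_set]


# Hypothetical sample data standing in for oldList1 and oldList2.
old_rows = [["1", "alice"], ["2", "bob"], ["3", "carol"]]
new_rows = [["1", "alice"], ["3", "carol"], ["4", "dave"]]

set1 = set(tuple(row) for row in old_rows)
set2 = set(tuple(row) for row in new_rows)

unordered_difference = set1.difference(set2)          # a set of tuples
ordered_difference = rows_only_in_first(old_rows, new_rows)  # a list of lists
```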

Answer 2:

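import csv

def read_csv_file(filename):
    # Read every row of the CSV file into a list of lists.
    res = []
    with open(filename) as f:
        for line in csv.reader(f):
            res.append(line)
    return res


oldList1 = read_csv_file("olddata/file1.csv")
oldList2 = read_csv_file("newdata/file2.csv")


difference_list = []

# Compare the two files row by row, in order.
for a, b in zip(oldList1, oldList2):
    if a != b:
        # a and b are lists of fields, so join each one before concatenating.
        difference_list.append(','.join(a) + '\t' + ','.join(b))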

You end up with a list of the differing row pairs, which you can then write to a file.
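
Writing the result out could look roughly like the sketch below. It assumes the differences are stored as a list of rows (each a list of strings); the helper `write_rows`, the output name "difference.csv" and the sample rows are all hypothetical. The file is opened in binary mode on Python 2.7 and in text mode with newline="" on Python 3, which is what the csv module expects on each version.

```python
import csv
import os
import sys
import tempfile


def write_rows(rows, path):
    """Write each row (a list of strings) as one line of a CSV file."""
    if sys.version_info[0] < 3:
        out = open(path, "wb")             # Python 2.7: binary mode
    else:
        out = open(path, "w", newline="")  # Python 3: text mode, no newline translation
    with out:
        csv.writer(out).writerows(rows)


# Hypothetical stand-in for the differences computed earlier.
difference_rows = [["3", "carol"], ["4", "dave, jr."]]

# Write into a temporary directory so the sketch does not touch real data.
out_path = os.path.join(tempfile.mkdtemp(), "difference.csv")
write_rows(difference_rows, out_path)

# Read the file back to check that the rows round-trip.
with open(out_path) as f:
    written_rows = list(csv.reader(f))
```

The csv writer quotes fields that contain commas, so a value such as "dave, jr." survives the round trip as a single field.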

EDIT: The code above compares rows position by position, so two files whose rows are [a,b,c] and [b,c,a] are reported as different. If you know that [a,b,c] vs [b,c,a] should return no difference, use the following code instead.

import csv

def read_csv_file(filename):
    # Read every row of the CSV file into a list of lists.
    res = []
    with open(filename) as f:
        for line in csv.reader(f):
            res.append(line)
    return res


oldList1 = read_csv_file("olddata/file1.csv")
oldList2 = read_csv_file("newdata/file2.csv")


difference_list = []

# Keep each row of the first file that appears nowhere in the second file,
# regardless of its position.
for a in oldList1:
    if a not in oldList2:
        difference_list.append(a)

Answer 3:

The error is correct: a list has no "difference" method, and neither does a tuple.

I guess you want to use a set, with the rows converted to tuples so that the elements are immutable and therefore hashable?

set1 = set([tuple(item) for item in oldList1])
set2 = set([tuple(item) for item in oldList2])
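
To illustrate why the conversion to tuples is needed, here is a short sketch with made-up rows: building a set directly from lists raises TypeError because lists are unhashable, while a set of tuples works and supports difference in both directions. The variable names `removed` and `added` are only illustrative.

```python
# Hypothetical sample data standing in for oldList1 and oldList2.
old_rows = [["1", "alice"], ["2", "bob"]]
new_rows = [["1", "alice"], ["3", "carol"]]

try:
    set(old_rows)            # each element is a list, which is mutable
    lists_are_hashable = True
except TypeError:
    lists_are_hashable = False

set1 = set([tuple(item) for item in old_rows])
set2 = set([tuple(item) for item in new_rows])

removed = set1.difference(set2)  # rows only in the old file
added = set2.difference(set1)    # rows only in the new file
```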

Source: https://stackoverflow.com/questions/30852710/compare-2-seperate-csv-files-and-write-difference-to-a-new-csv-file-python-2-7

Tags

python

python-2.7

csv

compare