I have two .csv files where the first line in file 1 is:
MPID,Title,Description,Model,Category ID,Category Description,Subcategory ID,Subcategory Description
You'll need to look at the join command in the shell. You will also need to sort the data, and probably discard the header lines. The whole process will fall flat if any of the data contains commas. Alternatively, you can pre-process the data with a CSV-aware tool that introduces a different field separator (perhaps control-A) that you can use to split fields unambiguously.
The alternative, using Python, reads the two files into a pair of dictionaries (keyed on the common column(s)) and then uses a loop over all the elements in the smaller of the two dictionaries, looking for matching values in the other. (This is basic nested-loop query processing.)
sort -t , -k index1 file1 > sorted1
sort -t , -k index2 file2 > sorted2
join -t , -1 index1 -2 index2 -a 1 -a 2 sorted1 sorted2
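For a concrete (hypothetical) invocation, assume the join key MPID is the first column of both files and that no field contains an embedded comma; tail strips the header line before sorting:

tail -n +2 file1.csv | sort -t , -k 1,1 > sorted1
tail -n +2 file2.csv | sort -t , -k 1,1 > sorted2
join -t , -1 1 -2 1 -a 1 -a 2 sorted1 sorted2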
This is the classical "relational join" problem.
You have several algorithms.
Nested Loops. You read from one file to pick a "master" record. You read the entire other file locating all "detail" records that match the master. This is a bad idea for anything but small files, since it rereads the detail file once per master record.
Sort-Merge. You sort each file into a temporary copy based on the common key. You then merge both files by reading from the master and then reading all matching rows from the detail and writing the merged records.
Lookup. You read one of the files entirely into a dictionary in memory, indexed by the key field. This can be tricky for the detail file, where you'll have multiple children per key. Then you read the other file and lookup the matching records in the dictionary.
Of these, sort-merge is often the fastest; the sorting can be done entirely with the unix sort command. A minimal sketch of the merge step follows.
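Here is a minimal Python sketch of the merge step, assuming both files have already been sorted on the MPID column (while keeping their header rows intact), that plain string comparison of the keys matches that sort order, and that MPID is unique in the master file; the function name and file names are illustrative:

import csv

def sort_merge_join(master_path, detail_path, key='MPID'):
    # Both inputs must already be sorted on `key`.
    with open(master_path, newline='') as mf, open(detail_path, newline='') as df:
        master = csv.DictReader(mf)
        detail = csv.DictReader(df)
        d = next(detail, None)
        for m in master:
            # Skip detail rows that sort before the current master key.
            while d is not None and d[key] < m[key]:
                d = next(detail, None)
            # Emit one merged record per matching detail row.
            while d is not None and d[key] == m[key]:
                merged = dict(m)
                merged.update(d)
                yield merged
                d = next(detail, None)

for row in sort_merge_join("master_sorted.csv", "detail_sorted.csv"):
    print(row)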
Lookup Implementation
import csv
import collections

# Index every row of the first file by MPID; defaultdict(list) handles
# multiple rows per key.
index = collections.defaultdict(list)
with open("someFile", newline="") as file1:
    rdr = csv.DictReader(file1)
    for row in rdr:
        index[row['MPID']].append(row)

# Stream the second file and print each row with its matching rows.
with open("anotherFile", newline="") as file2:
    rdr = csv.DictReader(file2)
    for row in rdr:
        print(row, index[row['MPID']])
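Note that this holds all of the first file in memory. A row in the second file with no counterpart prints an empty list, and rows of the first file that are never matched are not reported at all.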
You could take a look at my FOSS project CSVfix, which is a stream editor for manipulating CSV files. It supports joins, among its other features, and requires no scripting to use.
It seems that you're trying to do in a shell script what is commonly done with an SQL database. Is it possible to use SQL for that task? For example, you could import both files into MySQL, then perform the join, then export the result back to CSV.
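As a self-contained sketch of that idea, here is the same join done with SQLite through Python's standard sqlite3 module (standing in for MySQL); the table names, file names, and the MPID join column are illustrative assumptions:

import csv
import sqlite3

conn = sqlite3.connect(":memory:")

def load(conn, table, path):
    # Create a table whose columns match the CSV header, then bulk-insert
    # the rows. (String-formatted SQL is fine for a one-off script, but
    # don't do this with untrusted input.)
    with open(path, newline="") as f:
        rdr = csv.reader(f)
        header = next(rdr)
        cols = ", ".join('"%s"' % h for h in header)
        conn.execute("CREATE TABLE %s (%s)" % (table, cols))
        marks = ", ".join("?" * len(header))
        conn.executemany("INSERT INTO %s VALUES (%s)" % (table, marks), rdr)

load(conn, "t1", "file1.csv")
load(conn, "t2", "file2.csv")

# Join on the shared MPID column and export the result as CSV.
cur = conn.execute("SELECT * FROM t1 JOIN t2 USING (MPID)")
with open("joined.csv", "w", newline="") as out:
    w = csv.writer(out)
    w.writerow([c[0] for c in cur.description])
    w.writerows(cur)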
For merging multiple files (even more than two) based on one or more common columns, one of the most convenient approaches in Python is to use "brewery". You can specify which fields need to be considered for merging and which fields need to be saved.
import brewery
from brewery import ds
import sys
sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]
Create a list of all fields, and add a "file" field to store information about the origin of each data record. Go through the source definitions and collect the fields:
all_fields = ["file"]
for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)

Create the output target with the combined field list:
out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()
Go through all the sources and append their records to the output:

for source in sources:
    path = source["file"]

    # Initialize the data source and skip the header row
    # (use XLSDataSource for XLS files). We ignore the fields in the
    # header because we have set up the fields previously.
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()

    for record in src.records():
        # Add the file reference to the output, to know where each row came from
        record["file"] = path
        out.append(record)

    # Close the source stream
    src.finalize()
The merged output can then be pretty-printed from the shell:

cat merged.csv | brewery pipe pretty_printer