Question
I have a set of 100 files: 50 contain census information, one for each US state, and the other 50 contain geographic data that needs to be merged with the matching census file for each state.
For each state, the census file and its corresponding geo file are related by a common variable, LOGRECNO, that is the 10th column in the census file and the 7th column in the geo file.
The problem is that the geo file has more rows than the census file; my census data does not cover certain subsets of geographic locations and hence has fewer rows than the geo data file.
How can I merge the census data with the geographic data, keeping only the rows/geo locations where census data exists? (I don't care about the rest.)
I am a newbie to Python and only know how to do basic CSV file I/O. Manipulating two CSV files at the same time is proving confusing.
Example:
sample_state_census.csv
Varname 1 Varname 2 ... Varname 10 (LOGRECNO) ... Varname 16000
xxx xxx ... 1 ... xxx
xxx xxx ... 2 ... xxx
...
...
xxx xxx ... 514 ... xxx
xxx xxx ... 1312 ... xxx
...
...
xxx xxx ... 1500 ... xxx
sample_state_geo.csv
GeoVarname 1 GeoVarname 2 ... GeoVarname 7 (LOGRECNO) ... GeoVarname 65
yyy yyy ... 1 ... yyy
yyy yyy ... 2 ... yyy
...
...
yyy yyy ... 514 ... yyy
yyy yyy ... 515 ... yyy
...
...
yyy yyy ... 1500 ... yyy
Expected output (don't merge rows for values of LOGRECNO that don't exist in sample_state_census.csv)
Varname 1 Varname 2 ... Varname 10 (LOGRECNO) GeoVarname 1 GeoVarname 2 ... GeoVarname 65 Varname 11... Varname 16000
xxx xxx ... 1 yyy yyy ... yyy xxx ... xxx
xxx xxx ... 2 yyy yyy ... yyy xxx ... xxx
...
...
xxx xxx ... 514 yyy yyy ... yyy xxx ... xxx
xxx xxx ... 1312 yyy yyy ... yyy xxx ... xxx
...
...
xxx xxx ... 1500 yyy yyy ... yyy xxx ... xxx
Answer 1:
Read the shorter file into memory, into a dictionary keyed on the LOGRECNO value of each row:
import csv

with open('sample_state_census.csv', 'rb') as census_file:
    reader = csv.reader(census_file, delimiter='\t')
    census_header = next(reader, None)  # store header
    census = {row[9]: row for row in reader}
Then use this dictionary to match against the geo data and write out the matches:
with open('sample_state_geo.csv', 'rb') as geo_file:
    with open('outputfile.csv', 'wb') as outfile:
        reader = csv.reader(geo_file, delimiter='\t')
        geo_header = next(reader, None)  # grab header
        geo_header.pop(6)  # no need to list LOGRECNO header twice

        writer = csv.writer(outfile, delimiter='\t')
        writer.writerow(census_header + geo_header)

        for row in reader:
            if row[6] not in census:
                # no census data for this LOGRECNO entry
                continue
            # new row is all of the census data plus all of geo minus column 7
            newrow = census[row[6]] + row[:6] + row[7:]
            writer.writerow(newrow)
This all assumes the census file is not so big as to take up too much memory. If it is, you'll have to use a database instead (read all the census data into a SQLite database, then match against the geo data in the same vein).
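A minimal sketch of that fallback, assuming the same tab-delimited layout and column positions as above; the temporary database file name is only illustrative:

import csv
import sqlite3

# A file-backed database, so the census rows need not stay in memory
# ('census_temp.db' is just an illustrative name).
conn = sqlite3.connect('census_temp.db')
conn.text_factory = str  # keep stored rows as plain byte strings (Python 2)
conn.execute('CREATE TABLE census (logrecno TEXT PRIMARY KEY, row TEXT)')

# Load the census rows, keyed on LOGRECNO (the 10th column)
with open('sample_state_census.csv', 'rb') as census_file:
    reader = csv.reader(census_file, delimiter='\t')
    census_header = next(reader, None)
    conn.executemany('INSERT INTO census VALUES (?, ?)',
                     ((row[9], '\t'.join(row)) for row in reader))
    conn.commit()

# Stream the geo file; keep only rows whose LOGRECNO exists in the census table
with open('sample_state_geo.csv', 'rb') as geo_file:
    with open('outputfile.csv', 'wb') as outfile:
        reader = csv.reader(geo_file, delimiter='\t')
        geo_header = next(reader, None)
        geo_header.pop(6)

        writer = csv.writer(outfile, delimiter='\t')
        writer.writerow(census_header + geo_header)

        for row in reader:
            match = conn.execute('SELECT row FROM census WHERE logrecno = ?',
                                 (row[6],)).fetchone()
            if match is None:
                continue
            writer.writerow(match[0].split('\t') + row[:6] + row[7:])

conn.close()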
Answer 2:
For merging multiple files (even more than two) on one or more common columns, an efficient approach in Python is the "brewery" package. You can specify which fields should be considered for merging and which fields should be saved.
import brewery
from brewery import ds
import sys
sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]
# Create a list of all fields and add a filename field to store information
# about the origin of data records
all_fields = brewery.FieldList(["file"])

# Go through the source definitions and collect the fields
for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)

out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()
for source in sources:
    path = source["file"]

    # Initialize the data source; skip reading of headers.
    # Use XLSDataSource for XLS files.
    # We ignore the fields in the header, because we have set up the fields
    # previously; we only need to skip the header row.
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()

    for record in src.records():
        # Add a file reference to the output - to know where the row comes from
        record["file"] = path
        out.append(record)

    # Close the source stream
    src.finalize()
You can then pretty-print the merged output on the command line:

cat merged.csv | brewery pipe pretty_printer
Source: https://stackoverflow.com/questions/18689453/how-can-i-merge-two-csv-files-by-a-common-column-in-the-case-of-unequal-rows