How do I merge two CSV files based on field and keep same number of attributes on each record?

Asked by 感情败类, 2021-02-06 11:00

I am attempting to merge two CSV files based on a specific field in each file.

file1.csv

id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,Cucumber
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,Nope
5,True,4.0,Tuesday
6,False,1,Failure

file2.csv

id,attr4,attr5,attr6
2,python,500000.12,False
3,Another string,-5,False
5,program,3,True

3 Answers
  • 2021-02-06 11:30

    You can use pandas to do this:

    import pandas
    
    csv1 = pandas.read_csv('file1.csv')
    csv2 = pandas.read_csv('file2.csv')
    merged = csv1.merge(csv2, on='id')
    merged.to_csv("output.csv", index=False)
    

    I haven't tested this yet, but it should put you on the right track. The code is fairly self-explanatory: first you import the pandas library, then pandas.read_csv reads the two CSV files, and the merge method joins them. The on parameter specifies which column to use as the key. Finally, the merged data is written to output.csv.
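
    One thing to note: merge defaults to an inner join, so any id that appears in only one file is silently dropped. Since the goal is to keep every record with the same number of attributes, a small variation (untested, assuming the same file names as above) keeps every record and pads the attributes that are missing from one file:

    import pandas

    csv1 = pandas.read_csv('file1.csv')
    csv2 = pandas.read_csv('file2.csv')
    # how="outer" keeps ids present in only one file;
    # fillna("") pads the attributes that file didn't have
    merged = csv1.merge(csv2, on='id', how='outer').fillna('')
    merged.to_csv("output.csv", index=False)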

  • 2021-02-06 11:48

    If we're not using pandas, I'd refactor to something like

    import csv
    from collections import OrderedDict

    filenames = "file1.csv", "file2.csv"
    data = OrderedDict()
    fieldnames = []
    for filename in filenames:
        with open(filename, "r", newline="") as fp:
            reader = csv.DictReader(fp)
            fieldnames.extend(reader.fieldnames)
            for row in reader:
                # collect the rows from both files, keyed by id
                data.setdefault(row["id"], {}).update(row)

    # drop duplicate field names (id appears in both files) but keep their order
    fieldnames = list(OrderedDict.fromkeys(fieldnames))
    with open("merged.csv", "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(fieldnames)
        for row in data.values():
            writer.writerow([row.get(field, '') for field in fieldnames])
    

    which gives

    id,attr1,attr2,attr3,attr4,attr5,attr6
    1,True,7,Purple,,,
    2,False,19.8,Cucumber,python,500000.12,False
    3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
    4,True,2,Nope,,,
    5,True,4.0,Tuesday,program,3,True
    6,False,1,Failure,,,
    

    For comparison, the pandas equivalent would be something like

    import pandas as pd

    df1 = pd.read_csv("file1.csv")
    df2 = pd.read_csv("file2.csv")
    # outer join keeps every id; fillna("") pads the missing attributes
    merged = df1.merge(df2, on="id", how="outer").fillna("")
    merged.to_csv("merged.csv", index=False)
    

    which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.

  • 2021-02-06 11:51

    Use a dict of dicts, then update it. Like this:

    import csv
    from collections import OrderedDict

    with open('file2.csv', 'r', newline='') as f2:
        reader = csv.reader(f2)
        lines2 = list(reader)

    with open('file1.csv', 'r', newline='') as f1:
        reader = csv.reader(f1)
        lines1 = list(reader)

    # map each id to a dict of {attribute name: value} from its file
    dict1 = {row[0]: dict(zip(lines1[0][1:], row[1:])) for row in lines1[1:]}
    dict2 = {row[0]: dict(zip(lines2[0][1:], row[1:])) for row in lines2[1:]}

    # merge: start each record with "?" placeholders for every attribute,
    # then overwrite them with the values actually found in each file
    updatedDict = OrderedDict()
    mergedAttrs = OrderedDict.fromkeys(lines1[0][1:] + lines2[0][1:], "?")
    for id, attrs in dict1.items():
        d = mergedAttrs.copy()
        d.update(attrs)
        updatedDict[id] = d

    for id, attrs in dict2.items():
        updatedDict[id].update(attrs)

    # out
    with open('merged.csv', 'w', newline='') as f:
        w = csv.writer(f)
        for id, rest in sorted(updatedDict.items()):
            w.writerow([id] + list(rest.values()))
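
    Note that this writes merged.csv without a header row. If you also want one (like the output shown in the other answer), here is a minimal sketch of just the final section, reusing the same mergedAttrs and updatedDict defined above:

    with open('merged.csv', 'w', newline='') as f:
        w = csv.writer(f)
        # header row first, then the data rows
        w.writerow(["id"] + list(mergedAttrs.keys()))
        for id, rest in sorted(updatedDict.items()):
            w.writerow([id] + list(rest.values()))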
    