Merge CSVs in Python with different columns

栀梦 2021-02-05 18:47

I have hundreds of large CSV files that I would like to merge into one. However, not all CSV files contain all columns. Therefore, I need to merge files based on column name, no

4 Answers
  • 2021-02-05 19:28

    The csv.DictReader and csv.DictWriter classes should work well (see Python docs). Something like this:

    import csv
    inputs = ["in1.csv", "in2.csv"]  # etc
    
    # First determine the field names from the top line of each input file
    # Comment 1 below
    fieldnames = []
    for filename in inputs:
      with open(filename, "r", newline="") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
        for h in headers:
          if h not in fieldnames:
            fieldnames.append(h)
    
    # Then copy the data
    with open("out.csv", "w", newline="") as f_out:   # Comment 2 below
      writer = csv.DictWriter(f_out, fieldnames=fieldnames)
      for filename in inputs:
        with open(filename, "r", newline="") as f_in:
          reader = csv.DictReader(f_in)  # Uses the field names in this file
          for line in reader:
            # Comment 3 below
            writer.writerow(line)
    

    Comments from above:

    1. You need to specify all the possible field names in advance to DictWriter, so you need to loop through all your CSV files twice: once to find all the headers, and once to read the data. There is no better solution, because all the headers need to be known before DictWriter can write the first line. This part would be more efficient using sets instead of lists (the in operator on a list is comparatively slow), but it won't make much difference for a few hundred headers. Sets would also lose the deterministic ordering of a list - your columns would come out in a different order each time you ran the code.
    2. The above code is for Python 3, where weird things happen in the CSV module without newline="". Remove this for Python 2.
    3. At this point, line is a dict with the field names as keys, and the column data as values. You can specify what to do with blank or unknown values in the DictReader and DictWriter constructors.
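As a rough illustration of comment 1, the header pass can keep first-seen order and still get fast membership checks by using dict keys instead of a list or a set (dicts preserve insertion order in Python 3.7+). This is only a sketch: `collect_fieldnames` is a hypothetical helper, and the in-memory streams stand in for real files such as in1.csv and in2.csv.

```python
import csv
import io


def collect_fieldnames(csv_streams):
    """Return the union of header names across streams, in first-seen order."""
    seen = {}  # dict keys: fast membership test, insertion order preserved
    for stream in csv_streams:
        headers = next(csv.reader(stream), [])  # an empty input contributes no headers
        for name in headers:
            seen.setdefault(name, None)
    return list(seen)


# Hypothetical in-memory inputs standing in for in1.csv and in2.csv
first = io.StringIO("id,name\n1,Ann\n")
second = io.StringIO("id,email\n2,bob@example.com\n")
fieldnames = collect_fieldnames([first, second])
print(fieldnames)
```

The resulting list can be passed as `fieldnames` to `csv.DictWriter` exactly as in the answer's code.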

    This method should not run out of memory, because it never has the whole file loaded at once.
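To make comment 3 more concrete, here is a minimal sketch of how `csv.DictWriter` treats rows whose keys do not match the full field list. `restval` supplies the value for missing columns and `extrasaction="ignore"` silently drops keys that are not in `fieldnames`. The field names, the "NA" placeholder and the sample rows are made up for illustration.

```python
import csv
import io

fieldnames = ["id", "name", "email"]  # assumed to be known from the header pass

out = io.StringIO()  # stands in for the real output file
writer = csv.DictWriter(out, fieldnames=fieldnames,
                        restval="NA",            # written for columns a row lacks
                        extrasaction="ignore")   # keys not in fieldnames are dropped
writer.writeheader()
writer.writerow({"id": "1", "name": "Ann"})                               # no email
writer.writerow({"id": "2", "email": "bob@example.com", "phone": "555"})  # extra key
merged_text = out.getvalue()
print(merged_text)
```

With the default `extrasaction="raise"`, the second row would raise a `ValueError` instead, which may be preferable if an unexpected column should be treated as an error.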

  • 2021-02-05 19:34

    For those of us using 2.7, this adds an extra linefeed between records in "out.csv". To resolve this, just change the file mode from "w" to "wb".
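If the same script has to run on both Python 2.7 and Python 3, one possible approach is to pick the file mode at run time. The helper name `open_csv_for_writing` and the temporary output path are hypothetical; this is a sketch rather than a tested Python 2 recipe.

```python
import csv
import os
import sys
import tempfile


def open_csv_for_writing(path):
    """Open path in the mode the csv module expects on the running Python."""
    if sys.version_info[0] < 3:
        return open(path, "wb")  # Python 2: binary mode, as suggested above
    return open(path, "w", newline="")  # Python 3: text mode, csv controls line endings


# Hypothetical output location used only for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open_csv_for_writing(demo_path) as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["id", "name"])
    writer.writerow(["1", "Ann"])
```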

  • 2021-02-05 19:38

    The solution by @Aaron Lockey, which is the accepted answer, worked well for me, except that the output file had no header row: it contained only the row data, so the columns had no headings (keys). I therefore inserted the following:

    writer.writeheader()
    

    and it worked perfectly fine for me! So now the entire code appears like this:

    import csv
    inputs = ["in1.csv", "in2.csv"]  # etc
    
    # First determine the field names from the top line of each input file
    # (see comment 1 in the accepted answer)
    fieldnames = []
    for filename in inputs:
      with open(filename, "r", newline="") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
        for h in headers:
          if h not in fieldnames:
            fieldnames.append(h)
    
    # Then copy the data
    with open("out.csv", "w", newline="") as f_out:   # see comment 2 in the accepted answer
      writer = csv.DictWriter(f_out, fieldnames=fieldnames)
      writer.writeheader()  # this is the addition
      for filename in inputs:
        with open(filename, "r", newline="") as f_in:
          reader = csv.DictReader(f_in)  # Uses the field names in this file
          for line in reader:
            # see comment 3 in the accepted answer
            writer.writerow(line)
    
  • 2021-02-05 19:46

    You can use the pandas module to do this pretty easily. This snippet assumes all your csv files are in the current folder.

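    import os
    import pandas as pd
    
    output_name = 'melted_csv.csv'
    
    # Collect every CSV in the current folder, skipping the output of a previous run
    all_csv = [file_name for file_name in os.listdir(os.getcwd())
               if file_name.endswith('.csv') and file_name != output_name]
    
    li = []
    
    for filename in all_csv:
        df = pd.read_csv(filename, index_col=None, header=0, parse_dates=True)
        li.append(df)
    
    # concat aligns on column names; columns missing from a file are filled with NaN
    frame = pd.concat(li, axis=0, ignore_index=True)
    frame.to_csv(output_name, index=False)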
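To show what the concatenation does when the files have different columns, here is a small sketch that uses two made-up in-memory CSVs instead of files on disk. It assumes pandas is installed; the sample data is purely illustrative.

```python
import io

import pandas as pd

# Hypothetical inputs with overlapping but different columns
first = pd.read_csv(io.StringIO("id,name\n1,Ann\n"))
second = pd.read_csv(io.StringIO("id,email\n2,bob@example.com\n"))

# Rows are stacked and columns are matched by name; gaps become NaN
frame = pd.concat([first, second], axis=0, ignore_index=True, sort=False)
columns = list(frame.columns)
print(frame)
```

Note that this approach holds every input in memory at once, so for hundreds of large files the streaming `csv` approach from the accepted answer may be the safer choice.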
    