Add header to CSV without loading CSV

前端 未结 3 2003
北海茫月
北海茫月 2020-12-11 02:46

Is there a way to add a header row to a CSV without loading the CSV into memory in python? I have an 18GB CSV I want to add a header to, and all the methods I\'ve seen requi

相关标签:
3条回答
  • 2020-12-11 03:14

    Just use the fact that csv module iterates on the rows, so it never loads the whole file in memory

    import csv
    
    with open("huge_csv.csv") as fr, open("huge_output.csv","w",newline='') as fw:
        cr = csv.reader(fr)
        cw = csv.writer(fw)
        cw.writerow(["title1","title2","title3"])
        cw.writerows(cr)
    

    using writerows ensure a very good speed. The memory is spared here. Everything is done line-by-line. Since the data is properly processed, you could even change the separator and/or the quoting in the output file.

    0 讨论(0)
  • 2020-12-11 03:17

    Here is a comparison of the three suggested solutions for a ~200 MB CSV file with 10^6 rows and 10 columns (n=50). The ratio stays approximately the same for larger and smaller files (10 MB to 8 GB).

    cp:shutil:csv_reader 1:10:55

    i.e. using the builtin cp function is approximately 55 times faster than using Python's csv module.

    Computer:

    • regular HDD
    • Python 3.5.2 64-bit
    • Ubuntu 16.04
    • i7-3770


    import csv
    import random
    import shutil
    import time
    import subprocess
    
    rows = 1 * 10**3
    cols = 10
    repeats = 50
    
    shell_script = '/tmp/csv.sh'
    input_csv = '/tmp/temp.csv'
    output_csv = '/tmp/huge_output.csv'
    col_titles = ['titles_' + str(i) for i in range(cols)]
    
    with open(shell_script, 'w') as f:
        f.write("#!/bin/bash\necho '{0}' > {1}\ncat {2} >> {1}".format(','.join(col_titles), output_csv, input_csv))
    
    with open(shell_script, 'w') as f:
        f.write("echo '{0}' > {1}\ncat {2} >> {1}".format(','.join(col_titles), output_csv, input_csv))
    subprocess.call(['chmod', '+x', shell_script])
    
    run_times = dict([
        ('csv_writer', list()),
        ('external', list()),
        ('shutil', list())
    ])
    
    def random_csv():
        with open(input_csv, 'w') as csvfile:
            csv_writer = csv.writer(csvfile, delimiter=',')
            for i in range(rows):
                csv_writer.writerow([str(random.random()) for i in range(cols)])
        with open(output_csv, 'w'):
            pass
    
    for r in range(repeats):
        random_csv()
        #http://stackoverflow.com/a/41982368/2776376
        start_time = time.time()
        with open(input_csv) as fr, open(output_csv, "w", newline='') as fw:
            cr = csv.reader(fr)
            cw = csv.writer(fw)
            cw.writerow(col_titles)
            cw.writerows(cr)
        run_times['csv_writer'].append(time.time() - start_time)
    
        random_csv()
        #http://stackoverflow.com/a/41982383/2776376
        start_time = time.time()
        subprocess.call(['bash', shell_script])
        run_times['external'].append(time.time() - start_time)
    
        random_csv()
        #http://stackoverflow.com/a/41982383/2776376
        start_time = time.time()
        with open('header.txt', 'w') as header_file:
            header_file.write(','.join(col_titles))
    
        with open(output_csv, 'w') as new_file:
            with open('header.txt', 'r') as header_file, open(input_csv, 'r') as main_file:
                shutil.copyfileobj(header_file, new_file)
                shutil.copyfileobj(main_file, new_file)
        run_times['shutil'].append(time.time() - start_time)
    
        print('#'*20)
        for key in run_times:
            print('{0}: {1:.2f} seconds'.format(key, run_times[key][-1]))
    
    print('#'*20)
    print('Averages')
    for key in run_times:
        print('{0}: {1:.2f} seconds'.format(key, sum(run_times[key])/len(run_times[key])))
    

    If you really want to do it in Python, you could create the header file first and then merge it with your 2nd file via shutil.copyfileobj.

    import shutil
    with open('header.txt', 'w') as header_file:
        header_file.write('col1;col2;col3')
    
    with open('new_file.csv', 'w') as new_file:
        with open('header.txt', 'r') as header_file, open('main.csv', 'r') as main_file:
            shutil.copyfileobj(header_file, new_file)
            shutil.copyfileobj(main_file, new_file)
    
    0 讨论(0)
  • 2020-12-11 03:29

    You will need to rewrite the whole file. Simplest is not to use python

    echo 'col1, col2, col2,... ' > out.csv
    cat in.csv >> out.csv
    

    Python based solutions will work at much higher levels and will be a lot slower. 18GB is a lot of data after all. Better to work with operating system functionality, which will be the fastest.

    0 讨论(0)
提交回复
热议问题