Python script to concatenate all the files in the directory into one file

甜味超标 2020-12-15 07:55

I have written the following script to concatenate all the files in the directory into a single file.
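
A minimal sketch of the kind of script in question, reconstructed from the answers below (the "*.txt" pattern and the output filename are assumptions):

    import glob

    read_files = glob.glob("*.txt")

    # the output name deliberately doesn't match the input pattern
    with open("result.dat", "wb") as outfile:
        for f in read_files:
            with open(f, "rb") as infile:
                # read each input file fully, then append it to the output
                outfile.write(infile.read())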

Can this be optimized, in terms of

  1. idiomatic Python

6 answers
  • 2020-12-15 08:07

    Using Python 2.7, I did some "benchmark" testing of

    outfile.write(infile.read())
    

    vs

    shutil.copyfileobj(readfile, outfile)
    

    I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of ~2.6 GB. In both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

    When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant:

    outfile.write, binary mode: 43 seconds, on average.
    
    shutil.copyfileobj, normal mode: 27 seconds, on average.
    

    The outfile had a final size of 2620 MB in normal read mode vs 2578 MB in binary read mode.
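
    A sketch of the kind of comparison described above (the file list, the output names, and the timing code are assumptions, not the exact benchmark):

    import glob
    import shutil
    import time

    files = glob.glob('*.txt')  # the input files to concatenate

    # Variant 1: read each file fully, then write it out
    start = time.time()
    with open('out_write.dat', 'w') as outfile:
        for name in files:
            with open(name, 'r') as infile:
                outfile.write(infile.read())
    print('outfile.write:      %.1f s' % (time.time() - start))

    # Variant 2: let shutil copy in chunks
    start = time.time()
    with open('out_shutil.dat', 'w') as outfile:
        for name in files:
            with open(name, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)
    print('shutil.copyfileobj: %.1f s' % (time.time() - start))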

  • 2020-12-15 08:11

    I was curious to dig further into performance, so I built on the answers from Martijn Pieters and Stephen Miller.

    I tried both binary and text modes, with and without shutil, merging 270 files.

    Text mode -

    import glob
    import shutil

    def using_shutil_text(outfilename):
        with open(outfilename, 'w') as outfile:
            for filename in glob.glob('*.txt'):
                if filename == outfilename:
                    # don't want to copy the output into the output
                    continue
                with open(filename, 'r') as readfile:
                    shutil.copyfileobj(readfile, outfile)
    
    def without_shutil_text(outfilename):
        with open(outfilename, 'w') as outfile:
            for filename in glob.glob('*.txt'):
                if filename == outfilename:
                    # don't want to copy the output into the output
                    continue
                with open(filename, 'r') as readfile:
                    outfile.write(readfile.read())
    

    Binary mode -

    def using_shutil_binary(outfilename):
        with open(outfilename, 'wb') as outfile:
            for filename in glob.glob('*.txt'):
                if filename == outfilename:
                    # don't want to copy the output into the output
                    continue
                with open(filename, 'rb') as readfile:
                    shutil.copyfileobj(readfile, outfile)
    
    def without_shutil_binary(outfilename):
        with open(outfilename, 'wb') as outfile:
            for filename in glob.glob('*.txt'):
                if filename == outfilename:
                    # don't want to copy the output into the output
                    continue
                with open(filename, 'rb') as readfile:
                    outfile.write(readfile.read())
    

    Running times for binary mode (seconds) -

    Shutil - 20.161773920059204
    Normal - 17.327500820159912
    

    Running times for text mode (seconds) -

    Shutil - 20.47757601737976
    Normal - 13.718038082122803
    

    It looks like shutil performs about the same in both modes, while the plain write approach is faster, especially in text mode.

    OS: macOS 10.14 Mojave, MacBook Air (2017).
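
    For reference, the numbers above could have been produced by a driver along these lines (the output filename and the use of time.time are assumptions, not part of the original answer):

    import time

    for func in (using_shutil_text, without_shutil_text,
                 using_shutil_binary, without_shutil_binary):
        start = time.time()
        func('merged_output.dat')  # hypothetical output filename
        print('%s: %.2f s' % (func.__name__, time.time() - start))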

  • 2020-12-15 08:17

    You can iterate over the lines of a file object directly, without reading the whole thing into memory:

    # 'outfile' is assumed to be the already-open destination file from the surrounding script
    with open(fname, 'r') as readfile:
        for line in readfile:
            outfile.write(line)
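
    In context, with the surrounding loop and output file, the same idea might look like this (the file pattern and output name are assumptions):

    import glob

    with open('combined.dat', 'w') as outfile:
        for fname in glob.glob('*.txt'):
            with open(fname, 'r') as readfile:
                # stream line by line; only one line is held in memory at a time
                for line in readfile:
                    outfile.write(line)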
    
  • 2020-12-15 08:18

    No need to use that many variables.

    with open(outfilename, 'w') as outfile:
        for fname in filenames:
            with open(fname, 'r') as readfile:
                outfile.write(readfile.read() + "\n\n")
    
  • 2020-12-15 08:26

    The fileinput module provides a natural way to iterate over multiple files:

    for line in fileinput.input(glob.glob("*.txt")):
        outfile.write(line)
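
    A self-contained version of the same idea (the output name is an assumption, chosen so it never matches the input pattern and feeds back into itself):

    import fileinput
    import glob

    with open('combined.dat', 'w') as outfile:
        # fileinput chains all matching files into a single line-by-line stream
        for line in fileinput.input(glob.glob('*.txt')):
            outfile.write(line)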
    
  • 2020-12-15 08:27

    Use shutil.copyfileobj to copy data:

    import shutil
    
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)
    

    shutil reads from the readfile object in chunks, writing them to the outfile file object directly. Do not use readline() or an iteration buffer, since you do not need the overhead of finding line endings.

    Use the same mode for both reading and writing; this is especially important when using Python 3; I've used binary mode for both here.
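
    If you want control over the chunk size, copyfileobj also takes an optional buffer-length argument; the 1 MiB value here is just an illustration, not part of the original answer:

    # copy in 1 MiB chunks instead of the default buffer size
    shutil.copyfileobj(readfile, outfile, 1024 * 1024)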
