More efficient way to find & tar millions of files

前端未结

关注

 9  600

一向 2021-01-30 17:53

I\'ve got a job running on my server at the command line prompt for a two days now:

find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} ;

9条回答

醉梦人生 (楼主)

2021-01-30 18:59

I was struggling with linux for a long time before I found a much easier and potentially faster solution using Python's tarfile library.

Use glob.glob to search for the desired filepaths
Create a new archive in append mode
Add each filepath to this archive
Close the archive

Here is my code sample:

import tarfile
import glob
from tqdm import tqdm

filepaths = glob.glob("Images/7 *.jpeg")
n = len(filepaths)
print ("{} files found.".format(n))
print ("Creating Archive...")
out = tarfile.open("Images.tar.gz", mode = "a")
for filepath in tqdm(filepaths, "Appending files to the archive..."):
  try:
    out.add(filepath)
  except:
    print ("Failed to add: {}".format(filepath))

print ("Closing the archive...")
out.close()

This took a total of about 12 seconds to find 16222 filepaths and create the archive, however, this was predominantly taken up by simply searching for the filepaths. It took just 7 seconds to create the tar archive with 16000 filepaths. With some multithreading this could be much faster.

If you're looking for a multithreaded implementation, I've made one and placed it here:

import tarfile
import glob
from tqdm import tqdm
import threading

filepaths = glob.glob("Images/7 *.jpeg")
n = len(filepaths)
print ("{} files found.".format(n))
print ("Creating Archive...")
out = tarfile.open("Images.tar.gz", mode = "a")

def add(filepath):
  try:
    out.add(filepath)
  except:
    print ("Failed to add: {}".format(filepath))

def add_multiple(filepaths):
  for filepath in filepaths:
    add(filepath)

max_threads = 16
filepaths_per_thread = 16

interval = max_threads * filepaths_per_thread

for i in tqdm(range(0, n, interval), "Appending files to the archive..."):
  threads = [threading.Thread(target = add_multiple, args = (filepaths[j:j + filepaths_per_thread],)) for j in range(i, min([n, i + interval]), filepaths_per_thread)]
  for thread in threads:
    thread.start()
  for thread in threads:
    thread.join()

print ("Closing the archive...")
out.close()

Of course, you need to make sure that the values of max_threads and filepaths_per_thread are optimized; it takes time to create threads, so the time may actually increase for certain values. A final thing to note is that since we are using append mode, we are automatically creating a new archive with the designated name if one does not already exist. However, if one does already exist, it will simply add to the preexisting archive, not reset it or make a new one.

0 讨论(0)

查看其它9个回答