I've had a job running at the command line prompt on my server for two days now:

    find data/ -name 'filepattern-*2009*' -exec tar uf 2009.tar {} \;
I struggled with this on Linux for a long time before I found a much easier and potentially faster solution using Python's tarfile module.
Here is my code sample:
    import glob
    import tarfile

    from tqdm import tqdm

    filepaths = glob.glob("Images/7 *.jpeg")
    n = len(filepaths)
    print("{} files found.".format(n))

    print("Creating Archive...")
    # Append mode ("a") only works on uncompressed archives,
    # hence ".tar" rather than ".tar.gz".
    out = tarfile.open("Images.tar", mode="a")
    for filepath in tqdm(filepaths, "Appending files to the archive..."):
        try:
            out.add(filepath)
        except (OSError, tarfile.TarError):
            print("Failed to add: {}".format(filepath))

    print("Closing the archive...")
    out.close()
In total, this took about 12 seconds to find 16222 filepaths and create the archive, and a good part of that went to simply searching for the filepaths: creating the tar archive with about 16000 filepaths took just 7 seconds. Multithreading might make this faster still.
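To see where the time goes on your own data, the search step and the archive step can be timed separately. The following is only a sketch: `timed_archive` is a hypothetical helper name, and the demo builds a small throwaway set of files in a temporary directory instead of using the real Images folder, so its timings say nothing about the numbers quoted above.

```python
import glob
import os
import tarfile
import tempfile
import time


def timed_archive(pattern, archive_path):
    """Return (file_count, search_seconds, archive_seconds) for one run.

    Hypothetical helper: times glob.glob() and the tarfile writes separately.
    """
    start = time.perf_counter()
    filepaths = glob.glob(pattern)
    search_seconds = time.perf_counter() - start

    start = time.perf_counter()
    # Append mode requires an uncompressed archive.
    with tarfile.open(archive_path, mode="a") as out:
        for filepath in filepaths:
            out.add(filepath)
    archive_seconds = time.perf_counter() - start

    return len(filepaths), search_seconds, archive_seconds


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create 100 small dummy files that match the "7 *.jpeg" pattern.
        for i in range(100):
            with open(os.path.join(tmp, "7 {}.jpeg".format(i)), "wb") as f:
                f.write(b"x" * 1024)

        count, search, archive = timed_archive(
            os.path.join(tmp, "7 *.jpeg"),
            os.path.join(tmp, "demo.tar"),
        )
        print("{} files: search {:.3f}s, archive {:.3f}s".format(count, search, archive))
```

Pointing the pattern at a real directory shows whether searching or writing dominates there, which is worth knowing before reaching for threads.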
If you're looking for a multithreaded implementation, I've made one and placed it here:
    import glob
    import tarfile
    import threading

    from tqdm import tqdm

    filepaths = glob.glob("Images/7 *.jpeg")
    n = len(filepaths)
    print("{} files found.".format(n))

    print("Creating Archive...")
    # Append mode ("a") only works on uncompressed archives, hence ".tar".
    out = tarfile.open("Images.tar", mode="a")
    # A single TarFile is not safe for concurrent writes,
    # so the add() calls are serialized with a lock.
    lock = threading.Lock()

    def add(filepath):
        try:
            with lock:
                out.add(filepath)
        except (OSError, tarfile.TarError):
            print("Failed to add: {}".format(filepath))

    def add_multiple(filepaths):
        for filepath in filepaths:
            add(filepath)

    max_threads = 16
    filepaths_per_thread = 16
    interval = max_threads * filepaths_per_thread

    for i in tqdm(range(0, n, interval), "Appending files to the archive..."):
        # Start up to max_threads threads, each handling one slice of filepaths.
        threads = [
            threading.Thread(
                target=add_multiple,
                args=(filepaths[j:j + filepaths_per_thread],),
            )
            for j in range(i, min(n, i + interval), filepaths_per_thread)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

    print("Closing the archive...")
    out.close()
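Of course, you need to make sure that the values of max_threads and filepaths_per_thread are tuned: creating threads takes time, so for some values the total time may actually increase. Keep in mind, too, that a single TarFile object should only be written to by one thread at a time, which limits how much threading can help here. A final thing to note is that, since we are using append mode, a new archive with the designated name is created automatically if one does not already exist. If one does exist, the files are simply added to it; the archive is not reset or replaced. Append mode only works with uncompressed archives; the tarfile documentation notes that 'a:gz', 'a:bz2' and 'a:xz' are not possible.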
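The append-mode behaviour described above can be checked with a small experiment. This is a minimal sketch, assuming an uncompressed archive; `append_file` and `list_members` are hypothetical helper names, and everything happens in a temporary directory so no real archive is touched.

```python
import os
import tarfile
import tempfile


def append_file(archive_path, filepath):
    """Append one file to archive_path, creating the archive if it is missing."""
    with tarfile.open(archive_path, mode="a") as out:
        # Store only the file name, not the full temporary path.
        out.add(filepath, arcname=os.path.basename(filepath))


def list_members(archive_path):
    """Return the member names of the archive, in the order they were stored."""
    with tarfile.open(archive_path, mode="r") as archive:
        return archive.getnames()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = os.path.join(tmp, "demo.tar")
        for name in ("first.txt", "second.txt"):
            path = os.path.join(tmp, name)
            with open(path, "w") as f:
                f.write(name)
            # The first call creates demo.tar, the second one extends it.
            append_file(archive_path, path)
        print(list_members(archive_path))  # ['first.txt', 'second.txt']
```

One consequence worth knowing: append mode does not check for duplicates, so running the same script twice stores every file a second time instead of updating the existing entries.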