More efficient way to find & tar millions of files

前端 未结 9 600
一向
一向 2021-01-30 17:53

I\'ve got a job running on my server at the command line prompt for a two days now:

find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} ;
9条回答
  •  醉梦人生
    2021-01-30 18:59

    I was struggling with linux for a long time before I found a much easier and potentially faster solution using Python's tarfile library.

    1. Use glob.glob to search for the desired filepaths
    2. Create a new archive in append mode
    3. Add each filepath to this archive
    4. Close the archive

    Here is my code sample:

    import tarfile
    import glob
    from tqdm import tqdm
    
    filepaths = glob.glob("Images/7 *.jpeg")
    n = len(filepaths)
    print ("{} files found.".format(n))
    print ("Creating Archive...")
    out = tarfile.open("Images.tar.gz", mode = "a")
    for filepath in tqdm(filepaths, "Appending files to the archive..."):
      try:
        out.add(filepath)
      except:
        print ("Failed to add: {}".format(filepath))
    
    print ("Closing the archive...")
    out.close()
    

    This took a total of about 12 seconds to find 16222 filepaths and create the archive, however, this was predominantly taken up by simply searching for the filepaths. It took just 7 seconds to create the tar archive with 16000 filepaths. With some multithreading this could be much faster.

    If you're looking for a multithreaded implementation, I've made one and placed it here:

    import tarfile
    import glob
    from tqdm import tqdm
    import threading
    
    filepaths = glob.glob("Images/7 *.jpeg")
    n = len(filepaths)
    print ("{} files found.".format(n))
    print ("Creating Archive...")
    out = tarfile.open("Images.tar.gz", mode = "a")
    
    def add(filepath):
      try:
        out.add(filepath)
      except:
        print ("Failed to add: {}".format(filepath))
    
    def add_multiple(filepaths):
      for filepath in filepaths:
        add(filepath)
    
    max_threads = 16
    filepaths_per_thread = 16
    
    interval = max_threads * filepaths_per_thread
    
    for i in tqdm(range(0, n, interval), "Appending files to the archive..."):
      threads = [threading.Thread(target = add_multiple, args = (filepaths[j:j + filepaths_per_thread],)) for j in range(i, min([n, i + interval]), filepaths_per_thread)]
      for thread in threads:
        thread.start()
      for thread in threads:
        thread.join()
    
    print ("Closing the archive...")
    out.close()
    

    Of course, you need to make sure that the values of max_threads and filepaths_per_thread are optimized; it takes time to create threads, so the time may actually increase for certain values. A final thing to note is that since we are using append mode, we are automatically creating a new archive with the designated name if one does not already exist. However, if one does already exist, it will simply add to the preexisting archive, not reset it or make a new one.

提交回复
热议问题