Faster way to find large files with Python?

走了就别回头了 · 2021-01-16 20:20

I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 other directories, and to find files larger than, say, 200 GB.

2 Answers
  •  走了就别回头了
    2021-01-16 21:03

    It's not true that you cannot do better than os.walk().

    scandir is said to be 2 to 20 times faster.

    From https://pypi.python.org/pypi/scandir

    Python’s built-in os.walk() is significantly slower than it needs to be, because – in addition to calling listdir() on each directory – it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

    In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we’re not talking about micro-optimizations.
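    In other words, the directory listing already tells you each entry's type, and the DirEntry objects expose that cached information. A minimal sketch of the idea (the path is hypothetical):

    import os

    # DirEntry.is_dir() usually answers from data the OS already returned
    # with the directory listing itself, so no extra stat() system call
    # is needed for the common case.
    for entry in os.scandir("/path/to/dir"):
        if entry.is_dir(follow_symlinks=False):
            print("dir: ", entry.name)
        else:
            print("file:", entry.name)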

    From Python 3.5 on, thanks to PEP 471, scandir is built in, provided in the os module. Small (untested) example:

    import os

    max_value = 200 * 2**30  # size threshold in bytes (~200 GiB, as in the question)
    for dentry in os.scandir("/path/to/dir"):
        if dentry.stat().st_size > max_value:
            print("{} is biiiig".format(dentry.name))
    

    (Of course you need stat at some point, but with os.walk you were calling stat implicitly every time you used the function. Also, if you only care about files with specific extensions, you can call stat only when the extension matches, saving even more; see the sketch below.)
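    For instance, a minimal sketch of that extension trick (the extensions and path are made up for illustration):

    import os

    EXTENSIONS = (".mkv", ".iso")  # hypothetical extensions of interest
    max_value = 200 * 2**30        # ~200 GiB in bytes

    for dentry in os.scandir("/path/to/dir"):
        # str.endswith() accepts a tuple, so we only pay for the stat()
        # call when the name matches one of the extensions we care about.
        if dentry.name.endswith(EXTENSIONS) and dentry.stat().st_size > max_value:
            print("{} is biiiig".format(dentry.name))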

    And there's more to it:

    So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.

    So migrating to Python 3.5+ magically speeds up os.walk, without you having to rewrite your code.
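    Since the question is about a whole tree rather than a single directory, here's a minimal sketch of a recursive scan that benefits from the faster os.walk (the path and the 200 GiB threshold are assumptions taken from the question):

    import os

    threshold = 200 * 2**30  # ~200 GiB, as in the question

    # On Python 3.5+, os.walk() is implemented on top of scandir(), so
    # this tree walk gets the reduced number of system calls for free.
    for root, dirs, files in os.walk("/path/to/dir"):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getsize(path) > threshold:
                    print(path)
            except OSError:
                pass  # broken symlink or permission error; skip it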

    In my experience, multiplying the stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local-disk users.

    The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (using ssh for instance). It's less convenient, but it's worth it.
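    One way to do that from your own machine, sketched here with a hypothetical host name and script path, is to launch the scan over ssh and read the results back:

    import subprocess

    # Assumption: "fileserver" has the drive mounted locally and the scan
    # script from above has been copied there as /tmp/scan.py.
    result = subprocess.run(
        ["ssh", "fileserver", "python3", "/tmp/scan.py"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout, end="")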
