I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 subdirectories, and to find files larger than, say, 200 GB.
It is hard to imagine that you will find a significantly faster way to traverse a directory than os.walk() and du. Parallelizing the search might help a bit in some setups (e.g. SSD), but it won't make a dramatic difference.
A simple way to make things feel faster is to run the script automatically in the background every hour or so, and have your actual script just pick up the results. This won't help if the results need to be current, but it might work for many monitoring setups.
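As a minimal sketch of that idea (the paths and file names here are placeholders, not anything prescribed by the question): a scanner run from cron dumps its findings to a file, and the interactive script only has to read that file.

import json
import os

ROOT = "/path/to/dir"            # tree to scan (placeholder)
THRESHOLD = 200 * 1024**3        # ~200 GB, per the question
RESULTS = "/tmp/big_files.json"  # hypothetical hand-off file

big_files = []
for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            if os.path.getsize(path) > THRESHOLD:
                big_files.append(path)
        except OSError:
            pass  # file vanished or is unreadable; skip it

with open(RESULTS, "w") as f:
    json.dump(big_files, f)

Save it as, say, scan_big_files.py, schedule it with cron (e.g. 0 * * * *), and the monitoring script just loads /tmp/big_files.json.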
It's not true that you cannot do better than os.walk(): scandir is said to be 2 to 20 times faster.
From https://pypi.python.org/pypi/scandir
Python’s built-in os.walk() is significantly slower than it needs to be, because – in addition to calling listdir() on each directory – it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.
In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we’re not talking about micro-optimizations.
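To make that 2N-versus-N point concrete, here is a sketch (my own, not code from the scandir package) of a recursive walk that never calls stat() just to classify an entry; DirEntry.is_dir() answers from the type information the directory-listing call already returned:

import os

def walk_no_stat(root):
    # DirEntry.is_dir() reuses the d_type / FindNextFile information the
    # OS already returned, so usually no extra stat() call is made per entry
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_no_stat(entry.path)
        else:
            yield entry.path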
From Python 3.5, thanks to PEP 471, scandir is built in, provided in the os module. Small (untested) example:
import os

max_value = 200 * 1024**3  # ~200 GB threshold, per the question

for dentry in os.scandir("/path/to/dir"):
    if dentry.stat().st_size > max_value:
        print("{} is biiiig".format(dentry.name))
(Of course you need stat at some point, but with os.walk you were calling stat implicitly whenever you used the function. Also, if you only care about files with specific extensions, you could perform stat only when the extension matches, saving even more; see the sketch below.)
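A sketch of that extension trick (the extensions here are just examples):

import os

max_value = 200 * 1024**3  # ~200 GB threshold, per the question

for dentry in os.scandir("/path/to/dir"):
    # stat() only the entries whose name already looks interesting
    if dentry.name.endswith((".iso", ".mkv")) and dentry.stat().st_size > max_value:
        print("{} is biiiig".format(dentry.name))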
And there's more to it:
So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.
So migrating to Python 3.5+ magically speeds up os.walk without you having to rewrite your code.
In my experience, multiplying the stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local-disk users do.
The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (over ssh, for instance). It's less convenient, but it's worth it.