I have to calculate the time complexity of an algorithm, but it calls os.walk, which I can't count as a single operation but as many.
The source of os.walk:
Well... let's walk through the source :)
Docs: http://docs.python.org/2/library/os.html#os.walk
def walk(top, topdown=True, onerror=None, followlinks=False):
    islink, join, isdir = path.islink, path.join, path.isdir
    try:
        # Note that listdir and error are globals in this module due
        # to earlier import-*.
        # Should be O(1) since it's probably just reading your filesystem journal
        names = listdir(top)
    except error, err:
        if onerror is not None:
            onerror(err)
        return

    dirs, nondirs = [], []
    # O(n) where n = number of files in the directory
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs

    # Again O(n), where n = number of directories in the directory
    for name in dirs:
        new_path = join(top, name)
        if followlinks or not islink(new_path):
            # Generator, so besides the recursive walk() call, no additional cost here.
            for x in walk(new_path, topdown, onerror, followlinks):
                yield x

    if not topdown:
        yield top, dirs, nondirs
Since it's a generator it all depends on how far you walk the tree, but it looks like O(n), where n is the total number of files/directories under the given path.
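A quick way to convince yourself of that O(n) claim is to build a small tree and confirm that os.walk lists each entry exactly once. This is a hypothetical demo (the layout and counts are made up for illustration), not part of the original answer:

```python
import os
import tempfile

# Build a tiny tree: two subdirectories, each holding two files.
root = tempfile.mkdtemp()
for d in ("a", "b"):
    os.mkdir(os.path.join(root, d))
    for f in ("x.txt", "y.txt"):
        open(os.path.join(root, d, f), "w").close()

# Each directory entry appears in exactly one yielded tuple,
# so the total work is linear in the number of entries.
entries = 0
for dirpath, dirnames, filenames in os.walk(root):
    entries += len(dirnames) + len(filenames)

print(entries)  # 6: two directories plus four files
```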
os.walk (unless you prune it, or have symlink issues) is guaranteed to list each directory in the subtree exactly once. So, if you assume that listing a directory is linear in the number of entries in the directory,* then if there are N total directory entries in your subtree, os.walk will take O(N) time.
Or, if you want the time for walk to produce each value (the (root, dirnames, filenames) tuple): if those N directory entries are split among M subdirectories, then each of the M iterations takes amortized O(N/M) time.
* Really, that's up to your OS, C library, and filesystem, not Python, and it can be much worse than O(N) for older filesystems… but let's ignore that.
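To make the N-versus-M distinction concrete, here's a small sketch (the tree shape and counts are invented for illustration): walk yields one tuple per directory, so the number of tuples is M, while the entries listed across all tuples sum to N:

```python
import os
import tempfile

# Three subdirectories with four files each.
root = tempfile.mkdtemp()
for i in range(3):
    sub = os.path.join(root, "d%d" % i)
    os.mkdir(sub)
    for j in range(4):
        open(os.path.join(sub, "f%d" % j), "w").close()

tuples = list(os.walk(root))
M = len(tuples)                                  # one tuple per directory walked
N = sum(len(d) + len(f) for _, d, f in tuples)   # total directory entries listed
print(M, N)  # 4 directories walked (root + 3 subdirs), 15 entries total
```

So the whole walk is O(N), and each of the M yielded tuples costs amortized O(N/M).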
This is too long for a comment: in CPython, a yield passes its result to the immediate caller, not directly to the ultimate consumer of the result. So, if you have recursion going R levels deep, a chain of yields at each level delivering a result back up the call stack to the ultimate consumer takes O(R) time. It also takes O(R) time to resume the R levels of recursive calls to get back to the lowest level where the first yield occurred.

So each result yielded by walk() takes time proportional to the level in the directory tree at which the result is first yielded.
That's the theoretical ;-) truth. In practice, however, this makes approximately no difference unless the recursion is very deep. That's because the chain of yields, and the chain of generator resumptions, occurs "at C speed". In other words, it does take O(R) time, but the constant factor is so small most programs never notice it.

This is especially true of recursive generators like walk(), which almost never recurse deeply. Who has a directory tree nested 100 levels? Nope, me neither ;-)
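The O(R) hand-off described above can be sketched with a toy recursive generator (hypothetical, not from walk() itself): a value produced at the deepest frame is re-yielded once by every enclosing frame on its way to the consumer.

```python
def nested(depth):
    """Recurse `depth` levels, then yield a single value from the bottom."""
    if depth == 0:
        yield "leaf"          # the first (and only) yield happens R levels deep
        return
    for x in nested(depth - 1):
        yield x               # one extra hop per level of recursion

# The single "leaf" value travels up through all 100 frames,
# so delivering it costs O(R) with R = 100 -- but "at C speed".
result = list(nested(100))
print(result)  # ['leaf']
```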