Question
I have a Python script that needs to process a large number of files. To get around Linux's relatively small limit on the number of arguments that can be passed to a command, I am using find -print0 with xargs -0.
I know another option would be to use Python's glob module, but that won't help when I have a more advanced find command, looking for modification times, etc.
When running my script on a large number of files, Python only accepts a subset of the arguments, a limitation I first thought was in argparse, but which appears to be in sys.argv. I can't find any documentation on this. Is it a bug?
Here's a sample Python script illustrating the point:
import argparse
import sys
import os
parser = argparse.ArgumentParser()
parser.add_argument('input_files', nargs='+')
args = parser.parse_args(sys.argv[1:])
print 'pid:', os.getpid(), 'argv files', len(sys.argv[1:]), 'argparse files:', len(args.input_files)
I have a lot of files to run this on:
$ find ~/ -name "*" -print0 | xargs -0 ls > filelist
$ wc -l filelist
748709 filelist
But it appears xargs or Python is chunking my big list of files and processing it with several different Python runs:
$ find ~/ -name "*" -print0 | xargs -0 python test.py
pid: 4216 argv files 1819 number of files: 1819
pid: 4217 argv files 1845 number of files: 1845
pid: 4218 argv files 1845 number of files: 1845
pid: 4219 argv files 1845 number of files: 1845
pid: 4220 argv files 1845 number of files: 1845
pid: 4221 argv files 1845 number of files: 1845
...
Why are multiple processes being created to process the list? Why is it being chunked at all? I don't think there are newlines in the file names, and shouldn't -print0 and -0 take care of that issue anyway? If there were newlines, I'd expect sed -n '1810,1830p' filelist to show some weirdness for the above example. What gives?
I almost forgot:
$ python -V
Python 2.7.2+
Answer 1:
xargs will chunk your arguments by default. Have a look at the --max-args and --max-chars options of xargs; its man page also explains the limits (under --max-chars).
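For example, assuming GNU xargs, you can make the chunk size explicit: with --max-args (-n), each Python run receives at most that many file names (still subject to the overall size limit):
$ find ~/ -name "*" -print0 | xargs -0 --max-args=500 python test.py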
Answer 2:
Everything that you want from find is available from os.walk.
Don't use find and the shell for any of this. Use os.walk and write all your rules and filters in Python.
"Looking for modification times" means that you'll be using os.stat or some similar library function.
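A minimal sketch of that approach for the modification-time case the question mentions; the 7-day cutoff and the helper name are only illustrative, not part of the original answer:
import os
import time

def modified_within_days(path, days):
    # st_mtime is the file's modification time in seconds since the epoch
    return (time.time() - os.stat(path).st_mtime) < days * 86400

input_files = []
for dirpath, dirnames, filenames in os.walk(os.path.expanduser('~')):
    for name in filenames:
        full_path = os.path.join(dirpath, name)
        try:
            if modified_within_days(full_path, 7):
                input_files.append(full_path)
        except OSError:
            # e.g. broken symlinks; skip them
            pass

print('files found: %d' % len(input_files))
This keeps everything in one process, so the argument-list limit never comes into play.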
Answer 3:
Python does not seem to place a limit on the number of arguments, but the operating system does.
Have a look here for a more comprehensive discussion.
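On Linux the relevant limit is ARG_MAX, the cap on the combined size of the argument list and environment passed to exec. You can query it from Python on POSIX systems:
import os
# kernel's limit on argv + environment size, in bytes
print(os.sysconf('SC_ARG_MAX'))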
Answer 4:
xargs will pass as much as it can, but there's still a limit. For instance,
find ~/ -name "*" -print0 | xargs -0 wc -l | grep total
will give you multiple lines of output.
You probably want to have your script either take a file containing a list of filenames, or accept filenames on its stdin.
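A minimal sketch of the stdin approach, assuming the file names are fed NUL-separated so that find -print0 can be piped straight in (the script name is hypothetical):
# test_stdin.py -- usage: find ~/ -print0 | python test_stdin.py
import sys

# read the whole NUL-separated list and drop any empty trailing entry
data = sys.stdin.read()
input_files = [name for name in data.split('\0') if name]
print('files received: %d' % len(input_files))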
Source: https://stackoverflow.com/questions/9103023/is-python-sys-argv-limited-in-the-maximum-number-of-arguments