Limitation to Python's glob?


Question


I'm using glob to feed file names to a loop like so:

import glob

inputcsvfiles = glob.iglob('NCCCSM*.csv')

for x in inputcsvfiles:
    csvfilename = x
    # ... do stuff here ...

The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.

Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959), is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?

PS: initially I was using glob.glob, but glob.iglob fares no better.

Edit:

An expansion of above for more context...

    import glob
    import arcpy

    # typical input file looks like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
    inputcsvfiles = glob.iglob('NCCCSM*.csv')

    # loop over individual input files
    for x in inputcsvfiles:
        csvfile = x
        modelname = x[0:5]

        # ArcPy join (inputshape is defined earlier in the full script)
        arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")

        # ... do more stuff after ...

The script fails at the ArcPy line, where the csvfile variable gets passed into the command. The error reported is that it can't find a specified csv file (e.g., "NCCSM20110101.csv"), when in fact the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times, as I have above? Again, this works fine if the directory being glob'd only has 100 or so files, but with a whole lot (e.g., 10,959) it fails seemingly arbitrarily somewhere down the list.


Answer 1:


One issue that arose was not with Python per se, but rather with ArcPy and/or Microsoft's handling of CSV files (more the latter, I think). As the loop iterates, a schema.ini file is created, to which information on each CSV file processed in the loop gets appended. Over time, schema.ini grows quite large, and I believe that's when the performance issues arise.

My solution, although perhaps inelegant, was to delete the schema.ini file during each loop to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and BASH scripting in the end.
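For reference, a minimal sketch of that workaround, assuming schema.ini sits in the same directory as the CSV files (inputshape is the layer from the question; this is untested against ArcGIS, so treat it as an illustration rather than the author's exact code):

import glob
import os

import arcpy

for csvfile in glob.iglob('NCCCSM*.csv'):
    # The text driver appends an entry for every CSV it touches to schema.ini;
    # removing the file on each pass keeps it from growing without bound.
    schema = os.path.join(os.path.dirname(os.path.abspath(csvfile)), 'schema.ini')
    if os.path.exists(schema):
        os.remove(schema)

    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # ... more processing here ...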




Answer 2:


If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.

There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
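If you want to check that limit from inside the script rather than the shell, the resource module exposes it on Unix-like systems (it is not available on Windows, where ArcPy usually runs, so treat this as a POSIX-only sketch):

import resource

# The soft limit is what the process is actually held to; the hard limit is
# the ceiling the soft limit could be raised to without extra privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))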




Answer 3:


Try doing an ls * in a shell on those 10,000 entries and the shell may fail too (argument lists have a length limit). How about walking the directory and yielding those files one by one instead?

# credit - @dabeaz - generators tutorial

import os
import fnmatch

def gen_find(filepat, top):
    # walk the tree rooted at `top`, yielding matching paths one at a time
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Example use

if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print(name)
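Because gen_find is a generator, filenames are produced one at a time as os.walk visits each directory, so memory use stays flat no matter how many CSVs there are. Note that, unlike the original flat glob pattern, it also recurses into subdirectories of ".".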


Source: https://stackoverflow.com/questions/11675301/limitation-to-pythons-glob
