Choosing a random file from a directory (with a large number of files) in Python

南笙 2021-02-07 20:58

I have a directory with a large number of files (~1mil). I need to choose a random file from this directory. Since there are so many files, os.listdir naturally takes a long time, so I'd like a way to pick a random file without listing the entire directory first.

5 Answers
  • 2021-02-07 21:22

    You may be able to get this running:

    http://mail.python.org/pipermail/python-list/2009-July/1213182.html

    And that's probably the best possible solution to your problem, but only while n is small - once n gets large, os.listdir is probably just as good for your purpose.

    I've hunted around and failed to find any other way to open a file in a directory. If I had more time I'd be inclined to play around a bit and generate my own ~1mil files.


    I just thought of another way to do this: assuming the set of files is constant - you're not adding or removing any - you could keep a list of the filenames in a SQLite database. Then it is relatively simple to query the database for a name by a random ROWID. I don't know if you'll still be plagued by the long time it takes to look up the chosen file, but at least getting a filename should take only a short time (a sketch follows below).

    Of course if the files in the directory are randomly named, you can rename the files(?) and put them into a directory structure like AdamK suggests.
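
    For illustration, here is a minimal sketch of that SQLite cache idea (the database path and table name are just placeholders, not part of anyone's existing setup):

    import os
    import random
    import sqlite3

    DB_PATH = "file_index.sqlite"   # hypothetical cache location

    def build_cache(directory, db_path=DB_PATH):
        """One-off (or periodic) step: pay the listing cost once and store every name."""
        conn = sqlite3.connect(db_path)
        conn.execute("DROP TABLE IF EXISTS files")
        conn.execute("CREATE TABLE files (name TEXT)")
        conn.executemany("INSERT INTO files (name) VALUES (?)",
                         ((name,) for name in os.listdir(directory)))
        conn.commit()
        conn.close()

    def random_filename(db_path=DB_PATH):
        """Pick one name by a random ROWID (contiguous 1..count after a fresh build)."""
        conn = sqlite3.connect(db_path)
        (count,) = conn.execute("SELECT COUNT(*) FROM files").fetchone()
        # Assumes a non-empty cache; ROWIDs have no gaps right after build_cache().
        (name,) = conn.execute("SELECT name FROM files WHERE rowid = ?",
                               (random.randint(1, count),)).fetchone()
        conn.close()
        return name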

  • 2021-02-07 21:28

    Try this (it's very fast here with 50K files...):

    import glob
    import random

    # Collect candidates one directory level down, then pick one at random.
    files = glob.glob("*/*.*")
    print(random.choice(files))
    
  • 2021-02-07 21:33

    I have a similar need to the OP.

    I think I will adopt a precaching approach: store the list of all the files in a .txt file, then do a clever seek to a random line in that listing (without even having to load it into memory), and you're done!

    Of course, you still have to update the cache - and, more importantly, decide when it needs updating - but depending on your needs that may be easy (right after a specific action, whenever something changes, etc.).

    Code for cleverly reading a random line from a large file in Python, by Jonathan Kupferman:

    http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html
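
    In case that link goes away, here is a minimal sketch of the seek-based idea (my own rough take, not Kupferman's exact code): jump to a random byte offset in the cached listing and take the next complete line.

    import os
    import random

    def random_line_by_seek(path):
        """Seek to a random byte offset in a non-empty listing file and return the
        next complete line. Lines that follow long lines are slightly more likely
        to be picked, which is usually fine when the filenames have similar lengths."""
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            f.seek(random.randrange(size))
            f.readline()                  # discard the (probably partial) current line
            line = f.readline()
            if not line:                  # ran off the end of the file: wrap to the first line
                f.seek(0)
                line = f.readline()
        return line.decode().rstrip("\r\n")

    # Example with a hypothetical cache file:
    # print(random_line_by_seek("file_cache.txt"))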

  • 2021-02-07 21:38

    Alas, I don't think there is a solution to your problem. One, I don't know of a portable API that will return the number of entries in a directory (without enumerating them first). Two, I don't think there is an API that returns a directory entry by index rather than by name.

    So overall, a program will have to enumerate O(n) directory entries to get a single random one. The trivial approach - determine the number of entries, then pick one - either requires enough RAM to hold the full listing (os.listdir()) or has to enumerate the directory a second time to reach the random(n)-th item, which is n + n/2 operations on average.

    There is a slightly better approach - but only slightly - see randomly-selecting-lines-from-files. In short, there is a way to pick a random item from a list/iterator of unknown length while reading one item at a time, and still guarantee that every item can be picked with equal probability (a sketch follows below). But this won't help with os.listdir(), because it returns an in-memory list that already contains all 1M+ entries, so you might as well just ask it for its len()...
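
    For illustration, a minimal sketch of that single-pass selection (reservoir sampling with k = 1); os.scandir, which yields directory entries one at a time instead of building a list, stands in for the unknown-length iterator here:

    import os
    import random

    def random_entry(directory):
        """Single pass, O(1) memory: after seeing i entries, the running choice is
        any one of them with probability 1/i, so every entry is equally likely."""
        choice = None
        for i, entry in enumerate(os.scandir(directory), start=1):
            if random.randrange(i) == 0:   # replace the running choice with probability 1/i
                choice = entry.name
        return choice

    This still touches every entry, so it is O(n) time either way - it just avoids holding the whole listing in memory.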

  • 2021-02-07 21:40

    I'm not sure this is even possible. Even at the VFS or filesystem level, there is no guarantee that a directory entry count is even maintained. For instance many filesystems simply record the combined byte size of the directory entry structures contained in a given directory.

    An estimate could be made if directory entries were fixed-size structures, but that is uncommon now (consider LFN on FAT32). Even if a given filesystem did provide an entry count without iterating through the directory, or the VFS cached a record of a directory's length, that behavior would be strictly operating system, filesystem, and kernel specific.
