Quicker (quickest?) way to get number of files in a directory with over 200,000 files

Happy的楠姐 2021-02-04 07:35

I have some directories containing test data, typically over 200,000 small (~4k) files per directory.

I am using the following C# code to get the number of files in a di

10 answers
  • 2021-02-04 07:52

    Create an index every day at midnight. Finding a file will then be very fast, and counting the files becomes trivial.

    If I understand correctly, you have one directory per day. If all the files you receive today go into today's directory, the scheme gets even simpler: at midnight, just index the previous day's directory, which will no longer change.
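
    A minimal sketch of that idea, assuming the index is simply a text file with one file name per line. The class name `DirectoryIndex`, its methods, and the index location are hypothetical; the scheduling (for example, a midnight task) is left to whatever job runner you already use.

    ```csharp
    using System;
    using System.IO;
    using System.Linq;

    public static class DirectoryIndex
    {
        // Writes one file name per line to indexPath. Meant to be run once per
        // directory, after that directory has stopped receiving new files.
        // Keep indexPath outside `directory`, or the index would list itself.
        public static void Build(string directory, string indexPath)
        {
            File.WriteAllLines(
                indexPath,
                Directory.EnumerateFiles(directory).Select(f => Path.GetFileName(f)));
        }

        // The file count is the number of lines in the index.
        public static int Count(string indexPath)
        {
            return File.ReadLines(indexPath).Count();
        }
    }
    ```

    Building the index still costs one full pass over the directory, but you pay it once per day rather than on every count.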

  • 2021-02-04 07:54

    If I were using a slowish high-level language and portability weren't a big concern, I'd be tempted to call an external program (e.g. `ls | wc -l`.to_i in Ruby on Unix), but I'd then benchmark it to check whether it actually does the job any faster.

  • 2021-02-04 08:04

    Not with the System.IO.Directory class, there isn't. You'll have to find a way of querying the directory that doesn't involve creating a massive list of files.

    This seems like a bit of an oversight from Microsoft; the Win32 API has always had functions (FindFirstFile and FindNextFile) that walk a directory one entry at a time, so files can be counted without building a list.

    You may also want to consider splitting up your directory. How you manage a 200,000-file directory is beyond me :-)

    Update:

    John Saunders raises a good point in the comments. We already know that (general purpose) file systems are not well equipped to handle this level of storage. One thing that is equipped to handle huge numbers of small "files" is a database.

    If you can identify a key for each file (containing, for example, date, hour and customer number), these files should be loaded into a database. The 4K record size and 108 million rows (200,000 rows/day * 30 days/month * 18 months) should be easily handled by most professional databases. I know that DB2/z would eat that for breakfast.

    Then, when you need some test data extracted to files, you have a script/program which just extracts the relevant records onto the file system. Then run your tests to successful completion and delete the files.

    That should make your specific problem quite easy to solve:

    select count(*) from test_files where directory_name = '/SomeDirectory'
    

    assuming you have an index on directory_name, of course.

  • 2021-02-04 08:04

    FYI, .NET 4 includes a new method, Directory.EnumerateFiles, that does exactly what you need and is awesome. Chances are you're not using .NET 4, but it's worth remembering anyway!

    Edit: I now realise that the OP wanted the NUMBER of files. However, this method is so useful I'm keeping this post here.
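
    For what it's worth, the method can still be used for counting. A minimal sketch, assuming .NET 4 or later; the class name `FileCounter` and the default path are made up for illustration:

    ```csharp
    using System;
    using System.IO;
    using System.Linq;

    public static class FileCounter
    {
        // Counts the files directly inside `path` without building the
        // string[] that Directory.GetFiles(path) would allocate.
        public static int CountFiles(string path)
        {
            // EnumerateFiles yields names lazily; Count() walks them once.
            return Directory.EnumerateFiles(path).Count();
        }

        public static void Main(string[] args)
        {
            // Hypothetical default; pass a real directory as the first argument.
            string path = args.Length > 0 ? args[0] : ".";
            Console.WriteLine(CountFiles(path));
        }
    }
    ```

    This still visits every directory entry, so it mainly saves the memory and allocation cost of a 200,000-element array. Whether it is noticeably faster is something to measure on your own data.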

  • 2021-02-04 08:05

    The code you've got is slow because it first gets an array of all the available files, then takes the length of that array.

    However, you're almost certainly not going to find any solutions that work much faster than that.

    Why?

    Access controls.

    Each file in a directory may have an access control list - which may prevent you from seeing the file at all.

    The operating system itself can't just say "hey, there are 100 file entries here" because some of them may represent files you're not allowed to know exist - they shouldn't be shown to you at all. So the OS itself has to iterate over the files, checking access permissions file by file.

    For a discussion that goes into more detail around this kind of thing, see two posts from The Old New Thing:

    • Why doesn't the file system have a function that tells you the number of files in a directory?
    • Why doesn't Explorer show recursive directory size as an optional column?

    [As an aside, if you want to improve the performance of a directory containing a lot of files, limit yourself strictly to 8.3 filenames. No, I'm not kidding: it's faster, because the OS doesn't have to generate an 8.3 filename itself, and because the algorithm it uses to do so is braindead. Try a benchmark and you'll see.]

  • 2021-02-04 08:06

    You could use System.Management and the WMI class "cim_datafile": just run the following query in WMI. You can also use LINQ to WMI, but I haven't tried it.

    select * from cim_datafile where drive='c:' and path='\\SomeDirectory\\' 
    

    I guess it will be faster, but I haven't measured it.
