I am trying to process files one at a time that are stored over a network. Reading the files is fast thanks to buffering, so that is not the issue. The problem I have is just listing the directory contents.
If you're on Java 1.5 or 1.6, shelling out "dir" commands and parsing the standard output stream on Windows is a perfectly acceptable approach. I've used this approach in the past for processing network drives and it has generally been a lot faster than waiting for the native java.io.File listFiles() method to return.
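As a rough illustration, here is a minimal sketch of the shell-out approach. The /b (bare) switch and relying on the default console charset are assumptions; adjust both to your locale and to whether you want the full listing:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class DirLister {

    //Runs "cmd /c dir /b <directory>" and returns the raw entry names.
    //Works on Java 1.5+; charset and the /b switch are assumptions.
    public static List<String> list(String directory) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("cmd", "/c", "dir", "/b", directory);
        pb.redirectErrorStream(true);
        Process p = pb.start();

        List<String> names = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.length() > 0) names.add(line);
        }
        reader.close();
        p.waitFor();
        return names;
    }
}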
Of course, a JNI call should be faster and potentially safer than shelling out "dir" commands. The following JNI code can be used to retrieve a list of files/directories using the Windows API. This function can be easily refactored into a new class so the caller can retrieve file paths incrementally (i.e. get one path at a time). For example, you can refactor the code so that FindFirstFileW is called in a constructor and have a separate method to call FindNextFileW.
#include <windows.h>
#include <jni.h>
#include <string>
#include <sstream>
#include <cstring>
using namespace std;

JNIEXPORT jstring JNICALL Java_javaxt_io_File_GetFiles(JNIEnv *env, jclass, jstring directory)
{
    HANDLE hFind = INVALID_HANDLE_VALUE;
    try {

        //Convert jstring to wstring
        const jchar *_directory = env->GetStringChars(directory, 0);
        jsize x = env->GetStringLength(directory);
        wstring path; //L"C:\\temp\\*";
        path.assign(_directory, _directory + x);
        env->ReleaseStringChars(directory, _directory);

        if (x < 2){
            jclass exceptionClass = env->FindClass("java/lang/Exception");
            env->ThrowNew(exceptionClass, "Invalid path, less than 2 characters long.");
            return NULL; //ThrowNew does not interrupt native code, so return explicitly
        }

        wstringstream ss;
        BOOL bContinue = TRUE;
        WIN32_FIND_DATAW data;
        hFind = FindFirstFileW(path.c_str(), &data);
        if (INVALID_HANDLE_VALUE == hFind){
            jclass exceptionClass = env->FindClass("java/lang/Exception");
            env->ThrowNew(exceptionClass, "FindFirstFileW returned invalid handle.");
            return NULL;
        }

        //HANDLE hStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
        //DWORD dwBytesWritten;

        //If we have no error, loop through the files in this dir
        while (hFind && bContinue){

            /*
            //Debug print statement. DO NOT DELETE! cout and wcout do not print unicode correctly.
            WriteConsole(hStdOut, data.cFileName, (DWORD)_tcslen(data.cFileName), &dwBytesWritten, NULL);
            WriteConsole(hStdOut, L"\n", 1, &dwBytesWritten, NULL);
            */

            //Check if this entry is a directory
            if (data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY){

                //Make sure this dir is not . or ..
                if (wstring(data.cFileName) != L"." &&
                    wstring(data.cFileName) != L"..")
                {
                    ss << wstring(data.cFileName) << L"\\" << L"\n";
                }
            }
            else{
                ss << wstring(data.cFileName) << L"\n";
            }
            bContinue = FindNextFileW(hFind, &data);
        }
        FindClose(hFind); //Free the dir structure

        wstring cstr = ss.str();
        int len = cstr.size();
        //WriteConsole(hStdOut, cstr.c_str(), len, &dwBytesWritten, NULL);
        //WriteConsole(hStdOut, L"\n", 1, &dwBytesWritten, NULL);

        jchar* raw = new jchar[len];
        memcpy(raw, cstr.c_str(), len*sizeof(wchar_t));
        jstring result = env->NewString(raw, len);
        delete[] raw;
        return result;
    }
    catch(...){
        if (hFind != INVALID_HANDLE_VALUE) FindClose(hFind);
        jclass exceptionClass = env->FindClass("java/lang/Exception");
        env->ThrowNew(exceptionClass, "Exception occurred.");
    }
    return NULL;
}
Credit: https://sites.google.com/site/jozsefbekes/Home/windows-programming/miscellaneous-functions
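On the Java side, the returned string is just a newline-delimited list in which directory names end with a trailing backslash. Here is a minimal sketch of a caller; the library name and the wrapper class are assumptions, and note that the exported C symbol (Java_javaxt_io_File_GetFiles) encodes the Java package and class, so either keep the native method in a matching class or regenerate the header for your own class:

public class NativeFileList {

    static {
        //Assumption: the compiled JNI library is called "nativefilelist"
        System.loadLibrary("nativefilelist");
    }

    //Maps to the JNI function above. The exported symbol expects the method to
    //live in javaxt.io.File; regenerate the header or rename the export if you
    //declare it in this class instead.
    private static native String GetFiles(String directory);

    public static void main(String[] args) {
        //Pass a search pattern, e.g. "\\\\server\\share\\dir\\*"
        String list = GetFiles(args[0]);
        for (String entry : list.split("\n")) {
            if (entry.length() == 0) continue;
            boolean isDirectory = entry.endsWith("\\");
            System.out.println((isDirectory ? "[DIR]  " : "[FILE] ") + entry);
        }
    }
}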
Even with this approach, there are still efficiencies to be gained. If you wrap each returned path in a java.io.File, there is a huge performance hit - especially if the path represents a file on a network drive. I have no idea what Sun/Oracle is doing under the hood, but if you need file attributes beyond the path (e.g. size, modification date, etc.), I have found that the following JNI function is much faster than instantiating a java.io.File object for a network path.
JNIEXPORT jlongArray JNICALL Java_javaxt_io_File_GetFileAttributesEx(JNIEnv *env, jclass, jstring filename)
{
    //Convert jstring to wstring
    const jchar *_filename = env->GetStringChars(filename, 0);
    jsize len = env->GetStringLength(filename);
    wstring path;
    path.assign(_filename, _filename + len);
    env->ReleaseStringChars(filename, _filename);

    //Get attributes
    WIN32_FILE_ATTRIBUTE_DATA fileAttrs;
    BOOL result = GetFileAttributesExW(path.c_str(), GetFileExInfoStandard, &fileAttrs);
    if (!result) {
        jclass exceptionClass = env->FindClass("java/lang/Exception");
        env->ThrowNew(exceptionClass, "GetFileAttributesExW failed.");
        return NULL; //ThrowNew does not interrupt native code, so return explicitly
    }

    //Copy the WIN32_FILE_ATTRIBUTE_DATA into an array of longs
    //(date2int is a helper, defined elsewhere, that converts a FILETIME to a jlong)
    jlong buffer[6];
    buffer[0] = fileAttrs.dwFileAttributes;
    buffer[1] = date2int(fileAttrs.ftCreationTime);
    buffer[2] = date2int(fileAttrs.ftLastAccessTime);
    buffer[3] = date2int(fileAttrs.ftLastWriteTime);
    buffer[4] = fileAttrs.nFileSizeHigh;
    buffer[5] = fileAttrs.nFileSizeLow;

    jlongArray jLongArray = env->NewLongArray(6);
    env->SetLongArrayRegion(jLongArray, 0, 6, buffer);
    return jLongArray;
}
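On the Java side the six longs can be unpacked as shown below. The directory bit (FILE_ATTRIBUTE_DIRECTORY, 0x10) and the high/low size words come straight from the Windows API; the three timestamps are whatever date2int() produced, so they are left opaque here. The native declaration and class name are assumptions, as above:

public class FileAttributes {

    //Assumption: bound to the JNI function above (exported for javaxt.io.File);
    //regenerate the header or rename the export if it lives in a different class.
    private static native long[] GetFileAttributesEx(String filename);

    public static void describe(String path) {
        long[] attrs = GetFileAttributesEx(path);

        long dwFileAttributes = attrs[0];
        long creationTime     = attrs[1]; //as returned by date2int()
        long lastAccessTime   = attrs[2];
        long lastWriteTime    = attrs[3];

        //Reassemble the 64-bit size from nFileSizeHigh and nFileSizeLow
        long size = (attrs[4] << 32) | (attrs[5] & 0xFFFFFFFFL);

        //FILE_ATTRIBUTE_DIRECTORY is 0x10 in the Windows API
        boolean isDirectory = (dwFileAttributes & 0x10) != 0;

        System.out.println(path + (isDirectory ? " [DIR]" : "") + " size=" + size + " bytes");
    }
}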
You can find a full working example of this JNI-based approach in the javaxt-core library. In my tests using Java 1.6.0_38 with a Windows host hitting a Windows share, I have found this JNI approach approximately 10x faster than calling java.io.File listFiles() or shelling out "dir" commands.
Are you sure it's due to Java, not just a general problem with having 10k entries in one directory, particularly over the network?
Have you tried writing a proof-of-concept program to do the same thing in C using the win32 findfirst/findnext functions to see whether it's any faster?
I don't know the ins and outs of SMB, but I strongly suspect that it needs a round trip for every file in the list - which is not going to be fast, particularly over a network with moderate latency.
Having 10k strings in an array sounds like something which should not tax the modern Java VM too much either.
Using an Iterable doesn't imply that the Files will be streamed to you. In fact it's usually the opposite, so an array is typically faster than an Iterable.
An alternative is to have the files served over a different protocol. As I understand it, you're accessing them over SMB and Java is just trying to list the share as a regular directory.
The problem here might not be Java alone (how does it behave when you open that directory in Windows Explorer, e.g. x:\shared?). In my experience that also takes a considerable amount of time.
You could switch to a protocol like HTTP just to fetch the file names. That way you retrieve the list of files over HTTP (10k lines shouldn't be too much) and let the server deal with the file listing. This would be very fast, since it runs with local resources (those on the server).
Then, once you have the list, you can process the files one by one exactly the way you're doing right now.
The key point is to have a helper mechanism on the other end of the connection.
Is this feasible?
Today:
File [] content = new File("X:\\remote\\dir").listFiles();
for ( File f : content ) {
    process( f );
}
Proposed:
String [] content = fetchViaHttpTheListNameOf("x:\\remote\\dir");
for ( String fileName : content ) {
    process( new File( fileName ) );
}
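A sketch of what that helper might look like (fetchViaHttpTheListNameOf is a made-up name, and the listing URL and one-file-name-per-line response format are assumptions about how you'd set up the server side):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class RemoteListing {

    //Hypothetical helper: asks an HTTP endpoint on the file server for the
    //names in the given directory, returned one name per line.
    public static String[] fetchViaHttpTheListNameOf(String dir) throws Exception {
        URL url = new URL("http://fileserver:8080/list?dir=" + URLEncoder.encode(dir, "UTF-8"));
        List<String> names = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.length() > 0) names.add(line);
        }
        reader.close();
        return names.toArray(new String[names.size()]);
    }
}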
The HTTP server could be something very small and simple.
If this is the way you have it right now, you're fetching all the information about the 10k files to your client machine (I don't know how much of that info) when you only need the file names for later processing.
If the processing is very fast right now, it may slow down a bit, because the prefetched information is no longer available.
Give it a try.