java.lang.OutOfMemoryError while transforming XML in a huge directory

闹比i 2021-01-05 19:58

I want to transform XML files using XSLT 2.0 in a huge, deeply nested directory. There are more than 1 million files, each 4 to 10 kB. After a while I always rec

4 answers
  • 2021-01-05 20:28

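    Try this:

    // dir.list() returns bare file names, so resolve each one against dir
    String[] files = dir.list();
    if (files != null) { // null if dir is not a directory or cannot be read
        for (String fileName : files) {
            File file = new File(dir, fileName);
            if (file.isDirectory()) {
                pushDocuments(file); // recurse into the subdirectory
            } else {
                indexFiles.index(file);
            }
        }
    }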
    
  • 2021-01-05 20:31

    My usual recommendation with the Saxon s9api interface is to reuse the XsltExecutable object, but to create a new XsltTransformer for each transformation. The XsltTransformer caches documents you have read in case they are needed again, which is not what you want in this case.

    As an alternative, you could call xsltTransformer.getUnderlyingController().clearDocumentPool() after each transformation.

    (Please note, you can ask Saxon questions at saxonica.plan.io, which gives a good chance we [Saxonica] will notice them and answer them. You can also ask them here and tag them "saxon", which means we'll probably respond to the question at some point, though not always immediately. If you ask on StackOverflow with no product-specific tags, it's entirely hit-and-miss whether anyone will notice the question.)
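    The reuse pattern recommended above can be sketched with the JDK's built-in JAXP API, used here only so the example runs without extra libraries. This is an analogy, not Saxon code: `Templates` plays the role of the compiled `XsltExecutable`, and each `Transformer` plays the role of a per-document `XsltTransformer`. The class name, method names, and sample stylesheet are made up for illustration, and the JDK's built-in processor only supports XSLT 1.0.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ReuseCompiledStylesheet {
    // Hypothetical sample stylesheet: prints the text of /doc/title.
    static final String SAMPLE_XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\"><xsl:value-of select=\"/doc/title\"/></xsl:template>"
      + "</xsl:stylesheet>";

    // Compile the stylesheet once and keep the result for the whole run.
    static Templates compile(String xslt) throws TransformerException {
        return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
    }

    // Create a fresh Transformer for every document, so no per-transformation
    // state is carried over from one file to the next.
    static String transformOne(Templates templates, String xml) throws TransformerException {
        Transformer transformer = templates.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        Templates templates = compile(SAMPLE_XSLT);
        String[] documents = {"<doc><title>a</title></doc>", "<doc><title>b</title></doc>"};
        for (String xml : documents) {
            System.out.println(transformOne(templates, xml));
        }
    }
}
```

    With Saxon's s9api the shape would be the same: compile once into an `XsltExecutable`, then call `load()` inside the loop to get a new `XsltTransformer` for each file.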

  • 2021-01-05 20:41

    I had a similar problem. It came from the javax.xml.transform package, which used a ThreadLocalMap to cache the XML chunks that were read during XSLT. I had to move the XSLT into its own thread so that the ThreadLocalMap was cleared when that thread died, which freed the memory. See here: https://www.ahoi-it.de/ahoi/news/java-xslt-memory-leak/1446
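    A minimal sketch of that workaround, assuming the work can be wrapped in a `Runnable`: each unit of work runs in a short-lived thread, so anything stored in thread-locals goes away with the thread. The class name, `runIsolated`, and the `CACHE` thread-local are hypothetical stand-ins; `CACHE` only imitates a per-thread cache inside a library.

```java
class IsolatedThreadRunner {
    // Stand-in (assumption) for a per-thread cache kept by some library.
    static final ThreadLocal<byte[]> CACHE = new ThreadLocal<>();

    // Run the work in its own thread and wait for it to finish. Values the
    // work stores in thread-locals belong to the worker thread and become
    // unreachable once that thread has terminated.
    static void runIsolated(Runnable work) throws InterruptedException {
        final Throwable[] failure = new Throwable[1];
        Thread worker = new Thread(work, "xslt-worker");
        worker.setUncaughtExceptionHandler((thread, error) -> failure[0] = error);
        worker.start();
        worker.join();
        if (failure[0] != null) {
            throw new IllegalStateException("isolated work failed", failure[0]);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The worker fills its own copy of the thread-local.
        runIsolated(() -> CACHE.set(new byte[1024]));
        // The calling thread never sees that value.
        System.out.println("cache visible in caller: " + (CACHE.get() != null));
    }
}
```

    Starting one thread per file would be expensive with a million files; handing each worker thread a batch of files is probably a better trade-off.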

  • 2021-01-05 20:44

    I would check that you don't have a memory leak. The number of files shouldn't matter, because you are only processing one at a time: as long as you can process the largest file, you should be able to process them all.

    I suggest you run jstat -gc {pid} 10s while the program is running to look for memory leaks. What you should look for is the amount of memory still in use after each Full GC; if it keeps increasing, use the VisualVM memory profiler to work out why. Alternatively, jmap -histo:live {pid} | head -20 gives a hint.

    If the memory is not increasing, one particular file is triggering the out-of-memory error. Either a) that file is much bigger than the others or needs much more memory to process, or b) it triggers a bug in the library.
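    As a rough in-process complement to jstat, the program itself could log the used heap every so many files. This is a sketch under assumptions, not a replacement for the tools above: the class name, the logging interval, and the simulated loop are invented, and `System.gc()` is only a request that the JVM may ignore.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class HeapWatch {
    // Bytes of heap currently in use, as reported by the JVM.
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Ask for a collection first so the figure is closer to live data only.
    static void logHeap(long filesDone) {
        System.gc(); // a hint, not a guarantee
        long usedMb = usedHeapBytes() / (1024 * 1024);
        System.out.printf("files=%d usedHeapMB=%d%n", filesDone, usedMb);
    }

    public static void main(String[] args) {
        long filesDone = 0;
        for (int i = 0; i < 30_000; i++) {
            // ... transform one file here ...
            filesDone++;
            if (filesDone % 10_000 == 0) { // hypothetical interval
                logHeap(filesDone);
            }
        }
    }
}
```

    If the logged value climbs steadily across many files, that points to a leak. If it stays flat until the crash, logging the name of the file being processed would help identify the one that triggers the error.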
