How to merge >1000 xml files into one using Java

予麋鹿 2021-02-06 06:36

I am trying to merge many xml files into one. I have successfully done that in DOM, but this solution is limited to a few files. When I run it on multiple files >1000 I am getti

6 answers
  • 2021-02-06 06:57

    Just do it without any XML parsing, since the task doesn't seem to require actually parsing the XML.

    For efficiency do something like this:

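    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;

    public class RawConcat {
        public static void main(String[] args) {
            File dir = new File("/tmp/rootFiles");
            String[] files = dir.list();
            if (files == null) {
                System.out.println("No roots to merge!");
                return;
            }
            // Raw byte copy: the result is only well-formed XML if the input
            // files carry no <?xml ...?> declaration or DOCTYPE of their own.
            try (FileChannel output = new FileOutputStream("output").getChannel()) {
                // Opening tag of the wrapper element
                output.write(ByteBuffer.wrap("<rootSet>\n".getBytes(StandardCharsets.UTF_8)));
                for (String file : files) {
                    try (FileChannel in = new FileInputStream(new File(dir, file)).getChannel()) {
                        // transferTo may copy fewer bytes than requested, so loop until done
                        long pos = 0;
                        long size = in.size();
                        while (pos < size) {
                            pos += in.transferTo(pos, size - pos, output);
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Closing tag of the wrapper element
                output.write(ByteBuffer.wrap("</rootSet>\n".getBytes(StandardCharsets.UTF_8)));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }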
    
  • 2021-02-06 06:57

    For this kind of work I would suggest not using DOM; reading the file content and taking a substring is simpler and sufficient.

    I'm thinking of something like this:

    String rootContent = document.substring(document.indexOf("<root>"), document.lastIndexOf("</root>")+7);
    

    Then, to avoid too much memory consumption, write to the main file after every XML extraction, with a BufferedWriter for example. For better performance you can also use java.nio.
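
    A minimal sketch of that idea, assuming every input file holds exactly one attribute-free <root> element and is UTF-8 encoded. The class name SubstringMerge, the method names, the "*.xml" filter and the <rootSet> wrapper are illustrative choices, not something the question prescribes:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SubstringMerge {
    // Returns the text from "<root>" through the last "</root>", or null if absent.
    public static String extractRoot(String document) {
        int start = document.indexOf("<root>");
        int end = document.lastIndexOf("</root>");
        if (start < 0 || end < start) {
            return null;
        }
        return document.substring(start, end + "</root>".length());
    }

    // Writes each file's root element to the output as soon as it is extracted,
    // so only one input file is held in memory at a time.
    // Assumption: outputFile lives outside inputDir, otherwise it would be re-read.
    public static void merge(Path inputDir, Path outputFile) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8);
             DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.xml")) {
            writer.write("<rootSet>\n");
            for (Path file : files) {
                String document = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                String root = extractRoot(document);
                if (root != null) {
                    writer.write(root);
                    writer.newLine();
                }
            }
            writer.write("</rootSet>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        merge(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

    Because the match is on the literal string "<root>", a root tag with attributes or a namespace prefix would be skipped; that case needs a real parser.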

  • 2021-02-06 06:59

    I think what you're doing is valid. The only way to make it scale to really huge numbers of files is to use a text-based approach with streaming, so you never keep the whole thing in memory. But, hey! Good news. Memory is cheap these days, and 64-bit JVMs are all the rage, so maybe all you need is to increase the heap size. Try re-running your program with a -Xmx1g JVM option (sets the maximum heap size to 1 GB; -Xms1g would only set the initial size).
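
    A quick way to see how much heap the JVM is actually allowed to use (the ceiling controlled by -Xmx) is to ask the runtime. This is only a small sketch; the class name HeapCheck is made up for illustration:

```java
public class HeapCheck {
    // Maximum amount of heap the JVM will attempt to use, in megabytes.
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

    Running it once with and once without the heap option shows whether the setting took effect.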

    I also tend to use XOM for all my DOM requirements. Give it a go. Much more efficient. I don't know for sure about the memory requirements, but it's orders of magnitude faster in my experience.

  • 2021-02-06 07:02

    DOM needs to keep the whole document in memory. If you don't need to do any special operation with your tags, I would simply use an InputStream and read all the files. If you need to do some operations, then use SAX.
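
    One possible shape for the SAX route, sketched with the JDK's built-in JAXP classes: each file is parsed as a stream of SAX events, and those events are forwarded straight into a serializing TransformerHandler, so no document tree is ever built. The class name SaxMerge, the merge method and the <rootSet> wrapper element are assumptions made for the example:

```java
import java.io.File;
import java.io.Writer;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class SaxMerge {
    public static void merge(List<File> inputs, Writer out) throws Exception {
        // Identity TransformerHandler: turns incoming SAX events back into XML text.
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.setResult(new StreamResult(out));

        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);

        // One document start and one wrapper element for the whole merge.
        serializer.startDocument();
        serializer.startElement("", "rootSet", "rootSet", new AttributesImpl());

        for (File input : inputs) {
            XMLReader reader = pf.newSAXParser().getXMLReader();
            // Forward every event except the per-file document start/end,
            // which would otherwise terminate the merged output early.
            XMLFilterImpl filter = new XMLFilterImpl(reader) {
                @Override public void startDocument() { }
                @Override public void endDocument() { }
            };
            filter.setContentHandler(serializer);
            filter.parse(new InputSource(input.toURI().toASCIIString()));
        }

        serializer.endElement("", "rootSet", "rootSet");
        serializer.endDocument();
    }
}
```

    Comments and other lexical events are not forwarded in this sketch, since only the ContentHandler is wired up.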

  • 2021-02-06 07:11

    You might also consider using StAX. Here's code that would do what you want:

    import java.io.File;
    import java.io.FileWriter;
    import java.io.Writer;
    
    import javax.xml.stream.XMLEventFactory;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.events.XMLEvent;
    import javax.xml.transform.stream.StreamSource;
    
    public class XMLConcat {
        public static void main(String[] args) throws Throwable {
            File dir = new File("/tmp/rootFiles");
            File[] rootFiles = dir.listFiles();
    
            Writer outputWriter = new FileWriter("/tmp/mergedFile.xml");
            XMLOutputFactory xmlOutFactory = XMLOutputFactory.newFactory();
            XMLEventWriter xmlEventWriter = xmlOutFactory.createXMLEventWriter(outputWriter);
            XMLEventFactory xmlEventFactory = XMLEventFactory.newFactory();
    
            xmlEventWriter.add(xmlEventFactory.createStartDocument());
            xmlEventWriter.add(xmlEventFactory.createStartElement("", null, "rootSet"));
    
            XMLInputFactory xmlInFactory = XMLInputFactory.newFactory();
            for (File rootFile : rootFiles) {
                XMLEventReader xmlEventReader = xmlInFactory.createXMLEventReader(new StreamSource(rootFile));
                XMLEvent event = xmlEventReader.nextEvent();
                // Skip ahead in the input to the opening document element
                while (event.getEventType() != XMLEvent.START_ELEMENT) {
                    event = xmlEventReader.nextEvent();
                }
    
                do {
                    xmlEventWriter.add(event);
                    event = xmlEventReader.nextEvent();
                } while (event.getEventType() != XMLEvent.END_DOCUMENT);
                xmlEventReader.close();
            }
    
            xmlEventWriter.add(xmlEventFactory.createEndElement("", null, "rootSet"));
            xmlEventWriter.add(xmlEventFactory.createEndDocument());
    
            xmlEventWriter.close();
            outputWriter.close();
        }
    }
    

    One minor caveat is that this API seems to mess with empty tags, changing <foo/> into <foo></foo>.

  • 2021-02-06 07:15

    DOM does consume a lot of memory. You have, IMHO, the following alternatives.

    The best one is to use SAX. With SAX only a very small amount of memory is used, because essentially a single element is travelling from input to output at any given time, so the memory footprint is extremely low. However, SAX is not so simple to use, because compared to DOM it is a bit counterintuitive.

    Try StAX. I have not tried it myself, but it's a kind of SAX on steroids that is easier to implement and use: as opposed to just receiving SAX events you don't control, you actually "ask the source" to stream you the elements you want. It sits in the middle between DOM and SAX, with a memory footprint similar to SAX but a friendlier paradigm.

    SAX, StAX and DOM are all important if you want to correctly preserve and declare namespaces and other XML oddities.

    However, if you just need a quick and dirty way, which will probably be namespace compliant as well, use plain old strings and writers.

    Start by writing the declaration and the root element of your "big" document to the FileWriter. Then load each single file, using DOM if you like. Select the elements you want to end up in the "big" file, serialize them back to a string, and send them to the writer. The writer will flush to disk without using an enormous amount of memory, and DOM will load only one document per iteration. Unless you also have very big files on the input side, or plan to run it on a cellphone, you should not have a lot of memory problems. If DOM serializes it correctly, it should preserve namespace declarations and the like, and the code will be just a few lines longer than the one you posted.
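
    The last approach could look roughly like this. It is a sketch under assumptions: the whole document element of each file is kept, the wrapper is called <rootSet>, and the names DomPerFileMerge and merge are invented for the example. Only one DOM tree is alive per iteration, and it is serialized straight to the writer:

```java
import java.io.File;
import java.io.Writer;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;

public class DomPerFileMerge {
    // Assumption: the caller opened 'out' with UTF-8, matching the declaration below.
    public static void merge(List<File> inputs, Writer out) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();

        // Identity transformer used purely as a serializer; the per-file XML
        // declaration is suppressed so it does not appear mid-document.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<rootSet>\n");
        for (File input : inputs) {
            // Only this one document is held in memory during the iteration.
            Document doc = builder.parse(input);
            serializer.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
            out.write("\n");
        }
        out.write("</rootSet>\n");
        out.flush();
    }
}
```

    To keep only part of each document, replace doc.getDocumentElement() with whichever nodes you select and serialize those one by one.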
