As the title says, I have a huge XML file (GBs)
...
...
I have had good experiences with STX (Streaming Transformations for XML). Basically, it is a streaming counterpart to XSLT, well suited to processing huge amounts of data with a minimal memory footprint. It has a Java implementation named Joost.
It should be easy to come up with an STX transform that ignores all elements until one matches a given XPath, copies that element and all its children (using an identity template within a template group), and then goes back to ignoring elements until the next match.
UPDATE
I hacked together an STX transform that does what I understand you want. It relies mostly on STX-only features such as template groups and configurable default templates.
<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
    version="1.0" pass-through="none" output-method="xml">
  <!-- Put your own match expression here. -->
  <stx:template match="element/child">
    <stx:process-self group="copy" />
  </stx:template>
  <!-- No templates of its own: the default templates copy everything. -->
  <stx:group name="copy" pass-through="all">
  </stx:group>
</stx:transform>
The pass-through="none" attribute on stx:transform configures the default templates (for nodes, attributes, etc.) to produce no output but still process child elements. The stx:template then matches the XPath element/child (this is where you put your own match expression) and "processes self" in the "copy" group, meaning that the matching template from the group named "copy" is invoked on the current element. That group has pass-through="all", so its default templates copy their input and process child elements. When the element/child element ends, control passes back to the template that invoked process-self, and the following elements are ignored again until the template matches the next time.
The following is an example input file:
<root>
<child attribute="no-parent, so no copy">
</child>
<element id="id1">
<child attribute="value1">
text1<b>bold</b>
</child>
</element>
<element id="id2">
<child attribute="value2">
text2
<x:childX xmlns:x="http://x.example.com/x">
<!-- comment -->
yet more<b i="i" x:i="x-i" ></b>
</x:childX>
</child>
</element>
</root>
This is the corresponding output file:
<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1">
text1<b>bold</b>
</child><child attribute="value2">
text2
<x:childX xmlns:x="http://x.example.com/x">
<!-- comment -->
yet more<b i="i" x:i="x-i" />
</x:childX>
</child>
The unusual formatting is a result of skipping the text nodes that contain newlines outside the child elements.
Have a look at StAX; it might be what you need. There's a good introduction on IBM developerWorks.
For such a large XML document, something with a streaming architecture, like OmniMark, would be ideal.
It wouldn't have to be anything complex either. An OmniMark script like the one below could give you what you need:
process
submit #main-input
macro upto (arg string) is
((lookahead not string) any)*
macro-end
find (("<keep") upto ("</keep>") "</keep>")=>keep
output keep
find any
Since you're talking about gigabytes, I would make memory usage the first consideration. DOM builds the whole document in memory and, as a rule of thumb, needs at least 5 times the document's size, so a 1 GB XML file would require a minimum of 5 GB of free memory. That's not funny anymore. SAX, by contrast, is event-based and never holds the whole document, so its memory use stays small no matter how large the file is. So SAX (or another streaming API such as StAX) is the best option here.
If you do need the whole document in memory, look at VTD-XML for the most memory-efficient approach. It requires only a little more memory than the size of the file.
Yes, just write a SAX content handler, and when it encounters a certain element, build a DOM tree for that element. I've done this with very large files, and it works very well.
It's actually very easy: as soon as you encounter the start of the element you want, you set a flag in your content handler, and from then on you forward everything to the DOM builder. When you encounter the end of the element, you reset the flag and write out the result.
(For more complex cases with nested elements of the same element name, you'll need to create a stack or a counter, but that's still quite easy to do.)
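The flag-plus-counter idea above could be sketched roughly as follows. This is only a minimal illustration using the JDK's built-in SAX and DOM classes, not the code I actually used: the class name SubtreeHandler, the element name "keep" and the sink callback are all made up for the example, it matches on the plain element name rather than on a path, and it is not namespace-aware.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds a small DOM tree for every element named
// targetName and hands it to sink; everything else is skipped.
class SubtreeHandler extends DefaultHandler {
    private final String targetName;
    private final Consumer<Element> sink;
    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    private Node current; // node currently being filled; null while skipping
    private int depth;    // open elements inside the subtree; 0 means "skipping"

    SubtreeHandler(String targetName, Consumer<Element> sink) {
        this.targetName = targetName;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth == 0 && !qName.equals(targetName)) {
            return; // outside any wanted element
        }
        Document doc;
        if (depth == 0) {
            try {
                doc = factory.newDocumentBuilder().newDocument(); // fresh tree per match
            } catch (ParserConfigurationException e) {
                throw new SAXException(e);
            }
            current = doc;
        } else {
            doc = current.getOwnerDocument();
        }
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(element);
        current = element;
        depth++; // the counter also copes with nested elements of the same name
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.appendChild(
                    current.getOwnerDocument().createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 0) {
            return;
        }
        depth--;
        if (depth == 0) {
            sink.accept((Element) current); // subtree complete: hand it over
            current = null;                 // and let it be garbage collected
        } else {
            current = current.getParentNode();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><skip>no</skip><keep id=\"1\">a<keep>b</keep></keep></root>";
        SubtreeHandler handler = new SubtreeHandler("keep",
                e -> System.out.println(e.getAttribute("id") + ": " + e.getTextContent()));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}
```

For a real multi-gigabyte file you would pass a File or InputStream to parse instead of a string, and the sink should process or write out each subtree immediately rather than collect them, so that only one small tree is in memory at a time.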
StAX would seem to be one obvious solution: it's a pull parser rather than either the "push" of SAX or the "buffer the whole thing" approach of DOM. Can't say I've used it though. A "StAX tutorial" search may come in handy :)
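For what it's worth, a pull-style version might look something like the sketch below. Treat it as an untested-in-production illustration built on the standard javax.xml.stream event API: the class name StaxSubtreeCopier and the element name "keep" are invented for the example, and it matches on the local element name only.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies every element named targetName (with its
// whole subtree) from in to out and drops everything else.
class StaxSubtreeCopier {
    static void copy(Reader in, Writer out, String targetName) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int depth = 0; // open elements inside a wanted subtree; 0 means "skipping"
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent(); // we pull; the parser does not push
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (depth > 0 || name.equals(targetName)) {
                    depth++;
                }
            }
            if (depth > 0) {
                writer.add(event); // inside a wanted subtree: copy the event as-is
            }
            if (event.isEndElement() && depth > 0) {
                depth--;
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><skip>no</skip><keep id=\"1\">a<b>x</b></keep></root>";
        StringWriter out = new StringWriter();
        copy(new StringReader(xml), out, "keep");
        System.out.println(out);
    }
}
```

Only the current event is held at any time, so memory use should not grow with the size of the file. Note that the result is a sequence of elements without a common root, so you may want to write a wrapper element around it yourself.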