What XML parser do you recommend for the following purpose?
The XML file (formatted, containing whitespace) is around 800 MB. It mostly contains three types of tag (let's call them n, w and r). They have an attribute called id which I'd have to search for, as fast as possible.
Perhaps you should take a look at VTD-XML: http://en.wikipedia.org/wiki/VTD-XML (see http://sourceforge.net/projects/vtd-xml/ for download).
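A minimal sketch of what an id lookup with VTD-XML might look like, assuming the com.ximpleware API and a hypothetical id value of "w42"; note that VTD-XML keeps the whole document plus its index in memory, so an 800 MB file needs a correspondingly large heap:

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdSearch {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        // parseFile loads and indexes the whole document; 'false' disables namespace awareness.
        if (!gen.parseFile("big.xml", false)) {
            System.err.println("Parse failed");
            return;
        }
        VTDNav nav = gen.getNav();
        AutoPilot ap = new AutoPilot(nav);
        ap.selectXPath("//*[@id='w42']");   // hypothetical id to search for
        int i;
        while ((i = ap.evalXPath()) != -1) {
            System.out.println("Match at token index " + i + ": " + nav.toString(i));
        }
    }
}
```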
I know it's blasphemy, but have you considered awk or grep for preprocessing? I know you can't truly parse XML or detect errors in nested structures that way, but perhaps your XML happens to be regular enough that it works anyway?
I know that XSLT could be used. Or are there any easy alternatives?
As far as I know, XSLT processors operate on a DOM tree of the source document, so they would need to parse and load the entire document into memory; that's probably not a good idea for a document this large (or perhaps you have enough memory for that?). There is something called streaming XSLT, but the technique is quite young and there aren't many implementations around, none of them free AFAIK, though you could give them a try.
XSLT tends to be comparatively fast even for large files. For large files, the trick is to avoid building the DOM first: pass a URL-based source or a stream source to the transformer instead.
To strip the empty nodes and unwanted attributes, start with the identity transform template and filter them out. Then use XPath to search for the tags you need.
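A minimal sketch of the streaming setup in Java, assuming a hypothetical stylesheet file strip.xsl that contains the identity transform plus your filtering templates; note that whether the processor truly streams (rather than building an internal tree) depends on the XSLT implementation:

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StreamTransform {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // strip.xsl is a hypothetical stylesheet: the identity transform plus
        // templates that drop empty nodes and unwanted attributes.
        Transformer transformer = factory.newTransformer(new StreamSource(new File("strip.xsl")));
        // Pass the 800 MB input as a StreamSource so no DOM is built for the source document.
        transformer.transform(new StreamSource(new File("big.xml")),
                              new StreamResult(new File("stripped.xml")));
    }
}
```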
You could also try a bunch of variations:
Split the large XML file into smaller ones while preserving its overall composition using XInclude. It is much like splitting a large source file into smaller ones and pulling them together with an include "x.h" kind of concept. This way you may not have to deal with one huge file at all.
When you run your XML through the identity transform, use it to assign a UNID to each node of interest using the generate-id() function.
Build a front-end database table for searching. Use the UNID generated above to quickly pinpoint the location of the data in the file (a minimal sketch of such an index follows after this list).
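A minimal sketch of the lookup-table idea, assuming a plain in-memory map instead of a real database table and using StAX character offsets as the "location"; both choices, like the file name, are illustrative assumptions:

```java
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class BuildIndex {
    public static void main(String[] args) throws Exception {
        // Maps each id attribute to the character offset where its element starts.
        Map<String, Long> index = new HashMap<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("big.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String id = reader.getAttributeValue(null, "id");
                    if (id != null) {
                        index.put(id, (long) reader.getLocation().getCharacterOffset());
                    }
                }
            }
            reader.close();
        }
        System.out.println("Indexed " + index.size() + " ids");
        // A real setup would persist this map (or load it into a database table)
        // so later lookups can seek straight to the right region of the file.
    }
}
```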
"I could split it into three files"
Try XmlSplit. It is a command-line program with options for specifying where to split by element, attribute, etc. Google it and you should find it. It is very fast, too.
I'm using XMLStarlet ( http://xmlstar.sourceforge.net/ ) for working with huge XML files. There are versions for both Linux and Windows.
Large XML files and Java heap space are a vexed issue. StAX works on big files; it certainly handles 1 GB without batting an eyelid. There's a useful article on using StAX on XML.com which got me up and running with it in about 20 minutes.
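A minimal StAX sketch using the standard javax.xml.stream API; the file name and the target id ("w42") are assumptions for illustration. It streams through the file and reports every n, w or r element whose id matches:

```java
import java.io.FileInputStream;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSearch {
    public static void main(String[] args) throws Exception {
        Set<String> wanted = Set.of("n", "w", "r");   // the three tag types from the question
        String targetId = "w42";                      // hypothetical id to search for
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("big.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && wanted.contains(reader.getLocalName())
                        && targetId.equals(reader.getAttributeValue(null, "id"))) {
                    System.out.println("Found <" + reader.getLocalName() + "> with id " + targetId);
                }
            }
            reader.close();
        }
    }
}
```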
As Bouman has pointed out, treating this as pure text processing will give you the best possible speed.
To process this as XML, the only practical way is to use a SAX parser. The Java API's built-in SAX parser is perfectly capable of handling this, so there is no need to install any third-party libraries.
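A minimal SAX sketch using only the JDK's built-in parser; the file name and the id being searched for ("w42") are illustrative assumptions:

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSearch {
    public static void main(String[] args) throws Exception {
        String targetId = "w42";   // hypothetical id to search for
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("big.xml"), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs)
                    throws SAXException {
                // The n, w and r elements carry an id attribute; report the one we want.
                if (targetId.equals(attrs.getValue("id"))) {
                    System.out.println("Found <" + qName + "> with id " + targetId);
                }
            }
        });
    }
}
```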