Reading Huge XML File using StAX and XPath

轮回少年 2020-12-31 11:35

The input file contains thousands of transactions in XML format and is around 10GB in size. The requirement is to pick each transaction's XML based on the user input and sen

7 Answers
  • 2020-12-31 12:12

    Streaming Transformations for XML (STX) might be what you need.

  • 2020-12-31 12:19

    A fun solution for processing huge (>10GB) XML files:

    1. Use ANTLR to create byte offsets for the parts of interest. This saves some memory compared with a DOM-based approach.
    2. Use JAXB to read the parts from those byte positions.

    You can find details, using the example of Wikipedia dumps (17GB), in this SO answer: https://stackoverflow.com/a/43367629/1485527
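    As a rough, dependency-free sketch of step 2 (the class and method names are my own, and the JDK DOM parser stands in for JAXB to keep the example self-contained): seek to a previously indexed byte offset and parse only that slice.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class OffsetRead {
    // Parse only the `length` bytes at `offset` as a standalone XML fragment.
    static Document readFragment(File f, long offset, int length) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(offset);
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(buf));
        }
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in file; in practice the offsets come from the ANTLR index pass.
        String xml = "<transactions><txn id=\"1\">a</txn><txn id=\"2\">b</txn></transactions>";
        File f = File.createTempFile("txns", ".xml");
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        String wanted = "<txn id=\"2\">b</txn>";
        Document d = readFragment(f, xml.indexOf(wanted), wanted.length());
        System.out.println(d.getDocumentElement().getAttribute("id")); // prints 2
        f.delete();
    }
}
```

    With JAXB you would hand the same bounded stream to an `Unmarshaller` instead of a `DocumentBuilder`; the seek-and-slice part is identical.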

  • 2020-12-31 12:29

    If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]

    As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:

      Document document = getParser().parse( source );
    

    After this your 10GB of XML will be represented in memory (plus whatever overhead), which is probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: it would also build a DOM.

    If you truly need a DOM you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API, builds its tree over StAX (so it's fast), and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM), and this will be true of any XPath implementation, so if you don't truly need XPath I'd recommend sticking with the events parser alone.

    SAX is the old streaming API, and StAX is newer and a great deal faster. Using either the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This creates small bursts of events (element, attribute, text) that can be checked against your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events into a mini-DOM if you find the result easier to navigate. But that might be overkill if the markup is simple.

    This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.
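    A minimal sketch of that filtering idea with the JDK's StAX event API (the <txn> element and its "id" attribute are hypothetical stand-ins for your real markup and user-value check):

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class TxnFilter {
    // Stream the document, match on element name first, then on the user value;
    // only the matching <txn> produces a small burst of events we actually read.
    static String findTxn(Reader in, String wantedId) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement el = event.asStartElement();
                if ("txn".equals(el.getName().getLocalPart())) {
                    Attribute id = el.getAttributeByName(new QName("id"));
                    if (id != null && wantedId.equals(id.getValue())) {
                        return reader.getElementText(); // consume just this element
                    }
                    // No match: fall through and keep streaming without buffering.
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<transactions><txn id=\"1\">first</txn><txn id=\"2\">second</txn></transactions>";
        System.out.println(findTxn(new StringReader(xml), "2")); // prints second
    }
}
```

    Nothing but the current event is held in memory, so the footprint stays flat regardless of file size.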

  • 2020-12-31 12:29

    We regularly parse 1GB+ complex XML files by using a SAX parser which does exactly what you described: it extracts partial DOM trees that can be conveniently queried using XPath.

    I blogged about it here - it uses a SAX rather than a StAX parser, but it may be worth a look.
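    The same partial-DOM idea can be sketched with the JDK alone, assuming a made-up <txn>/<amount> layout and using StAX plus an identity transform in place of the author's SAX approach: stream to the wanted element, lift just that subtree into a mini-DOM, then run XPath against it.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import javax.xml.xpath.XPathFactory;

public class PartialDom {
    // Stream until the wanted element, copy only that subtree into a DOM,
    // then evaluate an XPath against the small in-memory tree.
    static String amountOfTxn(String xml, String id) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "txn".equals(reader.getLocalName())
                    && id.equals(reader.getAttributeValue(null, "id"))) {
                DOMResult mini = new DOMResult();
                // An identity transform over a StAXSource positioned at a
                // START_ELEMENT copies just that element's subtree.
                TransformerFactory.newInstance().newTransformer()
                        .transform(new StAXSource(reader), mini);
                return XPathFactory.newInstance().newXPath()
                        .evaluate("/txn/amount", mini.getNode());
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<transactions><txn id=\"1\"><amount>10</amount></txn>"
                   + "<txn id=\"2\"><amount>20</amount></txn></transactions>";
        System.out.println(amountOfTxn(xml, "2"));
    }
}
```

    Only one transaction's subtree ever exists as a DOM, so XPath convenience comes without the 10GB memory cost.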

  • 2020-12-31 12:31

    It's definitely a use case for XProc, with a streaming and parallel processing implementation like QuiXProc (http://code.google.com/p/quixproc).

    In this situation, you will have to use:

      <p:for-each>
        <p:iteration-source select="//transactions/txn"/>
        <!-- your processing on each small document -->
      </p:for-each>
    

    You can even wrap each of the resulting transformations with a single line of XProc:

      <p:wrap-sequence wrapper="transactions"/>
    

    Hope this helps.

  • 2020-12-31 12:32

    Do you need to process it fast, or do you need fast lookups in the data? These requirements call for different approaches.

    For fast reading of the whole data set, StAX will be fine.

    If you need fast lookups, you may need to load the data into a database, e.g. Berkeley DB XML.
