Question
I want to use Clojure to extract the titles from a Wiktionary XML dump.

I used head -n10000 > out-10000.xml to create smaller versions of the original monster file, then trimmed them with a text editor to make them valid XML. I renamed the files according to the number of lines inside (wc -l):
(def data-9764 "data/wiktionary-en-9764.xml") ; 354K
(def data-99224 "data/wiktionary-en-99224.xml") ; 4.1M
(def data-995066 "data/wiktionary-en-995066.xml") ; 34M
(def data-7999931 "data/wiktionary-en-7999931.xml") ; 222M
Here is an overview of the XML structure:
<mediawiki>
  <page>
    <title>dictionary</title>
    <revision>
      <id>20100608</id>
      <parentid>20056528</parentid>
      <timestamp>2013-04-06T01:14:29Z</timestamp>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
</mediawiki>
Here is what I've tried, based on this answer to 'Clojure XML Parsing':
(ns example.core
  (:use [clojure.data.zip.xml :only (attr text xml->)])
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]))

(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)
        zipped (zip/xml-zip xml)]
    (xml-> zipped :page :title text)))
(count (titles data-9764))
; 38
(count (titles data-99224))
; 779
(count (titles data-995066))
; 5172
(count (titles data-7999931))
; OutOfMemoryError Java heap space java.util.Arrays.copyOfRange (Arrays.java:3209)
Am I doing something wrong in my code? Or is this perhaps a bug or limitation in the libraries I'm using? Based on REPL experimentation, it seems like the code I'm using is lazy. Underneath, Clojure uses a SAX XML parser, so that alone should not be the problem.
See also:
- Does clojure-xml/parse return a lazy sequence?
- Huge XML in Clojure
Update 2013-04-30:
I'd like to share some discussion from the Clojure IRC channel; I've pasted an edited version below. (I removed the user names, but if you want credit, just let me know and I'll edit and give you a link.)
The entire tag is read into memory at once in xml/parse, long before you even call count. And clojure.xml uses the ~lazy SAX parser to produce an eager concrete collection. Processing XML lazily requires a lot more work than you think, and it would be work you do, not some magic clojure.xml could do for you. Feel free to disprove by calling (count (xml/parse data-whatever)).
To summarize: even before zip/xml-zip comes into play, this call to xml/parse causes an OutOfMemoryError with a large enough file:

(count (xml/parse filename))
At present, I am exploring other XML processing options. At the top of my list is clojure.data.xml as mentioned at https://stackoverflow.com/a/9946054/109618.
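For reference, here is a rough sketch of what a clojure.data.xml version might look like. This is an assumption on my part, not something tested against the real dump: data.xml/parse returns a lazy element tree, the doall forces the (small) title seq before the reader closes, and depending on the library version and the dump's xmlns declaration the tag keywords may come back namespace-qualified rather than as plain :page and :title.

(require '[clojure.data.xml :as data.xml]
         '[clojure.java.io :as io])

(defn lazy-titles
  "Sketch: extract titles via clojure.data.xml's lazy parse."
  [filename]
  (with-open [rdr (io/reader filename)]
    (doall
     (for [page (:content (data.xml/parse rdr))
           :when (= :page (:tag page))
           el   (:content page)
           :when (= :title (:tag el))]
       (first (:content el))))))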
Answer 1:
It's a limitation of the zipper data structure. Zippers are designed for efficiently navigating trees of various sorts, supporting movement up/down/left/right in the tree hierarchy and in-place edits in near-constant time.
From any position in the tree, the zipper needs to be able to re-construct the original tree (with edits applied). To do that, it keeps track of the current node, the parent node, and all siblings to the left and right of the current node in the tree, making heavy use of persistent data structures.
The filter functions that you're using start at the left-most child of a node and work their way one-by-one to the right, testing predicates along the way. The zipper for the left-most child starts out with an empty vector for its left-hand siblings (note the :l [] part in the source for zip/down). Each time you move right, it adds the last node visited to the vector of left-hand siblings (:l (conj l node) in zip/right). By the time you arrive at the right-most child, you've built up an in-memory vector of all the nodes at that level of the tree, which, for a wide tree like yours, could cause an OOM error.
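You can watch this accumulation happen at the REPL. The snippet below uses a toy vector zipper rather than the XML one, and peeks at the :l key of the zipper's internal path map, which is an implementation detail of clojure.zip:

(require '[clojure.zip :as zip])

(def loc
  (-> (zip/vector-zip [:a :b :c :d])
      zip/down     ; at :a, left siblings: []
      zip/right    ; at :b, left siblings: [:a]
      zip/right))  ; at :c, left siblings: [:a :b]

(zip/node loc)    ; => :c
(:l (second loc)) ; => [:a :b], every already-visited sibling is retained

With millions of <page> siblings, that vector grows to hold the whole file's worth of parsed pages.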
As a workaround, if you know that the top-level element is just a container for a list of <page> elements, I'd suggest using the zipper to navigate within each page element and plain map to process the pages:
(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)]
    (map #(xml-> (zip/xml-zip %) :title text)
         (:content xml))))
So, basically, we're avoiding the zip abstraction for the top level of the overall XML input, and thus avoid its holding the entire XML in memory. This implies that for even bigger XML, where each first-level child is itself huge, we may have to skip the zipper at the second level of the XML structure as well, and so on...
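Note that xml-> yields a seq of matching titles per page, so this titles now returns a seq of (typically one-element) seqs; a flattening step like the hypothetical usage below gives the title strings themselves:

;; hypothetical usage, reusing data-995066 from the question
(take 3 (mapcat identity (titles data-995066)))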
Answer 2:
Looking at the source for xml-zip, it doesn't seem like it is entirely lazy:
(defn xml-zip
  "Returns a zipper for xml elements (as from xml/parse),
  given a root element"
  {:added "1.0"}
  [root]
  (zipper (complement string?)
          (comp seq :content)
          (fn [node children]
            (assoc node :content (and children (apply vector children))))
          root))
Note (apply vector children), which materializes the children seq into a vector (although it does not materialize the entire descendant tree, so the zipper is still lazy). If a node has a very large number of children (e.g., the children of <mediawiki>), then even this level of laziness is not enough: :content needs to be a seq too.
My knowledge of zippers is extremely limited, so I'm not sure why vector is being used here at all; see if replacing (assoc node :content (and children (apply vector children))) with (assoc node :content children) works, which should keep children as a normal sequence without materializing it. (For that matter, I'm not sure why (apply vector children) instead of (vec children)...)
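To make that experiment concrete, here is what such a variant might look like. This is a hypothetical sketch, not a drop-in fix; it may break invariants that other clojure.zip functions rely on, and it does nothing about the eagerness of xml/parse itself:

(require '[clojure.zip :as zip])

(defn lazier-xml-zip
  "Like clojure.zip/xml-zip, but keeps :content as a plain seq
  instead of materializing it into a vector."
  [root]
  (zip/zipper (complement string?)
              (comp seq :content)
              (fn [node children]
                (assoc node :content children))
              root))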
content-handler looks like it is building up all content elements as well, in *contents*, so the source of the OOM may be in the content-handler itself.
I'm not sure how we can reconcile the zipper interface (tree-like) with the streaming you want. It will work for large xml, but not huge xml.
In similar approaches in other languages (e.g. Python's iterparse), a tree is built up incrementally, much as with a zipper. The difference is that the tree is pruned after each element is successfully processed.
For example, in Python with iterparse you would listen for an endElement event on page (i.e. the point where </page> occurs in the XML). At that point you know you have a complete page element that you can process as a tree. After you are finished, you delete the element you just processed along with the sibling branches, which keeps memory usage under control.
Perhaps you can take this approach here as well. The node provided by the xml zipper is a var to an xml/element. The content handler could return a function that does cleanup on its *current* var when invoked. Then you could call it to prune the tree.
Alternatively, you could use SAX "by hand" in Clojure for the root element, and create a zipper for each page element as you encounter it.
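To sketch that last idea: the snippet below drives JAXP's SAX parser directly from Clojure. For brevity it skips the per-page zipper and just accumulates each <title>'s text in a StringBuilder; sax-titles and its internals are names made up for this sketch, and a fuller version would assemble an xml/element per <page> inside endElement and zip over it:

(require '[clojure.java.io :as io])
(import '[javax.xml.parsers SAXParserFactory]
        '[org.xml.sax.helpers DefaultHandler])

(defn sax-titles
  "Stream the file with SAX, collecting the text of every <title>."
  [filename]
  (let [titles    (atom [])
        buf       (StringBuilder.)
        in-title? (atom false)
        handler   (proxy [DefaultHandler] []
                    (startElement [uri local-name qname attrs]
                      (when (= "title" qname)
                        (.setLength buf 0)
                        (reset! in-title? true)))
                    (characters [ch start length]
                      (when @in-title?
                        (.append buf ch start length)))
                    (endElement [uri local-name qname]
                      (when (= "title" qname)
                        (reset! in-title? false)
                        (swap! titles conj (str buf)))))]
    (with-open [in (io/input-stream filename)]
      (.parse (.newSAXParser (SAXParserFactory/newInstance)) in handler))
    @titles))

Because nothing outside the current <title> is retained, memory stays flat no matter how many pages the dump contains.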
Source: https://stackoverflow.com/questions/16289991/outofmemoryerror-when-parsing-xml-in-clojure-with-data-zip