Parsing a large (30MB) JSON file with net.liftweb.json or scala.util.parsing.json gives OutOfMemoryException. Any recommendations?

问题

I have a JSON file containing quite a lot of test data, which I want to parse and push through an algorithm I'm testing. It's about 30MB in size, with a list of 60,000 or so elements. I initially tried the simple parser in scala.util.parsing.json, like so:

import scala.util.parsing.json.JSON
val data = JSON.parseFull(Source.fromFile(path) mkString)

Where path is just a string containing the path the big JSON file. That chugged away for about 45 minutes, then threw this:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Someone then pointed out to me that nobody uses this library and I should use Lift's JSON parser. So I tried this in my Scala REPL:

scala> import scala.io.Source
import scala.io.Source

scala> val s = Source.fromFile("path/to/big.json")
s: scala.io.BufferedSource = non-empty iterator

scala> val data = parse(s mkString)
java.lang.OutOfMemoryError: GC overhead limit exceeded

This time it only took about 3 minutes, but the same error.

So, obviously I could break the file up into smaller ones, iterate over the directory of JSON files and merge my data together piece-by-piece, but I'd rather avoid it if possible. Does anyone have any recommendations?

For further information -- I'd been working with this same dataset the past few weeks in Clojure (for visualization with Incanter) without issues. The following works perfectly fine:

user=> (use 'clojure.data.json)
nil
user=> (use 'clojure.java.io)
nil

user=> (time (def data (read-json (reader "path/to/big.json"))))
"Elapsed time: 19401.629685 msecs"
#'user/data

回答1:

Those messages indicate that the application is spending more than 98% of its time collecting garbage.

I'd suspect that Scala is generating a lot of short-lived objects, which is what is causing the excessive GCs. You can verify the GC performance by adding the -verbosegc command line switch to java.

The default max heap size on Java 1.5+ server VM is 1 GB (or 1/4 of installed memory, whichever is less), which should be sufficient for your purposes, but you may want to increase the new generation to see if that improves your performance. On the Oracle VM, this is done with the -Xmn option. Try setting the following environment variable:

$JAVA_OPTS=-server -Xmx1024m -Xms1024m -Xmn2m -verbosegc -XX:+PrintGCDetails

and re-running your application.

You should also check out this tuning guide for details.

回答2:

Try using Jerkson instead. Jerkson uses Jackson underneath, which repeatedly scores as the fastest and most memory efficient JSON parser on the JVM.

I've used both Lift JSON and Jerkson in production, and Jerkson's performance was signficantly better than Lift's (especially when parsing and generating large JSON documents).

来源：https://stackoverflow.com/questions/8898353/parsing-a-large-30mb-json-file-with-net-liftweb-json-or-scala-util-parsing-jso

标签

json

scala

lift