I need to split large files (~5 GB) of JSON data into smaller files with newline-delimited JSON in a memory-efficient way (i.e., without having to read the entire file into memory).
jq's streaming parser (the one invoked with the --stream command-line option) intentionally sacrifices speed for the sake of reduced memory requirements, as illustrated below in the metrics section. A tool which strikes a different balance, and which seems to be closer to what you're looking for, is jstream: https://github.com/bcicen/jstream
Run the following sequence of commands in a bash or bash-like shell:
cd
go get github.com/bcicen/jstream
cd go/src/github.com/bcicen/jstream/cmd/jstream/
go build
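Alternatively, on newer Go toolchains (1.16 or later, where modules are the default and go get no longer builds commands), a single command should suffice, assuming the repository layout is unchanged:

go install github.com/bcicen/jstream/cmd/jstream@latest

This places the jstream binary in $(go env GOPATH)/bin.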
Either way, you will end up with an executable, which you can invoke like so:
jstream -d 1 < INPUTFILE > STREAM
Assuming INPUTFILE contains a (possibly ginormous) JSON array, the above will behave like jq's .[] filter run with jq's -c (compact output) command-line option. In fact, this is also the case if INPUTFILE contains a stream of JSON arrays, or a stream of JSON non-scalars ...
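As a quick sanity check (this assumes jstream is on your PATH; the output shown follows from the description above):

$ echo '[{"a":1},{"a":2}]' | jstream -d 1
{"a":1}
{"a":2}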
For the task at hand (streaming the top-level items of an array), here are the key metrics (mrss = maximum resident set size; u+s = user + system CPU time in seconds):

                  mrss      u+s
jq --stream:      2 MB      447
jstream:          8 MB      114
jq:           5,582 MB       39
In words:

space: jstream is economical with memory, but not as economical as jq's streaming parser.

time: jstream runs about 3 times slower than jq's regular parser, but about 4 times faster than jq's streaming parser.

Interestingly, space*time is about the same for the two streaming parsers.
The test file consists of an array of 10,000,000 simple objects:
[
{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
,{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
...
]
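For reproducibility, a file of the same shape can be generated with something like the following awk sketch. This is an assumption about how the original was built, and the exact byte and line counts may differ slightly:

awk 'BEGIN {
  # one literal object, repeated to match the structure shown above
  obj = "{\"key_one\": 0.13888342355537053, \"key_two\": 0.4258700286271502, \"key_three\": 0.8010012924267487}"
  print "["
  print obj                          # first element: no leading comma
  for (i = 2; i <= 10000000; i++)
    print "," obj                    # subsequent elements: leading comma
  print "]"
}' > input.json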
$ ls -l input.json
-rw-r--r-- 1 xyzzy staff 980000002 May 2 2019 input.json
$ wc -l input.json
10000001 input.json
$ /usr/bin/time -l jq empty input.json
43.91 real 37.36 user 4.74 sys
4981452800 maximum resident set size
$ /usr/bin/time -l jq length input.json
10000000
48.78 real 41.78 user 4.41 sys
4730941440 maximum resident set size
$ /usr/bin/time -l jq type input.json
"array"
37.69 real 34.26 user 3.05 sys
5582196736 maximum resident set size
$ /usr/bin/time -l jq 'def count(s): reduce s as $i (0;.+1); count(.[])' input.json
10000000
39.40 real 35.95 user 3.01 sys
5582176256 maximum resident set size
$ /usr/bin/time -l jq -cn --stream 'fromstream(1|truncate_stream(inputs))' input.json | wc -l
449.88 real 444.43 user 2.12 sys
2023424 maximum resident set size
10000000
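For reference, fromstream(1|truncate_stream(inputs)) reassembles the depth-1 items from jq's [path, value] event stream; a minimal illustration of the same idiom on a small input:

$ echo '[[1,2],[3,4]]' | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
[1,2]
[3,4]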
$ /usr/bin/time -l jstream -d 1 < input.json > /dev/null
61.63 real 79.52 user 16.43 sys
7999488 maximum resident set size
$ /usr/bin/time -l jstream -d 1 < input.json | wc -l
77.65 real 93.69 user 20.85 sys
7847936 maximum resident set size
10000000
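Finally, to address the original task of splitting the data into smaller newline-delimited files: since jstream emits one JSON value per line, its output can be piped straight into split. The line count and filename prefix here are illustrative:

$ jstream -d 1 < input.json | split -l 1000000 - part_

Each resulting part_* file then holds up to 1,000,000 newline-delimited JSON objects.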