Improving performance when using jq to process large files

走了就别回头了 2021-02-19 00:36

Use Case

I need to split large files (~5G) of JSON data into smaller files with newline-delimited JSON in a memory efficient way (i.e., without having to r

2 Answers
  •  我寻月下人不归
    2021-02-19 01:18

    jq's streaming parser (the one invoked with the --stream command-line option) intentionally sacrifices speed for the sake of reduced memory requirements, as illustrated below in the metrics section. A tool which strikes a different balance (one which seems to be closer to what you're looking for) is jstream, the homepage of which is https://github.com/bcicen/jstream

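    Running the following sequence of commands in a bash or bash-like shell (this assumes a GOPATH-style Go setup, in which go get places the sources under ~/go/src):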

    cd
    go get github.com/bcicen/jstream
    cd go/src/github.com/bcicen/jstream/cmd/jstream/
    go build
    

    will result in an executable, which you can invoke like so:

    jstream -d 1 < INPUTFILE > STREAM
    

    Assuming INPUTFILE contains a (possibly ginormous) JSON array, the above will behave like jq's .[], with jq's -c (compact) command-line option. In fact, this is also the case if INPUTFILE contains a stream of JSON arrays, or a stream of JSON non-scalars ...
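
    Because the output is one compact JSON value per line, producing smaller newline-delimited files then comes down to splitting by line count. A minimal sketch, assuming a POSIX split is available; the helper name split_ndjson, the part_ prefix and the chunk size are hypothetical, and the printf input merely stands in for the output of jstream -d 1:

```shell
# split_ndjson LINES PREFIX: read newline-delimited JSON on stdin and write
# files PREFIXaa, PREFIXab, ... holding at most LINES records each.
# (Hypothetical helper; it relies only on POSIX split.)
split_ndjson() {
  split -l "$1" - "$2"
}

# Tiny stand-in for the output of: jstream -d 1 < INPUTFILE
workdir=$(mktemp -d)
printf '%s\n' '{"id":1}' '{"id":2}' '{"id":3}' '{"id":4}' '{"id":5}' |
  split_ndjson 2 "$workdir/part_"

ls "$workdir"   # lists part_aa, part_ab and part_ac
```

    With real data the pipeline would be along the lines of jstream -d 1 < INPUTFILE | split_ndjson 1000000 part_. POSIX split's default two-letter suffix leaves room for 676 files; the -a option lengthens the suffix if more are needed.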

    Illustrative space-time metrics

    Summary

    For the task at hand (streaming the top-level items of an array):

    (mrss = maximum resident set size; u+s = user + sys CPU time, in seconds)

                      mrss   u+s (s)
    jq --stream:      2 MB   447
    jstream    :      8 MB   114
    jq         :  5,582 MB    39
    

    In words:

    1. space: jstream is economical with memory, but not as much as jq's streaming parser.

    2. time: measured in CPU time (user + sys), jstream is roughly 3 times slower than jq's regular parser (114 s vs. 39 s) but about 4 times faster than jq's streaming parser (447 s).

    Interestingly, space*time is about the same for the two streaming parsers.
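
    The space*time observation can be checked directly from the summary figures. A small sketch, assuming a POSIX awk; the function name space_time is hypothetical, and the inputs are the rounded MB and second values from the table:

```shell
# space_time MB SECONDS: print peak memory (MB) multiplied by CPU time (s).
# (Hypothetical helper, used only to redo the arithmetic from the summary.)
space_time() {
  awk -v mb="$1" -v s="$2" 'BEGIN { print mb * s }'
}

space_time 2 447      # jq --stream
space_time 8 114      # jstream
space_time 5582 39    # jq (regular parser)
```

    The two streaming parsers come out at 894 and 912 MB*s respectively, while the regular parser lands at 217698 MB*s, more than two orders of magnitude higher.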

    Characterization of the test file

    The test file consists of an array of 10,000,000 simple objects:

    [
    {"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
    ,{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
    ...
    ]
    
    $ ls -l input.json
    -rw-r--r--  1 xyzzy  staff  980000002 May  2  2019 input.json
    
    $ wc -l input.json
     10000001 input.json
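
    A file with the same general shape can be generated for experimentation. A rough sketch, assuming a POSIX awk; the function name make_test_array is hypothetical, the values are pseudo-random rather than the ones used above, and the output ends with a newline after the closing bracket (so wc -l reports one line more than for the file described here):

```shell
# make_test_array N: print a JSON array of N small objects, one per line,
# laid out like the test file: "[" first, a leading comma on every object
# after the first, and "]" last. (Hypothetical helper; values come from rand().)
make_test_array() {
  awk -v n="$1" 'BEGIN {
    srand(1)                      # fixed seed so repeated runs match
    print "["
    for (i = 0; i < n; i++) {
      printf "%s{\"key_one\": %.16f, \"key_two\": %.16f, \"key_three\": %.16f}\n",
             (i ? "," : ""), rand(), rand(), rand()
    }
    print "]"
  }'
}

make_test_array 3
```

    For a file comparable in size to the one measured here, something like make_test_array 10000000 > input.json would be the starting point.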
    

    jq times and mrss

    $ /usr/bin/time -l jq empty input.json
           43.91 real        37.36 user         4.74 sys
    4981452800  maximum resident set size
    
    $ /usr/bin/time -l jq length input.json
    10000000
           48.78 real        41.78 user         4.41 sys
    4730941440  maximum resident set size
    
    $ /usr/bin/time -l jq type input.json
    "array"
           37.69 real        34.26 user         3.05 sys
    5582196736  maximum resident set size
    
    $ /usr/bin/time -l jq 'def count(s): reduce s as $i (0;.+1); count(.[])' input.json
    10000000
           39.40 real        35.95 user         3.01 sys
    5582176256  maximum resident set size
    
    $ /usr/bin/time -l jq -cn --stream 'fromstream(1|truncate_stream(inputs))' input.json | wc -l
          449.88 real       444.43 user         2.12 sys
       2023424  maximum resident set size
     10000000
    

    jstream times and mrss

    $ /usr/bin/time -l jstream -d 1 < input.json > /dev/null
           61.63 real        79.52 user        16.43 sys
       7999488  maximum resident set size
    
    $ /usr/bin/time -l jstream -d 1 < input.json | wc -l
           77.65 real        93.69 user        20.85 sys
       7847936  maximum resident set size
     10000000
    
    
