Improving performance when using jq to process large files

走了就别回头了 2021-02-19 00:36

Use Case

I need to split large files (~5G) of JSON data into smaller files with newline-delimited JSON in a memory-efficient way (i.e., without having to read the entire file into memory).

2 Answers
  •  爱一瞬间的悲伤
    2021-02-19 01:16

    Restrictions

    In the general case, JSON needs to be parsed with a tool that understands JSON. You can make an exception and follow these suggestions only if you are sure that:

    • You have an array with flat JSON objects (like in the use case) without nested objects.

    • Curly braces do not exist anywhere inside the objects, meaning you don't have any content like this: {id:1, name:"foo{bar}"} (a quick check for this is sketched right after this list).
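
    A rough way to verify both conditions on your own data (a sketch only; file.json is a placeholder name, and jq is used once just for the comparison):

    # If every brace in the file belongs to a top-level object, both brace counts
    # match the object count reported by jq; any mismatch means nested objects or
    # braces inside string values, and the shell-only approach is not safe.
    open=$(tr -cd '{' < file.json | wc -c)
    close=$(tr -cd '}' < file.json | wc -c)
    objects=$(jq length file.json)
    if [ "$open" -eq "$objects" ] && [ "$close" -eq "$objects" ]; then
        echo "shell approach looks safe"
    else
        echo "braces appear inside values, use a JSON parser"
    fi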


    Use the shell

    If the above conditions are met, you can use the shell to convert the array to JSONL and split it into smaller files, and this will be many times faster than JSON parsing or full text processing. Additionally, it can use almost no memory, especially if you work with coreutils, with or without some sed or awk.

    Even the simpler approach:

    grep -o '{[^}]*}' file.json
    

    will be faster, but will need some memory (less than jq).
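
    If you want that output split into smaller files right away, the same grep can be piped straight into split (a sketch only; the chunk size and the file.json / path/to/file/prefix names are placeholders):

    grep -o '{[^}]*}' file.json |\
        split -l 1000 -d --additional-suffix='.json' - path/to/file/prefix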

    The sed commands you have tried are fast but need memory, because sed, as a stream editor, reads line by line; if the file has no newlines at all, it loads all of it into memory and needs 2-3 times the size of its longest line. But if you first split the stream into lines with newlines, using coreutils like tr, cut, etc., then memory usage stays extremely low and performance is great.
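
    To see that difference on your own machine, here is a small sketch using the same GNU time peak-memory field (%M) as the tests below; big.json stands for a large single-line JSON file, and bash is assumed so that $'\n' is understood:

    # sed alone must buffer the whole newline-free file in memory:
    command time -f "sed alone:     %M KB" \
        sed 's/}, {/}\n{/g' big.json > /dev/null
    # when cut breaks the stream into lines first, sed only ever holds one short line:
    cut -d '}' -f1- --output-delimiter="}"$'\n' big.json |
        command time -f "cut, then sed: %M KB" sed 's/^, //' > /dev/null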


    Solution

    After some testing, I found this one to be the fastest and essentially memoryless. Besides that, it doesn't depend on the extra characters outside the objects, such as a comma and a few spaces, or a comma alone. It will only match the objects {...} and print each of them on a new line.

    #!/bin/bash -
    LC_ALL=C < "$1" cut -d '}' -f1- --output-delimiter="}"$'\n' |\
        cut -sd '{' -f2 | sed 's/^/{/' > "$2"
    

    To split the JSONL into smaller files, use split with -l rather than -c, so that you never cut an object in half; use something like this:

    split -l 1000 -d --additional-suffix='.json' - path/to/file/prefix
    

    or all together

    #!/bin/bash -
    n=1000
    LC_ALL=C < "$1" cut -d '}' -f1- --output-delimiter="}"$'\n' |\
        cut -sd '{' -f2 | sed 's/^/{/' |\
        split -l "$n" -d --additional-suffix='.json' - "$2"
    

    Usage:

    sh script.sh input.json path/to/new/files/output
    

    will create files output00.json, output01.json, etc. in the selected path (GNU split -d numbers the files starting from 00 with two-digit suffixes by default).
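
    Roughly what you should see afterwards (a hypothetical listing; the object shown on the last line assumes the sample data generated in the testing section below):

    ls path/to/new/files/
    # output00.json  output01.json  output02.json  ...
    head -n 1 path/to/new/files/output00.json
    # {"id": 1, "name": "foo"}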

    Note: If your stream contains non-UTF-8 multi-byte characters, remove LC_ALL=C; it is just a small speed optimization and is not necessary.

    Note: I have assumed input with no newlines at all, or with newlines like in your first use case. To generalize and handle newlines anywhere in the file, I add a small modification: in this version tr deletes all newlines first, with almost no impact on performance:

    #!/bin/bash -
    LC_ALL=C < "$1" tr -d '\n' |\
        cut -d '}' -f1- --output-delimiter="}"$'\n' |\
        cut -sd '{' -f2 | sed 's/^/{/' > "$2"
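
    Whichever variant you run, a quick sanity check is to compare the object count jq reports for the input with the number of JSONL lines produced (a sketch; input.json and output.jsonl stand for your own "$1" and "$2"):

    # one-off jq run just for the count (it needs memory, see the tests below)
    jq length input.json
    # lines in a single-file output ...
    wc -l < output.jsonl
    # ... or across all split parts
    cat path/to/new/files/output*.json | wc -l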
    

    Testing

    Here are some test results. They are representative; times were similar across all runs.

    Here is the script I used, generating input for various values of n:

    #!/bin/bash
    
    make_json() {
        awk -v n=2000000 'BEGIN{
            x = "{\"id\": 1, \"name\": \"foo\"}"
            printf "["
            for (i=1;i<n;i++) printf "%s, ", x
            printf "%s]", x
        }' > big.json
        return 0
    }
    
    tf="Real: %E  System: %S  User: %U  CPU%%: %P  Maximum Memory: %M KB\n"
    make_json
    
    for i in {1..7}; do
        printf "\n==> "
        cat "${i}.sh"
        command time -f "$tf" sh "${i}.sh" big.json "output${i}.json"
    done
    

    I used smaller files when testing together with jq, because it goes into swap early. Then I tested larger files using only the efficient solutions.

    ==> LC_ALL=C jq -c '.[]' "$1" > "$2"
    Real: 0:16.26  System: 1.46  User: 14.74  CPU%: 99%  Maximum Memory: 1004200 KB
    
    
    ==> LC_ALL=C jq length "$1" > /dev/null
    Real: 0:09.19  System: 1.30  User: 7.85  CPU%: 99%  Maximum Memory: 1002912 KB
    
    
    ==> LC_ALL=C < "$1" sed 's/^\[//; s/}[^}]*{/}\n{/g; s/]$//' > "$2"
    Real: 0:02.21  System: 0.33  User: 1.86  CPU%: 99%  Maximum Memory: 153180 KB
    
    
    ==> LC_ALL=C < "$1" grep -o '{[^}]*}' > "$2"
    Real: 0:02.08  System: 0.34  User: 1.71  CPU%: 99%  Maximum Memory: 103064 KB
    
    
    ==> LC_ALL=C < "$1" awk -v RS="}, {" -v ORS="}\n{" '1' |\
        head -n -1 | sed '1 s/^\[//; $ s/]}$//' > "$2"
    Real: 0:01.38  System: 0.32  User: 1.52  CPU%: 134%  Maximum Memory: 3468 KB
    
    
    ==> LC_ALL=C < "$1" cut -d "}" -f1- --output-delimiter="}"$'\n' |\
        sed '1 s/\[//; s/^, //; $d;' > "$2"
    Real: 0:00.94  System: 0.24  User: 0.99  CPU%: 131%  Maximum Memory: 3488 KB
    
    
    ==> LC_ALL=C < "$1" cut -d '}' -f1- --output-delimiter="}"$'\n' |\
        cut -sd '{' -f2 | sed 's/^/{/' > "$2"
    Real: 0:00.63  System: 0.28  User: 0.86  CPU%: 181%  Maximum Memory: 3448 KB
    
    # Larger files testing
    
    ==> LC_ALL=C < "$1" grep -o '{[^}]*}' > "$2"
    Real: 0:20.99  System: 2.98  User: 17.80  CPU%: 99%  Maximum Memory: 1017304 KB
    
    
    ==> LC_ALL=C < "$1" awk -v RS="}, {" -v ORS="}\n{" '1' |\
        head -n -1 | sed '1 s/^\[//; $ s/]}$//' > "$2"
    Real: 0:16.44  System: 2.96  User: 15.88  CPU%: 114%  Maximum Memory: 3496 KB
    
    
    ==> LC_ALL=C < "$1" cut -d "}" -f1- --output-delimiter="}"$'\n' |\
        sed '1 s/\[//; s/^, //; $d;' > "$2"
    Real: 0:09.34  System: 1.93  User: 10.27  CPU%: 130%  Maximum Memory: 3416 KB
    
    
    ==> LC_ALL=C < "$1" cut -d '}' -f1- --output-delimiter="}"$'\n' |\
        cut -sd '{' -f2 | sed 's/^/{/' > "$2"
    Real: 0:07.22  System: 2.79  User: 8.74  CPU%: 159%  Maximum Memory: 3380 KB
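
    To confirm that the fast pipelines actually produced valid JSONL, jq itself can be run on the result; for example, the output of the last (fastest) script can be checked like this (a sketch, reusing the output7.json name from the test loop and the 2,000,000 generated objects):

    # jq parses every line; the count should equal the number of generated objects
    jq -c . output7.json | wc -l    # expect 2000000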
    
    
