Process large JSON stream with jq

生来不讨喜 2021-01-12 15:21

I get a very large JSON stream (several GB) from curl and try to process it with jq.

The relevant output I want to parse with jq

2 answers
  • 2021-01-12 15:43

    To get:

    {"key1": "row1", "key2": "row1"}
    {"key1": "row2", "key2": "row2"}
    

    From:

    {
      "results":[
        {
          "columns": ["n"],
          "data": [    
            {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
            {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
          ]
        }
      ],
      "errors": []
    }
    

    Do the following, which is equivalent to jq -c '.results[].data[].row[]', but using streaming:

    jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row") | del(.[0][0:5])))'
    

    What this does is:

    • Turn the JSON into a stream (with --stream)
    • Select the path .results[].data[].row[] (with select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row"))
    • Discard those initial parts of the path, like "results",0,"data",0,"row" (with del(.[0][0:5]))
    • And finally turn the resulting jq stream back into the expected JSON with the fromstream(1|truncate_stream(…)) pattern from the jq FAQ
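
To see what the first three steps do, it can help to print the raw events. The sketch below assumes jq 1.5 or later is installed and uses a hypothetical one-row input (not the input from the question) so that the event list stays short. Leaf events have the form [path, value]; closing events carry only [path].

```shell
# Hypothetical one-row input, chosen only to keep the event list short.
input='{"results":[{"data":[{"row":[{"key1":"row1"}]}]}]}'

# Print every streaming event: leaf events are [path, value],
# closing events are [path] only.
events=$(printf '%s' "$input" | jq -c --stream '.')
printf '%s\n' "$events"

# Keep only the events under .results[].data[].row and drop the
# five-element prefix "results",0,"data",0,"row" from each path.
trimmed=$(printf '%s' "$input" | jq -c --stream '
  select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row")
  | del(.[0][0:5])')
printf '%s\n' "$trimmed"
```

The first event printed is the leaf [["results",0,"data",0,"row",0,"key1"],"row1"], which the second command shortens to [[0,"key1"],"row1"]. The remaining leading 0 is the index inside the row array, and that is the level 1|truncate_stream(…) strips before fromstream rebuilds each object.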

    For example:

    echo '
      {
        "results":[
          {
            "columns": ["n"],
            "data": [    
              {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
              {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
            ]
          }
        ],
        "errors": []
      }
    ' | jq -cn --stream '
      fromstream(1|truncate_stream(
        inputs | select(
          .[0][0] == "results" and 
          .[0][2] == "data" and 
          .[0][4] == "row"
        ) | del(.[0][0:5])
      ))'
    

    Produces the desired output.

  • 2021-01-12 15:44

    (1) The vanilla (non-streaming) filter would be:

    jq -r -c '.results[0].data[].row'
    

    (2) One way to use the streaming parser here would be to use it to process the output of .results[0].data, but the combination of the two steps will probably be slower than the vanilla approach.
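
One possible reading of this two-step suggestion is sketched below; the exact pipeline is an assumption, not something the answer spells out. The first jq extracts .results[0].data without streaming, so it still parses the whole document, and the second jq streams that array and emits its elements one at a time.

```shell
# Illustrative input from the question, written on one line.
input='{"results":[{"columns":["n"],"data":[{"row":[{"key1":"row1","key2":"row1"}],"meta":[{"key":"value"}]},{"row":[{"key1":"row2","key2":"row2"}],"meta":[{"key":"value"}]}]}],"errors":[]}'

# Step 1: ordinary (non-streaming) extraction of the data array.
# Step 2: stream that array, rebuild each element, and keep its .row.
rows=$(printf '%s' "$input" \
  | jq -c '.results[0].data' \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs)) | .row')
printf '%s\n' "$rows"
```

Because step 1 loads everything into memory, this variant is unlikely to help with a multi-gigabyte input; it mainly illustrates how the two modes can be chained.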

    (3) To produce the output you want, you could run:

    jq -nc --stream '
      fromstream(inputs
        | select( [.[0][0,2,4]] == ["results", "data", "row"])
        | del(.[0][0:5]) )'
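
As a quick check, the sketch below runs this filter on the illustrative input (assuming jq 1.5 or later). Without truncate_stream, each row comes out still wrapped in its array, which matches what filter (1) produces. The second command is a variant, not part of the answer above, that adds 1|truncate_stream(…) to emit the bare objects instead.

```shell
# Illustrative input from the question, written on one line.
input='{"results":[{"columns":["n"],"data":[{"row":[{"key1":"row1","key2":"row1"}],"meta":[{"key":"value"}]},{"row":[{"key1":"row2","key2":"row2"}],"meta":[{"key":"value"}]}]}],"errors":[]}'

# Filter (3) as given: each .row array is rebuilt whole.
wrapped=$(printf '%s' "$input" | jq -nc --stream '
  fromstream(inputs
    | select([.[0][0,2,4]] == ["results", "data", "row"])
    | del(.[0][0:5]))')
printf '%s\n' "$wrapped"

# Variant: strip one more path level so each object is emitted on its own.
bare=$(printf '%s' "$input" | jq -nc --stream '
  fromstream(1|truncate_stream(inputs
    | select([.[0][0,2,4]] == ["results", "data", "row"])
    | del(.[0][0:5])))')
printf '%s\n' "$bare"
```

Which form is preferable depends on whether the consumer expects one array per row or one object per line.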
    

    (4) Alternatively, you may wish to try something along these lines:

    jq -nc --stream 'inputs
          | select(length==2)
          | select( [.[0][0,2,4]] == ["results", "data", "row"])
          | [ .[0][6], .[1]] '
    

    For the illustrative input, the output from the last invocation would be:

    ["key1","row1"]
    ["key2","row1"]
    ["key1","row2"]
    ["key2","row2"]
