Process huge GeoJSON file with jq

庸人自扰 2021-01-24 07:26

Given a GeoJSON file as follows:

{
  "type": "FeatureCollection",
  "features": [
   {
     "type": "Feature",
     "properties": {
     "FEATCODE


        
4 answers
  • 2021-01-24 07:33

    An alternative solution could be for example:

    jq '.features |= map_values(.tippecanoe.minzoom = 13)'
    

    To test this, I created a sample JSON as

    d = {'features': [{"type":"Feature", "properties":{"FEATCODE": 15014}} for i in range(0,N)]}
    

    and measured the execution time as a function of N. Interestingly, the map_values approach appears to scale linearly in N, whereas .features[].tippecanoe.minzoom = 13 exhibits quadratic behavior: for N=50000, the former finishes in about 0.8 seconds, while the latter needs around 47 seconds.
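    The measurement described above could be reproduced along the following lines. This is only a sketch: the helper names make_sample and time_filter are made up for illustration, jq is assumed to be on the PATH, and wall-clock timing of a subprocess is a rough proxy for the numbers quoted.

```python
import json
import os
import shutil
import subprocess
import tempfile
import time


def make_sample(n):
    """Build a dict with n identical features, as in the sample above."""
    return {
        "features": [
            {"type": "Feature", "properties": {"FEATCODE": 15014}}
            for _ in range(n)
        ]
    }


def time_filter(jq_filter, n):
    """Return the wall-clock seconds jq takes to apply jq_filter to n features."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(make_sample(n), f)
        path = f.name
    try:
        start = time.perf_counter()
        # Discard jq's output; only the elapsed time matters here.
        subprocess.run(["jq", "-c", jq_filter, path],
                       stdout=subprocess.DEVNULL, check=True)
        return time.perf_counter() - start
    finally:
        os.remove(path)


# Only run the comparison when jq is actually installed.
if __name__ == "__main__" and shutil.which("jq"):
    for n in (1000, 2000, 4000):
        fast = time_filter(".features |= map_values(.tippecanoe.minzoom = 13)", n)
        slow = time_filter(".features[].tippecanoe.minzoom = 13", n)
        print(f"N={n}: map_values {fast:.2f}s, path assignment {slow:.2f}s")
```

    Doubling N and watching whether the time roughly doubles or quadruples is usually enough to tell linear from quadratic growth.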

    Alternatively, one might just do it manually with, e.g., Python:

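    import json
    import sys
    
    # Usage: pass the input path as the first argument and the output path as the second.
    with open(sys.argv[1], 'r') as F:
        data = json.load(F)
    
    for feature in data['features']:
        # Merge into any existing "tippecanoe" object, as the jq filter does,
        # and give each feature its own dict rather than one shared instance.
        feature.setdefault("tippecanoe", {})["minzoom"] = 13
    
    with open(sys.argv[2], 'w') as F:
        json.dump(data, F)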
    
  • 2021-01-24 07:37

    A one-pass jq-only approach may require more RAM than is available. If so, a simple multi-stage jq-only pipeline is shown below, together with a more economical approach that combines jq with awk.

    The two approaches are the same except for the reconstitution of the stream of objects into a single JSON document. This step can be accomplished very economically using awk.

    In both cases, the large JSON input file with objects of the required form is assumed to be named input.json.

    jq-only

    jq -c  '.features[]' input.json |
        jq -c '.tippecanoe.minzoom = 13' |
        jq -c -s '{type: "FeatureCollection", features: .}'
    

    jq and awk

    jq -c '.features[]' input.json |
       jq -c '.tippecanoe.minzoom = 13' | awk '
         BEGIN {print "{\"type\": \"FeatureCollection\", \"features\": ["; }
         NR==1 { print; next }
               {print ","; print}
         END   {print "] }";}'
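    For readers without awk, the reassembly step can also be sketched in Python. This is an illustrative equivalent rather than part of the measured solutions: the function name reassemble is made up here, and the input is assumed to be one compact JSON object per line, as produced by jq -c.

```python
import io
import json


def reassemble(lines, out):
    """Write newline-delimited features from `lines` to `out` as one FeatureCollection."""
    out.write('{"type": "FeatureCollection", "features": [\n')
    first = True
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if not first:
            out.write(",\n")  # separator before every feature except the first
        out.write(line)
        first = False
    out.write("\n]}\n")


# Small in-memory demonstration; in a pipeline one would pass sys.stdin and sys.stdout.
features = io.StringIO(
    '{"type":"Feature","tippecanoe":{"minzoom":13}}\n'
    '{"type":"Feature","tippecanoe":{"minzoom":13}}\n'
)
result = io.StringIO()
reassemble(features, result)
document = json.loads(result.getvalue())
```

    Like the awk script, this never parses the individual features, so memory use stays proportional to a single line rather than to the whole file.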
    

    Performance comparison

    For comparison, an input file with 10,000,000 objects in .features[] was used. Its size is about 1GB.

    CPU time (user + sys, i.e. u+s):

    jq-only:              15m 15s
    jq-awk:                7m 40s
    jq one-pass using map: 6m 53s
    
  • 2021-01-24 07:38

    In this case, map rather than map_values is far faster (*):

    .features |= map(.tippecanoe.minzoom = 13)
    

    However, this approach still requires enough RAM to hold the entire document in memory.

    p.s. If you want to use jq to generate a large file for timing, consider:

    def N: 1000000;
    
    def data:
       {"features": [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };
    

    (*) Using map, 20s for 100MB, and approximately linear.

  • 2021-01-24 07:52

    Here, based on the work of @nicowilliams on GitHub, is a solution that uses jq's streaming parser. It is very economical with memory, but is currently quite slow if the input is large.

    The solution has two parts: a function for injecting the update into the stream produced using the --stream command-line option; and a function for converting the stream back to JSON in the original form.

    Invocation:

    jq -cnr --stream -f program.jq input.json
    

    program.jq

    # inject the given object into the stream produced from "inputs" with the --stream option
    def inject(object):
      [object|tostream] as $object
      | 2
      | truncate_stream(inputs)
      | if (.[0]|length == 1) and length == 1
        then $object[]
        else .
        end ;
    
    # Input: the object to be added
    # Output: text
    def output:
      . as $object
      | ( "[",
          foreach fromstream( inject($object) ) as $o
            (0;
             if .==0 then 1 else 2 end;
             if .==1 then $o else ",", $o end),
          "]" ) ;
    
    {}
    | .tippecanoe.minzoom = 13
    | output
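    To make the event manipulation in inject easier to follow, here is a small Python emulation of the idea. It is not jq's implementation: to_stream, truncate and inject are illustrative stand-ins for tostream, truncate_stream and the inject function above, and the injected object is assumed to be non-empty.

```python
def to_stream(value, path=()):
    """Yield jq-style stream events for a JSON value built from dicts and lists."""
    if isinstance(value, dict) and value:
        items = list(value.items())
    elif isinstance(value, list) and value:
        items = list(enumerate(value))
    else:
        # Scalars and empty containers are leaves: [path, value].
        yield [list(path), value]
        return
    for key, child in items:
        yield from to_stream(child, path + (key,))
    # A one-element event marks the end of a non-empty container.
    yield [list(path) + [items[-1][0]]]


def truncate(events, depth):
    """Drop `depth` leading path components, discarding events that are too shallow."""
    for event in events:
        if len(event[0]) > depth:
            yield [event[0][depth:]] + event[1:]


def inject(events, extra):
    """Replace each top-level closing event with the events of `extra`."""
    extra_events = list(to_stream(extra))
    for event in events:
        if len(event) == 1 and len(event[0]) == 1:
            yield from extra_events
        else:
            yield event


doc = {"features": [{"type": "Feature", "properties": {"FEATCODE": 15014}}]}
events = list(inject(truncate(to_stream(doc), 2), {"tippecanoe": {"minzoom": 13}}))
for event in events:
    print(event)
```

    The closing event of each feature is swapped for the events of the extra object, whose own final closing event then terminates the enlarged feature. That is why fromstream can rebuild each feature with the tippecanoe key appended.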
    

    Generation of test data

    def data(N):
     {"features":
      [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };
    

    Example output

    With N=2:

    [
    {"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
    ,
    {"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
    ]
    