Process a huge GeoJSON file with jq


Given a GeoJSON file as follows:

{
  "type": "FeatureCollection",
  "features": [
   {
     "type": "Feature",
     "properties": {
     "FEATCODE


                      
4 Answers

逝去的感伤
2021-01-24 07:33

An alternative solution could be for example:

jq '.features |= map_values(.tippecanoe.minzoom = 13)'


To test this, I created sample JSON with the following Python expression:

d = {'features': [{"type":"Feature", "properties":{"FEATCODE": 15014}} for i in range(0,N)]}


and inspected the execution time as a function of N. Interestingly, the map_values approach appears to scale linearly in N, whereas .features[].tippecanoe.minzoom = 13 exhibits quadratic behavior: already for N=50000, the former finishes in about 0.8 seconds, while the latter needs around 47 seconds.
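
The measurement described above could be reproduced along these lines. This is only a minimal sketch, not the script used for the numbers quoted: build_sample, time_jq and run_benchmark are hypothetical helper names, sample.json is an arbitrary path, and wall-clock time is measured, which may differ from how the original timings were taken. It assumes jq is on the PATH.

```python
import json
import shutil
import subprocess
import time


def build_sample(n):
    """Build the sample document with n identical features."""
    return {
        "features": [
            {"type": "Feature", "properties": {"FEATCODE": 15014}}
            for _ in range(n)
        ]
    }


def time_jq(jq_filter, path):
    """Return wall-clock seconds for `jq -c FILTER PATH`, or None if jq is missing."""
    if shutil.which("jq") is None:
        return None
    start = time.perf_counter()
    subprocess.run(["jq", "-c", jq_filter, path],
                   stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


def run_benchmark(sizes, path="sample.json"):
    """Time both filters for each N in sizes; returns (N, filter, seconds) rows."""
    filters = [
        ".features |= map_values(.tippecanoe.minzoom = 13)",
        ".features[].tippecanoe.minzoom = 13",
    ]
    rows = []
    for n in sizes:
        with open(path, "w") as fh:
            json.dump(build_sample(n), fh)
        for jq_filter in filters:
            rows.append((n, jq_filter, time_jq(jq_filter, path)))
    return rows
```

Calling run_benchmark([1000, 5000, 25000]) and comparing how the timings grow with N should make any difference in scaling between the two filters visible.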

Alternatively, one might just do it manually with, e.g., Python:

import json
import sys

# Usage: python <this script> input.json output.json
with open(sys.argv[1], 'r') as F:
    data = json.load(F)

# Set tippecanoe.minzoom on every feature, keeping any existing tippecanoe keys
for feature in data['features']:
    feature.setdefault("tippecanoe", {})["minzoom"] = 13

with open(sys.argv[2], 'w') as F:
    json.dump(data, F)

                                                                        
                                                        
            
            
              
                
夕颜
2021-01-24 07:37

A one-pass jq-only approach may require more RAM than is available. In that case, two alternatives are shown below: a simple all-jq pipeline, and a more economical one that combines jq with awk.

The two approaches are the same except for the reconstitution of the stream of objects into a single JSON document.  This step can be accomplished very economically using awk.

In both cases, the large JSON input file with objects of the required form is assumed to be named input.json.

jq-only

jq -c  '.features[]' input.json |
    jq -c '.tippecanoe.minzoom = 13' |
    jq -c -s '{type: "FeatureCollection", features: .}'


jq and awk

jq -c '.features[]' input.json |
   jq -c '.tippecanoe.minzoom = 13' | awk '
     BEGIN {print "{\"type\": \"FeatureCollection\", \"features\": ["; }
     NR==1 { print; next }
           {print ","; print}
     END   {print "] }";}'
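
For readers without awk, the same reconstitution step can be sketched in Python. This is an illustration under assumptions, not part of the original answer: wrap_features is a hypothetical name, and the input is assumed to be one compact JSON feature per line, as produced by jq -c. Like the awk version, it only ever holds one line in memory.

```python
import io
import json


def wrap_features(lines, out):
    """Write a single FeatureCollection to `out` from one-feature-per-line JSON text."""
    out.write('{"type": "FeatureCollection", "features": [\n')
    first = True
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the input
        if not first:
            out.write(",\n")  # separator goes before every feature except the first
        out.write(line)
        first = False
    out.write("\n] }\n")


# Small in-memory demonstration; the two strings stand in for jq output lines
buf = io.StringIO()
wrap_features(['{"id": 1}', '{"id": 2}'], buf)
print(json.loads(buf.getvalue())["features"])
```

In a pipeline, the call would be wrap_features(sys.stdin, sys.stdout) (after import sys) in place of the awk stage.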


Performance comparison

For comparison, an input file with 10,000,000 objects in .features[] was used. Its size is about 1GB.

CPU time (u+s, i.e. user + system):

jq-only:              15m 15s
jq-awk:                7m 40s
jq one-pass using map: 6m 53s

                                                                        
                                                        
            
            
              
                
南旧
2021-01-24 07:38

In this case, map rather than map_values is far faster (*): 

.features |= map(.tippecanoe.minzoom = 13)


However, this approach still requires enough RAM to hold the entire document.

p.s. If you want to use jq to generate a large file for timing, consider:

def N: 1000000;

def data:
   {"features": [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };


(*) Using map, 20s for 100MB, and approximately linear.
                                                                        
                                                        
            
            
              
                
我寻月下人不归
2021-01-24 07:52

Here, based on the work of @nicowilliams at GitHub, is a solution that uses the streaming parser available with jq.  The solution is very economical with memory, but is currently quite slow if the input is large.

The solution has two parts: a function for injecting the update into the stream produced using the --stream command-line option; and a function for converting the stream back to JSON in the original form.
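
To make the stream form concrete, here is a small Python sketch of the event shape, as I understand jq's tostream and --stream output: each leaf becomes a [path, value] event, and each non-empty array or object is followed by a closing [path] event that carries the path of its last child. The name to_stream is hypothetical, and the sketch works on an in-memory value, whereas jq's streaming parser emits these events incrementally while reading the file.

```python
def to_stream(value, path=()):
    """Yield [path, leaf] events plus a closing [path] event per non-empty container."""
    if isinstance(value, dict):
        items = list(value.items())
    elif isinstance(value, list):
        items = list(enumerate(value))
    else:
        items = None
    if not items:
        # Scalars and empty containers are emitted as leaves
        yield [list(path), value]
        return
    for key, child in items:
        yield from to_stream(child, path + (key,))
    # Closing event: the path of the last child of this container
    yield [list(path) + [items[-1][0]]]


for event in to_stream({"tippecanoe": {"minzoom": 13}}):
    print(event)
```

The last event of a top-level object is a closing event whose path has length 1. That appears to be the shape the test in inject relies on: once truncate_stream has stripped the two leading path components, an event consisting of a single path of length 1 marks the end of a feature, and the events of the object to be added are emitted in its place.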

Invocation:

jq -cnr --stream -f program.jq input.json


program.jq

# inject the given object into the stream produced from "inputs" with the --stream option
def inject(object):
  [object|tostream] as $object
  | 2
  | truncate_stream(inputs)
  | if (.[0]|length == 1) and length == 1
    then $object[]
    else .
    end ;

# Input: the object to be added
# Output: text
def output:
  . as $object
  | ( "[",
      foreach fromstream( inject($object) ) as $o
        (0;
         if .==0 then 1 else 2 end;
         if .==1 then $o else ",", $o end),
      "]" ) ;

{}
| .tippecanoe.minzoom = 13
| output


Generation of test data

def data(N):
 {"features":
  [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };


Example output

With N=2:

[
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
,
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
]

                                                                        
                                                        
            
            
              
                