Flatten nested JSON using jq

独厮守ぢ 2021-02-04 09:58

I'd like to flatten a nested JSON object, e.g. {"a":{"b":1}} to {"a.b":1}, in order to ingest it into Solr.

I have 11 TB of json files whi

4 answers
  • 2021-02-04 10:38

    You can also use the following jq command to flatten nested JSON objects in this manner:

    [leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
    

    The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
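
    To make those steps concrete, here is a rough Python model of what the filter does. It is only a sketch of the semantics, not of how jq is implemented: the helper names leaf_paths, getpath and flatten are my own, the input is assumed to be a JSON object, and every non-container value is treated as a leaf (how jq itself handles null and false may differ, so check that against your jq version).

```python
def leaf_paths(value, path=()):
    """Yield the path (a tuple of keys and indices) of every non-container value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, path + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from leaf_paths(child, path + (index,))
    else:
        yield path


def getpath(value, path):
    """Follow a path of keys and indices down into a nested value."""
    for step in path:
        value = value[step]
    return value


def flatten(doc):
    # Step 1: build one {"key", "value"} entry per leaf path.
    entries = [
        {"key": ".".join(str(step) for step in path), "value": getpath(doc, path)}
        for path in leaf_paths(doc)
    ]
    # Step 2: the from_entries step, turning the entries into a single object.
    return {entry["key"]: entry["value"] for entry in entries}


print(flatten({"a": {"b": 1}}))  # {'a.b': 1}
```

    The intermediate list of entries is exactly the part that the reduce-based variants skip by writing into the result directly.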

  • 2021-02-04 10:41

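    As it turns out, Solr does this flattening itself when the document is posted to the /update/json/docs endpoint. Running curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file indexes the document as: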

    {
        "a.b":[1],
        "id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
        "_version_":1535841499921514496
    }
    

    EDIT 1: This was with Solr 6.0.1, started with bin/solr -e cloud. The collection name is flat; everything else is left at the defaults (including the data-driven schema, which is also the default).

    EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.

    EDIT 3: It is also possible to parallelize the upload with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-" where -P is the parallelism factor. I used jq to set an id so that uploading the same document more than once does not create duplicates in the collection (duplicates did appear while I was searching for the optimal value of -P).

  • 2021-02-04 10:54

    This is just a variant of Santiago's jq:

    . as $in 
    | reduce leaf_paths as $path ({};
         . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
    

    It avoids the overhead of the key/value construction and destruction.

    (If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)

    Two important points about both these jq solutions:

    1. Arrays are also flattened. E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:

      {
        "a.b.0": 0,
        "a.b.1": 1,
        "a.b.2": 2
      }
      
    2. If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:

      {"a.b":0, "a": {"b": 1}}
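
    Both points can be reproduced with a small Python sketch that accumulates dotted keys straight into one result, in the spirit of the reduce variant. The function name flatten is my own, and this models the behavior rather than jq's implementation. In particular, which of the two colliding values survives depends on traversal order; in this sketch the leaf visited last wins.

```python
def flatten(doc):
    """Accumulate dotted keys directly into one dict, without intermediate entries."""
    flat = {}

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, path + [str(key)])
        elif isinstance(value, list):
            # Array indices become path components, so arrays are flattened too.
            for index, child in enumerate(value):
                walk(child, path + [str(index)])
        else:
            # A later leaf with the same dotted key silently replaces an earlier one.
            flat[".".join(path)] = value

    walk(doc, [])
    return flat


# Point 1: arrays are flattened along with objects.
print(flatten({"a": {"b": [0, 1, 2]}}))  # {'a.b.0': 0, 'a.b.1': 1, 'a.b.2': 2}

# Point 2: a key containing a period collides with a nested path, losing a value.
collided = flatten({"a.b": 0, "a": {"b": 1}})
print(collided)  # {'a.b': 1}
```

    If collisions matter for your data, a separator that cannot occur in the original keys avoids the problem.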
      
  • 2021-02-04 10:54

    Here is a solution that uses tostream, select, join, reduce and setpath:

      reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
         {}
         ; setpath($p; $v)
      )
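
    For readers unfamiliar with tostream, the following Python sketch models the idea. It is an approximation based on my understanding of jq's streaming form, not jq's actual code: each leaf produces a two-element [path, value] event, each non-empty container additionally produces a one-element closing event, and the select(length==2) step keeps only the former. The function names are my own, and treating empty containers as leaves is an assumption.

```python
def tostream(value, path=()):
    """Yield [path, leaf] events plus one-element closing events for containers."""
    if isinstance(value, dict) and value:
        children = list(value.items())
    elif isinstance(value, list) and value:
        children = list(enumerate(value))
    else:
        # Scalars (and, by assumption, empty containers) are emitted as leaves.
        yield [list(path), value]
        return
    for key, child in children:
        yield from tostream(child, path + (key,))
    # Closing event: the path of the last child of this container.
    yield [list(path) + [children[-1][0]]]


def flatten(doc):
    flat = {}
    for event in tostream(doc):
        if len(event) == 2:  # the select(length==2) step: drop closing events
            path, value = event
            # Joining the path into a single key mirrors .[0] |= [join(".")].
            flat[".".join(str(step) for step in path)] = value
    return flat


events = list(tostream({"a": {"b": 1}}))
print(events)  # [[['a', 'b'], 1], [['a', 'b']], [['a']]]
print(flatten({"a": {"b": 1}}))  # {'a.b': 1}
```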
    