Question
I have around 96 gzip files of JSON, which amount to over 350 GB of JSON after unzipping, with the following structure:
{
  "structe": {},
  "beta": {},
  "flow": {
    "1023": {
      "0101": {
        "-LEjllNyHqdHYGntO6vu": {
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "status": "1",
          "t": 1528736192996
        }
      },
      "0102": {
        "-LEjllNyHqdHYGntO6vu": {
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "status": "1",
          "t": 1528736192996
        }
      }
    },
    "1024": {
      "0103": {
        "-LEjllNyHqdHYGntO6vu": {
          "lat": 51.128676733981,
          "lng": -113.9318991267252,
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "lat": 51.128676733981,
          "lng": -113.9318991267252,
          "status": "1",
          "t": 1528736192996
        }
      }
    }
  }
}
I can't load this into RAM. Now I want to stream the file and pull the path flow -> 1023 (call it id1) -> 0101 (call it id2) into a new id1_id2.json file. Any thoughts on how I can do this quickly?
The output I am looking for is:
File name = 1023_0101.json
{
  "-LEjllNyHqdHYGntO6vu": {
    "status": "1",
    "t": 1528736191996
  },
  "-LEjllcXKaVOQu3BDpHF": {
    "status": "1",
    "t": 1528736192996
  }
}
Answer 1:
Here's a solution that uses jq's streaming parser to produce a stream consisting of $id1, $id2, and the corresponding value of interest; this stream can then be piped into another tool (e.g. awk if that's convenient) to produce the desired files.
In the following, we use atomize from the jq cookbook:
def atomize(s):
  fromstream(foreach s as $in ( {previous:null, emit: null};
    if ($in | length == 2) and ($in|.[0][0]) != .previous and .previous != null
    then {emit: [[.previous]], previous: $in|.[0][0]}
    else { previous: ($in|.[0][0]), emit: null}
    end;
    (.emit // empty), $in) ) ;
The main jq program (invoked with --stream -n -c) is then simply:
atomize(inputs)
| select(type == "object" and .flow)
| .flow
| keys_unsorted[] as $id1
| (.[$id1] | keys_unsorted[]) as $id2
| $id1, $id2, .[$id1][$id2]
So for each gzip file, $gz, the pipeline would look like this:
gunzip -c $gz | jq -nc --stream -f program.jq | awk ....
For an example of using awk to produce the desired result, see "jq, split a huge json of array and save into file named with a value".
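For reference, here is a minimal awk sketch of that last step (my own, not the linked answer's code); it assumes jq -c emits each triple as three lines: the two quoted id strings followed by the compact object.
# split.awk (hypothetical name): consume the jq -c output in groups of three
# lines: $id1, $id2, and the object to write out.
NR % 3 == 1 { id1 = $0; gsub(/"/, "", id1); next }   # strip the JSON quotes
NR % 3 == 2 { id2 = $0; gsub(/"/, "", id2); next }
{
    file = id1 "_" id2 ".json"
    print $0 >> file          # append, in case the same id1/id2 pair recurs
    close(file)               # avoid running out of open file descriptors
}
Driving it over all the archives could then look like this (file names here are assumptions):
for gz in *.json.gz; do gunzip -c "$gz" | jq -nc --stream -f program.jq | awk -f split.awk; done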
Caveat and Addendum
jq's streaming parser avoids using RAM at the cost of speed, so the --stream option is usually only used as a last resort. From the description of the problem, it looks like you might be able to process some of the zipped files using jq's regular parser, so you might want to process those files speedily, leaving the "atomize" approach for the files that are too big.
Caution
The problem description does not make it clear what should be done if there is an id1_id2.json collision. If there is no possibility of such a collision, then of course there's no problem. Otherwise, it would be up to the program that creates those files to manage that contingency.
Answer 2:
You can use jq with the --stream option (see "jq - I/O (Streaming)"), which reads input in a streaming fashion, allowing programs to start processing large JSON texts immediately rather than after the parse completes, instead of storing the entire file in RAM.
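To illustrate what the streaming parser emits (a small example of my own, not part of the answer): each scalar becomes a [path, leaf] event, followed by closing events for the containers:
echo '{"a":{"b":1}}' | jq -c --stream .
[["a","b"],1]
[["a","b"]]
[["a"]]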
Assuming your input id strings are stored in shell variables:
id1=1023; id2=0101
Pipe the uncompressed output of your huge gzip file to the following filter:
jq -n --arg v1 "$id1" --arg v2 "$id2" --stream 'fromstream(inputs) | objects | .flow[$v1][$v2]' > "$id1"_"$id2".json
Or, if the id names can't be pre-fetched and you need to supply them on the fly, use the key names directly:
jq -n --stream 'fromstream(inputs) | objects | .flow."1023"."0101"'
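Putting it together for a single archive (the file name below is an assumption), the whole pipeline would be:
gunzip -c part-001.json.gz | jq -n --arg v1 "$id1" --arg v2 "$id2" --stream 'fromstream(inputs) | objects | .flow[$v1][$v2]' > "${id1}_${id2}.json"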
Answer 3:
What first comes to my mind is treating the file as a stream and reading it incrementally. There are already some libraries that treat JSON files as streams. For example, check out this snippet from the ijson library:
For JSON like:
{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city", "info": { ... }},
      {"name": "Thames", "type": "river", "info": { ... }},
      // ...
    ],
    "america": [
      {"name": "Texas", "type": "state", "info": { ... }},
      // ...
    ]
  }
}
Processing it would look like this:
from urllib.request import urlopen  # needed for urlopen (not shown in the original snippet)

import ijson

# "stream" is assumed to be an already-open, writable file-like object.
parser = ijson.parse(urlopen('http://.../'))
stream.write('<geo>')
for prefix, event, value in parser:
    if (prefix, event) == ('earth', 'map_key'):
        stream.write('<%s>' % value)
        continent = value
    elif prefix.endswith('.name'):
        stream.write('<object name="%s"/>' % value)
    elif (prefix, event) == ('earth.%s' % continent, 'end_map'):
        stream.write('</%s>' % continent)
stream.write('</geo>')
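As a rough adaptation to the question (my own sketch, not from the answer), ijson can also be pointed directly at the flow -> id1 -> id2 prefix of one unzipped member; the file name and ids below are assumptions.
import gzip
import json
import ijson

id1, id2 = "1023", "0101"

with gzip.open("part-001.json.gz", "rb") as f:
    # ijson.items streams the document and yields each value found at the
    # given prefix, without loading the whole file into memory.
    for obj in ijson.items(f, "flow.%s.%s" % (id1, id2)):
        with open("%s_%s.json" % (id1, id2), "w") as out:
            json.dump(obj, out)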
Source: https://stackoverflow.com/questions/58408121/stream-parse-huge-json-file-into-small-files