If the file contains one large JSON document (a single array or object), then per the JSON spec you must parse the entire document before you can access any of its components.
If, for instance, the file is an array of objects, [ {...}, {...} ],
then newline-delimited JSON is far more efficient: you only have to keep one object in memory at a time, and the parser can begin processing as soon as it has read one line.
If you need to keep track of some of the objects for later use during parsing, I suggest creating a dict
to hold those running values as you iterate through the file.
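To make the contrast concrete, here is a minimal sketch (the file name `events.ndjson` is just a placeholder): each line of a newline-delimited file parses independently, so you never hold more than one record in memory.

```python
import json

# Write a small newline-delimited JSON file (placeholder name)
with open('events.ndjson', 'w') as f:
    for i in range(3):
        f.write(json.dumps({'id': i}) + '\n')

# Stream it back: only one object is ever in memory at a time
count = 0
with open('events.ndjson') as f:
    for line in f:
        obj = json.loads(line)  # parses just this one line
        count += 1

print(count)
```

With a single top-level array you would instead need `json.load(f)`, which reads and parses the whole file before returning anything.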
Say you have newline-delimited JSON like this:
{"timestamp": 1549480267882, "sensor_val": 1.6103881016325283}
{"timestamp": 1549480267883, "sensor_val": 9.281329310309406}
{"timestamp": 1549480267883, "sensor_val": 9.357327083443344}
{"timestamp": 1549480267883, "sensor_val": 6.297722749124474}
{"timestamp": 1549480267883, "sensor_val": 3.566667175421604}
{"timestamp": 1549480267883, "sensor_val": 3.4251473635178655}
{"timestamp": 1549480267884, "sensor_val": 7.487766674770563}
{"timestamp": 1549480267884, "sensor_val": 8.701853236245032}
{"timestamp": 1549480267884, "sensor_val": 1.4070662393018396}
{"timestamp": 1549480267884, "sensor_val": 3.6524325449499995}
{"timestamp": 1549480455646, "sensor_val": 6.244199614422415}
{"timestamp": 1549480455646, "sensor_val": 5.126780276231609}
{"timestamp": 1549480455646, "sensor_val": 9.413894020722314}
{"timestamp": 1549480455646, "sensor_val": 7.091154829208067}
{"timestamp": 1549480455647, "sensor_val": 8.806417239029447}
{"timestamp": 1549480455647, "sensor_val": 0.9789474417767674}
{"timestamp": 1549480455647, "sensor_val": 1.6466189633300243}
You can process it with:
import json
from collections import deque

# RingBuffer adapted from https://www.daniweb.com/programming/software-development/threads/42429/limit-size-of-a-list
class RingBuffer(deque):
    def __init__(self, size):
        deque.__init__(self)
        self.size = size

    def full_append(self, item):
        deque.append(self, item)
        # buffer is full: pop the oldest (left-most) item
        self.popleft()

    def append(self, item):
        deque.append(self, item)
        # max size reached: append becomes full_append
        if len(self) == self.size:
            self.append = self.full_append

    def get(self):
        """Return the buffered items (newest last) as a list."""
        return list(self)

def proc_data():
    # Keep whatever running state you need in memory
    # as you iterate through the objects
    metrics = {
        'latest_timestamp': 0,
        'last_3_samples': RingBuffer(3)
    }
    with open('test.json', 'r') as infile:
        for line in infile:
            # Parse each line as its own JSON document
            record = json.loads(line)
            # Update your running metrics
            metrics['last_3_samples'].append(record['sensor_val'])
            if record['timestamp'] > metrics['latest_timestamp']:
                metrics['latest_timestamp'] = record['timestamp']
    return metrics

print(proc_data())
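As a side note, the standard library can handle the ring-buffer part for you: `deque` accepts a `maxlen` argument and silently discards the oldest item once it is full, so the hand-rolled class above could be replaced by a plain `deque(maxlen=3)`. A minimal sketch:

```python
from collections import deque

# A deque with maxlen keeps only the newest N items automatically
last_3 = deque(maxlen=3)
for val in [1.6, 9.2, 9.3, 6.2, 3.5]:
    last_3.append(val)

print(list(last_3))  # [9.3, 6.2, 3.5]
```

The custom class is still useful if you want the `get()` method or the explicit size bookkeeping, but for just "last N values" the `maxlen` form is shorter and harder to get wrong.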