Using python ijson to read a large json file with multiple json objects

走了就别回头了 2021-02-19 07:45

I'm trying to parse a large (~100MB) JSON file using the ijson package, which lets me work with the file in a memory-efficient way. However, after writing some code like this,

2 answers
  • 2021-02-19 08:09

    Unfortunately, the ijson library (v2.3 as of March 2018) does not handle parsing multiple JSON objects. It can only handle one top-level object, and if you attempt to parse a second one you get the error "ijson.common.JSONError: Additional data". See the bug reports here:

    • https://github.com/isagalaev/ijson/issues/40
    • https://github.com/isagalaev/ijson/issues/42
    • https://github.com/isagalaev/ijson/issues/67
    • python: how do I parse a stream of json arrays with ijson library

    It's a big limitation. However, as long as each JSON object is followed by a line break (a newline character), you can parse the objects independently, one line at a time, like this:

    import io
    import ijson

    with open(filename, encoding="UTF-8") as json_file:
        cursor = 0  # running offset (in characters) of the current line
        for line_number, line in enumerate(json_file):
            print("Processing line", line_number + 1, "at cursor index:", cursor)
            line_as_file = io.StringIO(line)
            # Use a new parser for each line
            json_parser = ijson.parse(line_as_file)
            # "event" avoids shadowing the built-in type()
            for prefix, event, value in json_parser:
                print("prefix=", prefix, "type=", event, "value=", value)
            cursor += len(line)
    

    You are still streaming the file rather than loading it entirely into memory, so this works on large JSON files. It uses the line streaming technique from "How to jump to a particular line in a huge text file?" and enumerate() from "Accessing the index in 'for' loops?"
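If the objects are not separated by line breaks, one possible fallback is the standard library's `json.JSONDecoder.raw_decode`, which parses a single value and reports where it ended. The sketch below is only an illustration under the assumption that the text being split fits in memory (so it suits chunks, not a whole 100MB file at once); the helper name `iter_concatenated_json` and the sample string are hypothetical.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value found in `text`, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the parsed value and the index just past its end.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical input: three objects with no reliable line breaks between them.
chunk = '{"a": 1}{"b": [2, 3]}\n  {"c": null}'
objects = list(iter_concatenated_json(chunk))
print(objects)
```

Unlike the line-by-line approach, this does not depend on where the newlines fall, at the cost of holding the text being split in memory.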

  • 2021-02-19 08:12

    Since the provided chunk looks like a set of lines, each holding an independent JSON object, it should be parsed accordingly:

    # Each JSON object is small, so there is no need for iterative parsing
    import json

    with open(filename, 'r') as f:
        for line in f:
            data = json.loads(line)
            # data['name'], data['engine_speed'], data['timestamp'] now
            # contain the corresponding values
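A self-contained sketch of the same line-by-line idea, written as a generator that skips blank lines and reports which line failed to parse. The helper name `iter_json_lines` and the sample records are hypothetical; only the key names come from the snippet above, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json


def iter_json_lines(stream):
    """Yield (line_number, parsed_object) for each non-blank line of `stream`."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield line_number, json.loads(line)
        except json.JSONDecodeError as exc:
            # Re-raise with the line number so the bad record is easy to find.
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc


# Hypothetical sample data; a real file object from open() works the same way.
sample = io.StringIO(
    '{"name": "sensor-a", "engine_speed": 772, "timestamp": 1.5}\n'
    "\n"
    '{"name": "sensor-b", "engine_speed": 801, "timestamp": 2.5}\n'
)

records = list(iter_json_lines(sample))
for line_number, data in records:
    print(line_number, data["name"], data["engine_speed"], data["timestamp"])
```

Because the generator reads one line at a time, memory use stays proportional to the longest line rather than to the file size.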
    