Writing to JSON file, then reading this same file and getting “JSONDecodeError: Extra data”

暗喜 2021-01-29 02:16

I have a very large JSON file (9GB). I'm reading in one object from it at a time, and then deleting key-value pairs in this object when the key is not in the list fields

2 Answers
  • 2021-01-29 03:01

    Here's code that seems to work with your sample input. As I said in a comment, the file you are dealing with is in JSON Lines format (one JSON object per line) rather than standard JSON format.

    Since you appear to want the cleaned version in that same format (in other words, not converted to standard JSON format, as I thought at one point), here's how to do that:

    import json
    
    path_to_file = "sample_input.json"
    cleaned_file = "cleaned.json"
    
    # Fields to keep.
    fields = ["skills", "industry", "summary", "education", "experience"]
    
    # Clean profiles in JSON Lines format file.
    with open(path_to_file, encoding='UTF8') as inf, \
         open(cleaned_file, 'w', encoding='UTF8') as outf:
    
        for line in inf:
            profile = json.loads(line)  # Read a profile object.
            for key in list(profile.keys()):  # Copy the keys so we can delete while iterating.
                if key not in fields:
                    del profile[key]  # Remove unwanted fields.
            outf.write(json.dumps(profile) + '\n')  # Write cleaned profile to new file.
    
    # Test whether it worked.
    with open(cleaned_file, encoding='UTF8') as cleaned:
        for line in cleaned:
            profile = json.loads(line)
            print(json.dumps(profile, indent=4))
    
  • 2021-01-29 03:07

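    You are basically dumping a new JSON object into the file every time you call json.dump(profile, f). That does not produce a valid JSON document, since the objects are not embedded in an enclosing structure: you get {}{} instead of [{},{}]. A parser reading the file as a single document finishes the first object, finds more text after it, and raises "Extra data".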

    As for a solution: at 9GB, holding everything in memory while reading or writing is a bad idea. I would probably try the library https://pypi.org/project/jsonstreams/ or something similar.
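
To make the point about {}{} concrete, here is a minimal sketch using only the standard library. It reproduces the "Extra data" error on two objects written back to back, then reads them one at a time with json.JSONDecoder.raw_decode. The sample profiles and the helper name iter_concatenated are made up for illustration. Note that this version works on a string already in memory, so for a 9GB file it only shows the idea; you would still need line-by-line or chunked reading.

```python
import json

# Two objects written back to back, as repeated json.dump(profile, f) calls would produce.
# The sample data is invented for this illustration.
concatenated = json.dumps({"skills": ["python"]}) + json.dumps({"industry": "tech"})

# Parsing the whole text as one document fails right after the first object.
try:
    json.loads(concatenated)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg


def iter_concatenated(text):
    """Yield each JSON value from a string of back-to-back JSON documents.

    Hypothetical helper, not part of the json module.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode expects the value to start exactly at pos, so skip whitespace first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


profiles = list(iter_concatenated(concatenated))
print(error_message)
print(profiles)
```

If you control how the file is written, emitting one object per line (as in the first answer) is simpler, because each line can then be passed to json.loads on its own.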
