I have a very large JSON file (9GB). I'm reading in one object from it at a time, and then deleting key-value pairs in this object when the key is not in the list fields.
Here's code that seems to work with your sample input. As I said in a comment, the file you are dealing with is in a format called JSON Lines rather than standard JSON: each line is a complete JSON object on its own.
Since you appear to want the cleaned version in that same format (in other words, not converted to standard JSON, as I thought at one point), here's how to do that:
import json
path_to_file = "sample_input.json"
cleaned_file = "cleaned.json"
# Fields to keep.
fields = ["skills", "industry", "summary", "education", "experience"]
# Clean profiles in JSON Lines format file.
with open(path_to_file, encoding='UTF8') as inf, \
     open(cleaned_file, 'w', encoding='UTF8') as outf:
    for line in inf:
        profile = json.loads(line)  # Read a profile object.
        for key in list(profile.keys()):  # Remove unwanted fields from it.
            if key not in fields:
                del profile[key]
        outf.write(json.dumps(profile) + '\n')  # Write cleaned profile to new file.
# Test whether it worked.
with open(cleaned_file, encoding='UTF8') as cleaned:
    for line in cleaned:
        profile = json.loads(line)
        print(json.dumps(profile, indent=4))
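
If you'd rather not mutate each profile in place, the same cleanup can be sketched as a small reusable function that builds a new dict per line. This is only a sketch under a few assumptions: the names keep_fields, clean_jsonl and FIELDS are hypothetical, the field list is copied from the code above, and blank lines in the input are assumed to be safe to skip. It still reads one line at a time, so the 9GB file is never loaded into memory at once.

```python
import json

# Hypothetical names for illustration; the field list is the one used above.
FIELDS = {"skills", "industry", "summary", "education", "experience"}


def keep_fields(profile, fields=FIELDS):
    """Return a new dict holding only the keys listed in fields."""
    return {key: value for key, value in profile.items() if key in fields}


def clean_jsonl(src_path, dst_path, fields=FIELDS):
    """Stream a JSON Lines file and write one cleaned object per line.

    Returns the number of profiles written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as inf, \
         open(dst_path, "w", encoding="utf-8") as outf:
        for line in inf:
            if not line.strip():  # Assumption: blank lines can be skipped.
                continue
            cleaned = keep_fields(json.loads(line), fields)
            outf.write(json.dumps(cleaned) + "\n")
            count += 1
    return count
```

Calling clean_jsonl("sample_input.json", "cleaned.json") should produce the same kind of output file as the loop above. Using a set for the wanted fields makes each membership test constant time, which can add up over millions of lines.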