Compressing A Series of JSON Objects While Maintaining Serial Reading?

梦谈多话 2020-12-30 15:53

I have a bunch of JSON objects that I need to compress, as they're eating too much disk space: approximately 20 GB for a few million of them.

2 Answers
  • 2020-12-30 16:30

    You might want to try an incremental JSON parser, such as jsaone.

    That is, create a single JSON object containing all your objects, and parse it like this:

    import gzip
    import jsaone  # pip install jsaone

    with gzip.GzipFile(file_path, 'r') as f_in:
        for key, val in jsaone.load(f_in):
            ...  # process each top-level (key, value) pair as it is parsed
    

    This is quite similar to Martijn's answer: it wastes slightly more space, but may be slightly more convenient.

    EDIT: oh, by the way, it's probably fair to clarify that I wrote jsaone.
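    The single combined JSON file that jsaone reads can itself be written incrementally with only the standard library, so the full object set never has to sit in memory. A minimal sketch, assuming one gzipped file holding a single top-level JSON object (the `write_combined` helper and filename are illustrative, not part of jsaone):

```python
import gzip
import json

def write_combined(path, items):
    """Stream-write one big JSON object {key: value, ...} to a gzipped file.

    items: an iterable of (key, value) pairs; only one pair is held in
    memory at a time.
    """
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        f.write('{')
        first = True
        for key, val in items:
            if not first:
                f.write(',')
            f.write(json.dumps(key) + ':' + json.dumps(val))
            first = False
        f.write('}')

# Usage: write two pairs, then read the whole object back to check it.
write_combined('objects.json.gz', [('a', {'x': 1}), ('b', {'x': 2})])
with gzip.open('objects.json.gz', 'rt', encoding='utf-8') as f:
    combined = json.load(f)
print(combined)  # {'a': {'x': 1}, 'b': {'x': 2}}
```

    Reading it back with plain `json.load` works for small files; for the multi-gigabyte case, that is exactly where an incremental parser like jsaone comes in.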

  • 2020-12-30 16:40

    Just use a gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line, and read them line by line.

    The object takes care of compression transparently, and buffers reads, decompressing chunks as needed.

    import gzip
    import json

    # writing: open in text mode ('wt') so str can be written directly
    # (in Python 3, gzip.GzipFile in 'w' mode expects bytes)
    with gzip.open(jsonfilename, 'wt', encoding='utf-8') as outfile:
        for obj in objects:
            outfile.write(json.dumps(obj) + '\n')

    # reading: text mode ('rt') yields str lines
    with gzip.open(jsonfilename, 'rt', encoding='utf-8') as infile:
        for line in infile:
            obj = json.loads(line)
            # process obj

    This has the added advantage that the compression algorithm can exploit repetition across objects, improving the compression ratio.
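    A quick way to see that effect, using only the standard library (the record layout below is made up for illustration): objects that share the same keys compress to a small fraction of their raw size.

```python
import gzip
import json

# 10,000 JSON objects sharing the same keys; only 'id' varies between them.
records = [{'id': i, 'status': 'active', 'tags': ['a', 'b']} for i in range(10000)]
raw = '\n'.join(json.dumps(r) for r in records).encode('utf-8')
packed = gzip.compress(raw)
print(len(packed) / len(raw))  # a small fraction of the raw size
```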
