Key: value store in Python for possibly 100 GB of data, without client/server

猫巷女王i  2020-12-25 14:26

There are many solutions to serialize a small dictionary: json.loads/json.dumps, pickle, shelve, ujson, or even sqlite. But when dealing with possibly 100 GB of data, those modules stop being practical, since they generally reload or rewrite the whole data set at once. Is there a key:value store that can handle this much data from Python, without requiring a client/server architecture?

6 Answers
  •  孤城傲影
    2020-12-25 15:20

    I would consider HDF5 for this. It has several advantages:

    • Usable from many programming languages.
    • Usable from Python via the excellent h5py package (see the sketch after this list).
    • Battle tested, including with large data sets.
    • Supports variable-length string values.
    • Values are addressable by a filesystem-like "path" (/foo/bar).
    • Values can be arrays (and usually are), but do not have to be.
    • Optional built-in compression.
    • Optional "chunking" to allow writing chunks incrementally.
    • Does not require loading the entire data set into memory at once.
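
    Here is a minimal sketch of how several of these points look with h5py, assuming h5py 3.x and NumPy are installed; the file name and paths are just for illustration:

        import h5py
        import numpy as np

        with h5py.File("store.h5", "w") as f:
            # A variable-length string value, addressable by a filesystem-like path.
            f.create_dataset("/foo/bar", data="hello world",
                             dtype=h5py.string_dtype())
            # A large array with built-in gzip compression and automatic chunking.
            f.create_dataset("/foo/big", data=np.arange(10_000_000),
                             compression="gzip", chunks=True)

        with h5py.File("store.h5", "r") as f:
            print(f["/foo/bar"][()])       # b'hello world' (bytes under h5py 3.x)
            part = f["/foo/big"][:1000]    # reads just a slice; the rest stays on disk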

    It does have some disadvantages too:

    • Extremely flexible, to the point of making it hard to define a single approach.
    • Complex format, not feasible to use without the official HDF5 C library (but there are many wrappers, e.g. h5py).
    • Baroque C/C++ API (though the Python API, h5py, is much cleaner).
    • Little support for concurrent writers (or writer + readers). Writes might need to lock at a coarse granularity.

    You can think of HDF5 as a way to store values (scalars or N-dimensional arrays) inside a hierarchy inside a single file (or indeed multiple such files). The biggest problem with just storing each value as its own file on disk would be that you'd overwhelm some filesystems; you can think of HDF5 as a filesystem within a file which won't fall down when you put a million values in one "directory."
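
    To make that concrete, here is a sketch (file and key names hypothetical) that treats one HDF5 group as a dict-like "directory" holding many small string values:

        import h5py

        # Populate one group with many keys; the file on disk stays a single file.
        with h5py.File("kv.h5", "w") as f:
            bucket = f.create_group("bucket")
            for i in range(10_000):
                bucket.create_dataset(f"key_{i}", data=f"value {i}",
                                      dtype=h5py.string_dtype())

        with h5py.File("kv.h5", "r") as f:
            print(len(f["bucket"]))         # 10000 keys in one "directory"
            print(f["bucket/key_42"][()])   # direct lookup by path, no full scan

    Each lookup touches only the requested value, so the whole store never has to fit in memory.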
