Persistent multiprocess shared cache in Python with stdlib or minimal dependencies

有刺的猬 2021-02-06 10:00

I just tried Python's shelve module as a persistent cache for data fetched from an external service. The complete example is here.

I was wondering what would the best

3 Answers
  • 2021-02-06 10:32

    Let's consider your requirements systematically:

    minimum or no external dependencies

    Your use case determines whether you can use in-band synchronisation (a file descriptor or memory region inherited across fork) or need out-of-band synchronisation (POSIX file locks, System V shared memory).

    Then you may have other requirements, e.g. cross-platform availability of the tools, etc.

    There really isn't much in the standard library beyond bare tools. One module stands out, however: sqlite3. SQLite uses fcntl/POSIX locks. There are performance limitations, though: multiple processes imply a file-backed database, and SQLite requires fdatasync on commit.

    Thus the number of transactions per second in SQLite is limited by your hard drive's rpm. That is not a big deal if you have hardware RAID, but it can be a major handicap on commodity hardware, e.g. a laptop, USB flash drive or SD card. Plan for ~100 tps if you use a regular rotating hard drive.

    Your processes can also block on SQLite if you use special transaction modes.
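
    As a rough illustration of the sqlite3 route, here is a minimal sketch of a file-backed key-value cache with per-entry expiry. The class name `SqliteCache`, the table layout and the use of pickle are my own assumptions for the example, not something from the question. Each process is expected to open its own connection to the same file.

```python
import pickle
import sqlite3
import time


class SqliteCache:
    """Tiny key-value cache with per-entry expiry (illustrative sketch)."""

    def __init__(self, path, timeout=5.0):
        # `timeout` is how long this connection waits on another process's
        # lock before raising sqlite3.OperationalError.
        self.conn = sqlite3.connect(path, timeout=timeout)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS cache ("
                "key TEXT PRIMARY KEY, value BLOB, expires REAL)"
            )

    def set(self, key, value, ttl):
        blob = pickle.dumps(value)
        with self.conn:  # one transaction, committed when the block exits
            self.conn.execute(
                "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                (key, blob, time.time() + ttl),
            )

    def get(self, key, default=None):
        row = self.conn.execute(
            "SELECT value, expires FROM cache WHERE key = ?", (key,)
        ).fetchone()
        # Missing and expired entries are treated the same way.
        if row is None or row[1] < time.time():
            return default
        return pickle.loads(row[0])
```

    Every `set` is its own committed transaction, so the commit cost discussed above applies per write. Unpickling is only acceptable here because all processes sharing the file are assumed to trust each other.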

    preventing thundering herd

    There are two major approaches for this:

    • probabilistically refresh cache item earlier than required, or
    • refresh only when required but block other callers

    Presumably if you trust another process with the cache value, you don't have any security considerations. Thus either will work, or perhaps a combination of both.
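
    For the first approach, one known formulation is probabilistic early expiration (sometimes called XFetch): each caller independently decides to refresh a little before expiry, with a probability that rises as expiry approaches. The sketch below is my own illustration of that idea; the function name `should_refresh` and its parameters are assumptions, and the `now` and `rng` arguments exist only to make it testable.

```python
import math
import random
import time


def should_refresh(expires_at, compute_time, beta=1.0, now=None,
                   rng=random.random):
    """Return True if this caller should recompute the value early.

    expires_at   -- absolute expiry timestamp of the cached value
    compute_time -- how long the last recomputation took, in seconds
    beta         -- values > 1 refresh earlier, values < 1 refresh later
    """
    now = time.time() if now is None else now
    # 1 - rng() lies in (0, 1], so log() is defined and <= 0. Negating it
    # gives a non-negative random head start that scales with compute_time.
    jitter = -compute_time * beta * math.log(1.0 - rng())
    return now + jitter >= expires_at
```

    A caller would read the cached entry, call `should_refresh`, and recompute the value if it returns True. Once the entry has actually expired the function always returns True, so this can be combined with the blocking approach as a fallback.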

  • 2021-02-06 10:44

    I wrote a locking (thread- and multiprocess-safe) wrapper around the standard shelve module with no external dependencies:

    https://github.com/cristoper/shelfcache

    It meets many of your requirements, but it does not have any sort of backoff strategy to prevent thundering herds, and if you want a reader-writer lock (so that multiple threads can read but only one can write) you have to provide your own RW lock.

    However, if I were to do it again I'd probably "just use sqlite". The shelve module abstracts over several different dbm implementations, which themselves abstract over various OS locking mechanisms, and that is a pain. For example, using the shelfcache flock option with gdbm on Mac OS X (or busybox) results in a deadlock.
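
    To give an idea of what locking around shelve can look like, here is a minimal Unix-only sketch. It is not how shelfcache is implemented; the helper name `locked_shelf` and the `.lock` sidecar file are assumptions made for the example. The lock is taken on a separate file rather than on the database file, on the assumption that this keeps it clear of whatever locking the dbm backend does itself.

```python
import fcntl
import shelve
from contextlib import contextmanager


@contextmanager
def locked_shelf(path, writeback=False):
    """Open a shelf while holding an exclusive lock on a sidecar file."""
    # Mode "a" creates the lock file if needed without truncating it.
    with open(path + ".lock", "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            with shelve.open(path, writeback=writeback) as shelf:
                yield shelf
        finally:
            # The shelf is already closed here, so its data is on disk
            # before another process can take the lock.
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

    Usage is `with locked_shelf("cache") as shelf: shelf["key"] = value`. Readers take the same exclusive lock in this sketch, so there is no reader-writer distinction.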

    There are several Python projects which try to provide a standard dict interface to sqlite or other persistent stores, e.g. https://github.com/RaRe-Technologies/sqlitedict

    (Note that sqlitedict is thread safe even for the same database connection, but it is not safe to share the same database connection between processes.)

  • 2021-02-06 10:47

    I'd say you'd want to use an existing caching library. dogpile.cache comes to mind: it already has many features, and you can easily plug in the backends you might need.

    The dogpile.cache documentation says:

    This “get-or-create” pattern is the entire key to the “Dogpile” system, which coordinates a single value creation operation among many concurrent get operations for a particular key, eliminating the issue of an expired value being redundantly re-generated by many workers simultaneously.
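
    To make the pattern concrete, here is a stdlib-only sketch of get-or-create within a single process, using threads. It is not dogpile.cache's API or implementation; the class `SingleFlightCache` and its method signature are assumptions for illustration. Coordinating separate processes would additionally need an out-of-band lock such as a file lock.

```python
import threading
import time


class SingleFlightCache:
    """In-process get-or-create cache: one creator per key at a time."""

    def __init__(self):
        self._values = {}  # key -> (value, expires_at)
        self._locks = {}   # key -> threading.Lock
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get_or_create(self, key, creator, ttl):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Re-check: another thread may have created the value while
            # this one was waiting for the lock.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = creator()
            self._values[key] = (value, time.monotonic() + ttl)
            return value
```

    When many threads ask for the same missing or expired key at once, only the first one runs `creator`; the rest wait on the per-key lock and then read the freshly stored value.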
