How to avoid computation every time a python module is reloaded

温柔的废话 2021-02-06 10:55

I have a python module that makes use of a huge dictionary global variable. Currently I put the computation code in the top section, so every first import or reload of the module re-runs the computation, which takes a long time. How can I avoid repeating that work?

13 Answers
  • 2021-02-06 11:10

    If the 'shelve' solution turns out to be too slow or fiddly, there are other possibilities:

    • shove
    • Durus
    • ZopeDB
    • pyTables
  • 2021-02-06 11:13
    1. Factor the computationally intensive part into a separate module. Then at least on reload, you won't have to wait.

    2. Try dumping the data structure using protocol 2. The call would be cPickle.dump(FD, f, 2), where f is a file opened in binary mode. From the docstring for cPickle.Pickler:

      Protocol 0 is the
      only protocol that can be written to a file opened in text
      mode and read back successfully.  When using a protocol higher
      than 0, make sure the file is opened in binary mode, both when
      pickling and unpickling. 
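
    A minimal sketch of this caching pattern (compute_huge_dict and the cache file name are placeholders for whatever builds the dictionary in the real module):

    import cPickle
    import os

    CACHE_FILE = 'fd_cache.pkl'

    if os.path.exists(CACHE_FILE):
        # Load the previously computed dictionary instead of rebuilding it.
        with open(CACHE_FILE, 'rb') as f:   # binary mode, required above protocol 0
            FD = cPickle.load(f)
    else:
        FD = compute_huge_dict()            # the expensive computation, done once
        with open(CACHE_FILE, 'wb') as f:
            cPickle.dump(FD, f, 2)          # protocol 2 is a compact binary format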
      
  • 2021-02-06 11:15

    I'm going through this same issue... shelve, databases, etc. are all too slow for this type of problem. You'll need to take the hit once and insert the data into an in-memory key/value store like Redis. It will just live there in memory (warning: it could use up a good amount of memory, so you may want a dedicated box). You'll never have to reload it, and lookups are served straight from memory:

    from redis import Redis   # redis-py client; assumes a Redis server is running

    r = Redis()
    r.set(key, word)     # pay the cost once when loading the data

    word = r.get(key)    # later lookups come straight from Redis's memory
    
  • 2021-02-06 11:18

    Or you could just use a database for storing the values. Check out SQLObject, which makes it very easy to store stuff in a database.
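
    A rough sketch of what that could look like with SQLObject (the table, column names, and database path below are made up for illustration):

    from sqlobject import SQLObject, StringCol, IntCol, connectionForURI, sqlhub

    # Hypothetical on-disk cache backed by SQLite.
    sqlhub.processConnection = connectionForURI('sqlite:/path/to/cache.db')

    class CachedValue(SQLObject):
        key = StringCol(alternateID=True)   # alternateID gives a byKey() lookup
        value = IntCol()

    CachedValue.createTable(ifNotExists=True)

    # Populate once...
    CachedValue(key='key99999', value=99999)

    # ...then later processes just query by key instead of recomputing.
    print(CachedValue.byKey('key99999').value)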

  • 2021-02-06 11:20

    Just to clarify: the code in the body of a module is not executed every time the module is imported - it is run only once, after which future imports find the already created module, rather than recreating it. Take a look at sys.modules to see the list of cached modules.
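
    For example (mymodule is a stand-in for your module):

    import sys

    import mymodule     # first import: runs the module's top-level code
    import mymodule     # subsequent imports: found in sys.modules, nothing re-runs

    print('mymodule' in sys.modules)    # True
    # only an explicit reload(mymodule) re-executes the top-level code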

    However, if your problem is the time it takes for the first import after the program is run, you'll probably need to use some method other than a python dict. Best would be an on-disk form, for instance a sqlite database or one of the dbm modules.

    For a minimal change to your interface, the shelve module may be your best option - it wraps the dbm modules behind a fairly transparent interface that makes them act like an arbitrary python dict, allowing any picklable value to be stored. Here's an example:

    # Create dict with a million items:
    import shelve
    d = shelve.open('path/to/my_persistant_dict')
    d.update(('key%d' % x, x) for x in xrange(1000000))
    d.close()
    

    Then in the next process, use it. There should be no large delay, as lookups hit the on-disk form only for the keys actually requested, so nothing has to be loaded into memory up front:

    >>> d = shelve.open('path/to/my_persistant_dict')
    >>> print d['key99999']
    99999
    

    It's a bit slower than a real dict, and it will still take a long time to load if you do something that requires all the keys (e.g. trying to print it), but it may solve your problem.

  • 2021-02-06 11:20

    shelve gets really slow with large data sets. I've been using redis quite successfully, and wrote a FreqDist wrapper around it. It's very fast, and can be accessed concurrently.
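
    The actual FreqDist wrapper isn't shown here, but a minimal dict-like class over a Redis hash (names below are illustrative only) might look like this:

    from redis import Redis

    class RedisDict(object):
        """Store key/value pairs in a Redis hash so they survive reloads."""

        def __init__(self, name, host='localhost', port=6379):
            self._r = Redis(host=host, port=port)
            self._name = name

        def __setitem__(self, key, value):
            self._r.hset(self._name, key, value)

        def __getitem__(self, key):
            value = self._r.hget(self._name, key)
            if value is None:
                raise KeyError(key)
            return value

        def __contains__(self, key):
            return self._r.hexists(self._name, key)

        def __len__(self):
            return self._r.hlen(self._name)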
