How can I speed up unpickling large objects if I have plenty of RAM?


It's taking me up to an hour to read a 1-gigabyte NetworkX graph data structure using cPickle (it's 1 GB when stored on disk as a binary pickle file).

Note that the

8 Answers
  • 2020-12-09 09:29

    why don't you use pickle.load?

    import pickle

    with open('fname', 'rb') as f:
        graph = pickle.load(f)
    
  • 2020-12-09 09:30

    You're probably bound by Python object creation/allocation overhead, not the unpickling itself. If so, there is little you can do to speed this up, except not creating all the objects. Do you need the entire structure at once? If not, you could use lazy population of the data structure (for example: represent parts of the structure by pickled strings, then unpickle them only when they are accessed).
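
    A minimal sketch of that lazy approach, assuming you can split your structure into parts (the LazyPart wrapper and the part names below are made up for illustration):

    import pickle

    class LazyPart:
        """Hold a pickled blob and only unpickle it on first access."""
        def __init__(self, blob):
            self._blob = blob
            self._value = None

        def get(self):
            if self._value is None:
                self._value = pickle.loads(self._blob)
                self._blob = None  # release the raw bytes
            return self._value

    # Toy stand-in for sub-structures of a big graph.
    subgraphs = {'part_a': list(range(1000)), 'part_b': list(range(1000))}

    # Store each part as a pickled blob; loading a dict like this is cheap
    # because it only creates small wrapper objects, not the full object graph.
    parts = {name: LazyPart(pickle.dumps(obj)) for name, obj in subgraphs.items()}

    # Only the parts you actually touch pay the object-creation cost.
    first = parts['part_a'].get()

    The idea is that the initial load only materialises lightweight wrappers, and the expensive object creation is deferred until (and unless) a part is accessed.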

  • 2020-12-09 09:34

    I'm also trying to speed up the loading/storing of networkx graphs. I'm using the adjacency_graph method to convert the graph to something serialisable; see for instance this code:

    import pickle

    from networkx.generators import fast_gnp_random_graph
    from networkx.readwrite import json_graph

    G = fast_gnp_random_graph(4000, 0.7)

    # Convert to a JSON-serialisable dict before pickling.
    with open('/tmp/graph.pickle', 'wb+') as f:
        data = json_graph.adjacency_data(G)
        pickle.dump(data, f)

    # Load the dict back and rebuild the graph from it.
    with open('/tmp/graph.pickle', 'rb') as f:
        d = pickle.load(f)
        H = json_graph.adjacency_graph(d)
    

    However, this adjacency_graph conversion method is quite slow, so the time gained in pickling is probably lost on converting.

    So this actually doesn't speed things up, bummer. Running this code gives the following timings:

    N=1000
    
        0.666s ~ generating
        0.790s ~ converting
        0.237s ~ storing
        0.295s ~ loading
        1.152s ~ converting
    
    N=2000
    
        2.761s ~ generating
        3.282s ~ converting
        1.068s ~ storing
        1.105s ~ loading
        4.941s ~ converting
    
    N=3000
    
        6.377s ~ generating
        7.644s ~ converting
        2.464s ~ storing
        2.393s ~ loading
        12.219s ~ converting
    
    N=4000
    
        12.458s ~ generating
        19.025s ~ converting
        8.825s ~ storing
        8.921s ~ loading
        27.601s ~ converting
    

    This rapid growth is probably because, at a fixed edge probability, the number of edges grows roughly quadratically with the number of nodes. Here is a test gist, in case you want to try it yourself:

    https://gist.github.com/wires/5918834712a64297d7d1

  • 2020-12-09 09:37

    In general, I've found that, when saving large objects to disk in Python, it's much more efficient to use numpy ndarrays or scipy.sparse matrices where possible.

    So for huge graphs like the one in this example, you could convert the graph to a scipy sparse matrix (networkx has a function that does this, and it's not hard to write one), and then save that sparse matrix in binary format.
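
    A rough sketch of that route, assuming a recent networkx/scipy (in older networkx the conversion functions are named to_scipy_sparse_matrix / from_scipy_sparse_matrix instead):

    import networkx as nx
    import scipy.sparse as sp

    G = nx.fast_gnp_random_graph(1000, 0.1)

    # Convert the graph to a CSR sparse adjacency matrix.
    A = nx.to_scipy_sparse_array(G)

    # Save/load in binary .npz format, which is compact and fast.
    sp.save_npz('/tmp/graph.npz', A)
    A2 = sp.load_npz('/tmp/graph.npz')

    # Rebuild the graph from the adjacency matrix.
    H = nx.from_scipy_sparse_array(A2)

    Note that node/edge attributes don't survive this round trip; only the graph's structure (and edge weights) are kept, so it fits best when the graph is essentially an adjacency structure.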

  • 2020-12-09 09:45

    I had great success reading a ~750 MB igraph data structure (a binary pickle file) using cPickle itself. This was achieved simply by wrapping the pickle.load call between gc.disable() and gc.enable(), as mentioned here.

    Example snippet in your case would be something like:

    import cPickle as pickle  # on Python 3, just: import pickle
    import gc

    f = open("bigNetworkXGraph.pickle", "rb")

    # Disable the garbage collector: cyclic-GC passes over the millions of
    # freshly created objects dominate the load time for large pickles.
    gc.disable()
    try:
        graph = pickle.load(f)
    finally:
        # Re-enable the garbage collector and close the file, even on error.
        gc.enable()
        f.close()
    

    This definitely isn't the most apt way to do it; however, it reduces the time required drastically (for me, from 843.04 s to 41.28 s, around 20x).

  • 2020-12-09 09:52

    Why don't you try marshaling your data and storing it in RAM using memcached, for example? Yes, it has some limitations, but as this points out, marshaling is way faster (20 to 30 times) than pickling.

    Of course, you should also spend as much time optimizing your data structure in order to minimize the amount and complexity of data you want stored.
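
    Keep in mind that marshal only handles built-in types (it can't serialise a NetworkX graph object directly) and its format isn't guaranteed to be stable across Python versions, so you'd first have to reduce the graph to plain dicts/lists. A rough, illustrative comparison on toy data (timings will vary):

    import marshal
    import pickle
    import time

    # Toy stand-in for a large structure made only of built-in types.
    data = {i: list(range(20)) for i in range(100000)}

    t0 = time.perf_counter()
    obj = marshal.loads(marshal.dumps(data))
    t_marshal = time.perf_counter() - t0

    t0 = time.perf_counter()
    obj = pickle.loads(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))
    t_pickle = time.perf_counter() - t0

    print("marshal: %.3fs  pickle: %.3fs" % (t_marshal, t_pickle))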
