NoSQL Solution for Persisting Graphs at Scale

抹茶落季 2021-01-29 22:03

I'm hooked on using Python and NetworkX for analyzing graphs, and as I learn more I want to use more and more data (guess I'm becoming a data junkie :-). Eventually I think my

3 Answers
  •  闹比i (original poster)
    2021-01-29 22:34

    There are two general types of containers for storing graphs:

    1. true graph databases: e.g., Neo4j, agamemnon, GraphDB, and AllegroGraph; these not only store a graph but also understand what a graph is, so you can ask them graph questions directly, e.g., how many nodes lie on the shortest path between node X and node Y?

    2. static graph containers: Twitter's MySQL-adapted FlockDB is the best-known exemplar here. These DBs can store and retrieve graphs just fine, but to query the graph itself you first have to retrieve it from the DB and then use a library (e.g., Python's excellent NetworkX) to run the query.
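
    To make the distinction concrete: with a static container, a question like the shortest-path one above is answered client-side, after the graph has been fetched. In practice a library call such as NetworkX's shortest_path would do the work; the dependency-free sketch below shows the same computation with a plain breadth-first search. The sample graph and the helper name nodes_between are made up for illustration.

```python
from collections import deque


def nodes_between(adjacency, source, target):
    """Count the intermediate nodes on a shortest path from source to target.

    `adjacency` maps each node to an iterable of its neighbours.
    Returns None when target cannot be reached from source.
    """
    hops = {source: 0}  # node -> number of edges from source
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # a path with k edges has k - 1 nodes strictly between its ends
            return max(hops[node] - 1, 0)
        for neighbour in adjacency.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return None


# hypothetical graph, assumed to have been retrieved from the data store already
graph = {
    "X": ["a", "b"],
    "a": ["X", "c"],
    "b": ["X", "Y"],
    "c": ["a", "Y"],
    "Y": ["b", "c"],
}
print(nodes_between(graph, "X", "Y"))
```

    A true graph database would answer the same question inside the server, without shipping the graph to the client first.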

    The redis-based graph container I discuss below is in the second category, though apparently redis is also well suited to containers in the first category, as evidenced by redis-graph, a remarkably small Python package for implementing a graph database in redis.

    redis will work beautifully here.

    Redis is a heavy-duty, durable data store suitable for production use, yet it's also simple enough to use for command-line analysis.

    Redis differs from other databases in that it offers multiple data structure types; the one I would recommend here is the hash data type. Using this redis data structure lets you very closely mimic a "list of dictionaries", a conventional schema for storing graphs, in which each item in the list is a dictionary of edges keyed to the node from which those edges originate.
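
    As a rough illustration of that schema in plain Python (no redis involved yet), the sketch below keeps one dictionary of edges per node. The node names and the neighbours helper are hypothetical; the 0/1 flags follow the same convention used for the redis hashes later in this answer.

```python
# Hypothetical four-node graph in "one dictionary of edges per node" form.
# The outer key is the node an edge starts from; the inner dictionary maps
# every node to 1 (edge present) or 0 (no edge).
graph = {
    "a": {"a": 0, "b": 1, "c": 1, "d": 0},
    "b": {"a": 1, "b": 0, "c": 0, "d": 1},
    "c": {"a": 1, "b": 0, "c": 0, "d": 1},
    "d": {"a": 0, "b": 1, "c": 1, "d": 0},
}


def neighbours(graph, node):
    """Return the sorted names of the nodes that `node` has an edge to."""
    # int() lets the same code read "0"/"1" strings as well as integers
    return sorted(target for target, flag in graph[node].items() if int(flag))


print(neighbours(graph, "a"))
```

    Each outer key corresponds to one redis key holding a hash, and each inner dictionary corresponds to that hash's field/value pairs.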

    You first need to install redis and the Python client. The DeGizmo Blog has an excellent "up-and-running" tutorial which includes a step-by-step guide to installing both.

    Once redis and its python client are installed, start a redis server, which you do like so:

    • cd to the directory in which you installed redis (/usr/local/bin on *nix if you installed via make install); next

    • type redis-server at the shell prompt, then press Enter

    You should now see the server log tailing in your shell window.

    >>> import numpy as NP
    >>> import networkx as NX
    
    >>> # start a redis client & connect to the server:
    >>> from redis import StrictRedis as redis
    >>> # decode_responses=True returns str rather than bytes under Python 3
    >>> r1 = redis(db=1, host="localhost", port=6379, decode_responses=True)
    

    In the snippet below, I store a four-node graph; each line calls hmset on the redis client and stores one node and the edges connected to that node ("0" => no edge, "1" => edge). (In practice, of course, you would abstract these repetitive calls into a function; here I'm showing each call because it's likely easier to understand that way.)

    >>> r1.hmset("n1", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
          True
    
    >>> r1.hmset("n2", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
          True
    
    >>> r1.hmset("n3", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
          True
    
    >>> r1.hmset("n4", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
          True
    
    >>> # retrieve the edges for a given node:
    >>> r1.hgetall("n2")
          {'n1': '1', 'n2': '0', 'n3': '0', 'n4': '1'}
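
    The repetitive calls above could be wrapped up along the following lines. This is only a sketch: store_graph and adjacency_from_edges are hypothetical helper names, and it assumes a redis-py release recent enough that hset accepts a mapping= argument (the newer counterpart of hmset).

```python
def adjacency_from_edges(nodes, edges):
    """Build the 0/1 row dictionaries from an undirected edge list."""
    rows = {u: {v: 0 for v in nodes} for u in nodes}
    for u, v in edges:
        rows[u][v] = 1
        rows[v][u] = 1  # undirected: mirror the edge
    return rows


def store_graph(client, adjacency):
    """Write one redis hash per node; return the number of nodes written.

    `client` is assumed to be a redis-py client, or any object exposing a
    compatible hset(name, mapping=...) method.
    """
    count = 0
    for node, row in adjacency.items():
        client.hset(node, mapping=row)
        count += 1
    return count
```

    With the client from above, a call might look like store_graph(r1, adjacency_from_edges(["n1", "n2", "n3"], [("n1", "n2"), ("n2", "n3")])).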
    

    Now that the graph is persisted, retrieve it from the redis DB as a NetworkX graph.

    There are many ways to do this; below I do it in two steps:

    1. extract the data from the redis database into an adjacency matrix, implemented as a 2D NumPy array; then

    2. convert that directly to a NetworkX graph using a NetworkX built-in function:

    Reduced to code, these two steps are:

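    >>> # fix the node order once so that rows and columns line up
    >>> nodes = sorted(r1.keys("*"))
    >>> AM = NP.array([[int(r1.hget(n, m)) for m in nodes] for n in nodes])
    >>> # now convert this adjacency matrix back to a networkx graph
    >>> # (from_numpy_array replaces from_numpy_matrix, removed in NetworkX 3.0):
    >>> G = NX.from_numpy_array(AM)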
    
    >>> # verify that G in fact holds the original graph:
    >>> type(G)
          <class 'networkx.classes.graph.Graph'>
    >>> list(G.nodes())
          [0, 1, 2, 3]
    >>> list(G.edges())
          [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
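
    Going through an adjacency matrix replaces the node names with integer indices. If you would rather keep the names, one alternative is to read the hashes straight into an edge list. The sketch below is an assumption-laden illustration, not the only way to do it: edges_from_hashes is a hypothetical helper, it assumes the database holds nothing but the per-node hashes, and it relies on the redis-py calls scan_iter and hgetall.

```python
def edges_from_hashes(client, pattern="*"):
    """Return (nodes, edges) read from hashes stored one per node."""

    def text(value):
        # redis-py returns bytes unless decode_responses=True was set
        return value.decode() if isinstance(value, bytes) else str(value)

    nodes = sorted(text(key) for key in client.scan_iter(match=pattern))
    edges = set()
    for node in nodes:
        for target, flag in client.hgetall(node).items():
            if int(flag):
                # keep each undirected edge once, endpoints in sorted order
                edges.add(tuple(sorted((node, text(target)))))
    return nodes, sorted(edges)
```

    The result can be handed to NetworkX with G = NX.Graph() followed by G.add_nodes_from(nodes) and G.add_edges_from(edges), which keeps "n1" through "n4" as the node labels.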
    

    When you end a redis session, you can shut down the server from the client like so:

    >>> r1.shutdown()
    

    redis saves to disk just before it shuts down so this is a good way to ensure all writes were persisted.

    So where is the redis DB? By default it is written to a file named dump.rdb in the working directory from which the redis server was started (the default dir setting is ./).
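
    If you are unsure where a running server is writing its snapshot, you can ask the server itself. The sketch below assumes a redis-py client whose config_get call returns a dictionary keyed by setting name, and a server that permits the CONFIG command; rdb_path is a hypothetical helper name.

```python
import os


def rdb_path(client):
    """Return the path of the snapshot file the server is configured to write."""
    directory = client.config_get("dir")["dir"]
    filename = client.config_get("dbfilename")["dbfilename"]
    return os.path.join(directory, filename)
```

    With the client from earlier, rdb_path(r1) would give the full path of the current snapshot file.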

    To change this, edit the redis.conf file (included with the redis source distribution); go to the line starting with:

    # The filename where to dump the DB
    dbfilename dump.rdb
    

    change dump.rdb to anything you wish, but leave the .rdb extension in place.

    Next, to change the file path, find this line in redis.conf:

    # Note that you must specify a directory here, not a file name
    

    The line below that is the directory location for the redis database. Edit it so that it points to the location you want. Save your revisions and rename this file, but keep the .conf extension. You can store this config file anywhere you wish; just provide its full path and name on the same line when you start a redis server.

    So the next time you start a redis server, you must do it like so (from the shell prompt):

    $> cd /usr/local/bin    # or the directory in which you installed redis 
    
    $> redis-server /path/to/redis.conf
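
    Rather than editing a copy of the stock redis.conf by hand, a small custom config file can also be generated. The sketch below writes only the settings discussed here; it assumes that redis falls back to its built-in defaults for anything left out and that the paths contain no spaces. The helper name write_redis_conf and its default values are hypothetical.

```python
from pathlib import Path


def write_redis_conf(conf_path, data_dir, dbfilename="dump.rdb", port=6379):
    """Write a minimal redis config file and return its path."""
    lines = [
        f"port {port}",
        # directory in which the snapshot file is stored (a directory, not a file)
        f"dir {data_dir}",
        # name of the snapshot file inside that directory
        f"dbfilename {dbfilename}",
    ]
    path = Path(conf_path)
    path.write_text("\n".join(lines) + "\n")
    return path
```

    The resulting file is then passed to redis-server exactly as shown above.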
    

    Finally, the Python Package Index lists a package specifically for implementing a graph database in redis. The package is called redis-graph; I have not used it.
