Question
Are graph databases more performant than relational databases for highly connected acyclic graph data?
I need to significantly speed up my query results, and I hope graph databases will be the answer. I saw a significant improvement in my relational database queries when I used Common Table Expressions, bringing a recursive search of my sample data down from 16 hours to 30 minutes. Still, 30 minutes is far too long for a web application, and trying to work around that kind of response time with caching gets ridiculous pretty quickly.
My Gremlin query looks something like:
g.withSack(100D).
  V(with vertex id).
  repeat(out('edge_label').
    sack(div).by(constant(2D))).
  emit().
  group().by('node_property').by(sack().sum()).
  unfold().
  order().by(values, decr).
  fold()
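Conceptually, the sack traversal starts each walk with a weight of 100, halves it on every hop, and sums the per-node contributions over all paths, then sorts descending. A minimal Python sketch of that computation over a toy adjacency list (the graph and all names here are illustrative, not the real schema):

```python
from collections import defaultdict

def weighted_reach(adj, start, initial=100.0):
    """Sum, per reached node, the contributions initial/2^depth over all paths."""
    totals = defaultdict(float)

    def walk(node, weight):
        for child in adj.get(node, []):
            w = weight / 2.0          # sack(div).by(constant(2D))
            totals[child] += w        # group().by(...).by(sack().sum())
            walk(child, w)

    walk(start, initial)
    # order().by(values, decr)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# toy DAG: a -> b, a -> c, b -> c
adj = {"a": ["b", "c"], "b": ["c"]}
print(weighted_reach(adj, "a"))  # [('c', 75.0), ('b', 50.0)]
```

Note that this naive recursion revisits a shared subtree once per incoming path, which is exactly the exponential blow-up on highly connected DAGs that motivates the question.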
and a Cypher equivalent (thank you, cyberSam) looks something like:
MATCH p=(f:Foo)-[:edge_label*]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC
and my SQL roughly like:
WITH PctCTE(id, pt, tipe, ct)
AS
(SELECT id, CONVERT(DECIMAL(28,25),100.0) AS pt, kynd, 1
FROM db.reckrd parent
WHERE parent.id = @id
UNION ALL
SELECT child.id, CONVERT(DECIMAL(28,25),parent.pt/2.0), child.kynd, parent.ct+1
FROM db.reckrd AS child
INNER JOIN PctCTE AS parent
ON (parent.tipe = 'M' AND
(child.emm = parent.id))
OR
(NOT parent.tipe = 'M' AND
(child.not_emm = parent.id))
),
mergeCTE(dups, h, p)
AS
(SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY ct) 'dups', id, SUM(pt) OVER (PARTITION BY id)
FROM PctCTE
)
which should return a result set with 500,000+ edges in my test instance.
Even if I filtered to reduce the size of the output, the filter would only apply after traversing all of those edges, so I would still pay the full traversal cost before getting to the interesting stuff I want to analyse.
I can foresee some queries on real data getting closer to having to traverse 3,000,000+ edges ...
If graph databases aren't the answer, is a CTE as good as it gets?
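For reference, the mergeCTE step in the SQL above boils down to "sum pt per id and keep one row per id". A rough Python equivalent of those window functions (the row shape is taken from the CTE columns; everything else is illustrative):

```python
from collections import defaultdict

def merge_rows(rows):
    """rows: iterable of (id, pt, ct) tuples from the recursive CTE.
    Returns one total per id, like SUM(pt) OVER (PARTITION BY id)
    with ROW_NUMBER() used to keep a single row per id."""
    totals = defaultdict(float)
    for id_, pt, ct in rows:
        totals[id_] += pt
    return dict(totals)

print(merge_rows([(1, 50.0, 1), (1, 25.0, 2), (2, 12.5, 2)]))
# {1: 75.0, 2: 12.5}
```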
Answer 1:
[UPDATED]
When using neo4j, here is a roughly equivalent Cypher query, which uses a variable-length relationship pattern:
MATCH p=(f:Foo)-[:edge_label*]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC
A variable-length relationship has exponential complexity. If the average degree is D and the maximum depth is N, then you should expect a complexity of about O(D^N). With your use case, that is on the order of about 4^30 operations.
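To put O(D^N) in perspective, with D and N as above (the depth cap of 8 anticipates the approximation discussed next):

```python
D, N = 4, 30
full = D ** N     # unbounded traversal: 4^30 = 2^60
capped = D ** 8   # traversal capped at depth 8
print(full)    # 1152921504606846976 (~1.15e18)
print(capped)  # 65536
```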
However, since in your use case a node's contribution to its total weight decreases exponentially by its depth in a given path, you could get a close approximation to the actual result by simply ignoring nodes that are beyond a threshold depth.
For example, a node at a depth of 8 would only contribute 0.5^7 ≈ 0.0078 of the initial weight (about 0.78 out of 100) along a given path. And the complexity at that depth would only be 4^8 (about 65K) operations, which should be reasonably fast. The Cypher query for a max depth of 8 would be only slightly different:
MATCH p=(f:Foo)-[:edge_label*..8]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC
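Assuming the same halving weights, the truncation error is easy to bound: along any single path, everything beyond depth 8 sums (by the geometric series) to less than the depth-8 contribution itself. A quick check:

```python
def contribution(depth, initial=100.0):
    # weight added by a node at the given depth along one path
    return initial * 0.5 ** (depth - 1)

print(contribution(1))  # 100.0
print(contribution(8))  # 0.78125
# everything past depth 8 on one path sums to less than contribution(8)
tail = sum(contribution(d) for d in range(9, 60))
print(tail < contribution(8))  # True
```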
Answer 2:
I tried JanusGraph 0.5.2 with BerkeleyDB Java Edition. My sample data set has 580,832 vertices and 2,325,896 edges, loaded from a roughly 1 GB GraphML file. The network's average degree is 4, diameter 30, average path length 1124, modularity 0.7, average clustering coefficient 0.013, and eigenvector centrality (100 iterations) 4.5.
No doubt I am writing my query rather amateurishly, but after waiting 10 hours only to receive a Java out-of-memory error, it is clear that my CTE is at least 20 times faster!
My conf/janusgraph-berkeleyje.properties file included the following settings:
gremlin.graph = org.janusgraph.core.JanusGraphFactory
storage.backend = berkeleyje
storage.directory = ../db/berkeley
cache.db-cache = true
cache.db-cache-size = 0.5
cache.db-cache-time = 0
cache.tx-cache-size = 20000
cache.db-cache-clean-wait = 0
storage.transactions = false
storage.berkeleyje.cache-percentage = 65
At this stage in my investigation, it would appear that CTEs are at least an order of magnitude more performant than graph databases on heavily recursive queries. I would love to be wrong...
Source: https://stackoverflow.com/questions/63112168/graph-database-or-relational-database-common-table-extensions-comparing-acyclic