Question
Are graph databases more performant than relational databases for highly connected acyclic graph data?
I need to significantly speed up my query results, and I hope graph databases will be the answer. I saw a significant improvement in my relational database queries when I used Common Table Expressions, bringing a recursive search of my sample data down from 16 hours to 30 minutes. Still, 30 minutes is far too long for a web application, and trying to work around that kind of response time with caching gets ridiculous pretty quickly.
My Gremlin query looks something like:
g.withSack(100D).
  V(with vertex id).
  repeat(out('edge_label').
    sack(div).by(constant(2D))).
  emit().
  group().by('node_property').by(sack().sum()).
  unfold().
  order().by(values, decr).
  fold()
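Conceptually, the sack traversal starts each walk with a weight of 100, halves it on every hop, and sums the per-node contributions over all paths, then sorts descending. A minimal Python sketch of that computation over a toy adjacency list (the graph and all names here are illustrative, not the real schema):

```python
from collections import defaultdict

def weighted_reach(adj, start, initial=100.0):
    """Sum, per reached node, the contributions initial/2^depth over all paths."""
    totals = defaultdict(float)

    def walk(node, weight):
        for child in adj.get(node, []):
            w = weight / 2.0          # sack(div).by(constant(2D))
            totals[child] += w        # group().by(...).by(sack().sum())
            walk(child, w)

    walk(start, initial)
    # order().by(values, decr)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# toy DAG: a -> b, a -> c, b -> c
adj = {"a": ["b", "c"], "b": ["c"]}
print(weighted_reach(adj, "a"))  # [('c', 75.0), ('b', 50.0)]
```

Note that this naive recursion revisits a shared subtree once per incoming path, which is exactly the exponential blow-up on highly connected DAGs that motivates the question.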
and a Cypher equivalent (thank you, cyberSam) looks something like:
MATCH p=(f:Foo)-[:edge_label*]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC
and my SQL roughly like:
WITH PctCTE(id, pt, tipe, ct)
AS
(SELECT id, CONVERT(DECIMAL(28,25),100.0) AS pt, kynd, 1
FROM db.reckrd parent
WHERE parent.id = @id
UNION ALL
SELECT child.id, CONVERT(DECIMAL(28,25),parent.pt/2.0), child.kynd, parent.ct+1
FROM db.reckrd AS child
INNER JOIN PctCTE AS parent
ON (parent.tipe = 'M' AND
(child.emm = parent.id))
OR
(NOT parent.tipe = 'M' AND
(child.not_emm = parent.id))
),
mergeCTE(dups, h, p)
AS
(SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY ct) 'dups', id, SUM(pt) OVER (PARTITION BY id)
FROM PctCTE
)
which should return a result set with 500,000+ edges in my test instance.
Even if I filtered to reduce the size of the output, the filter would only apply after traversing all of those edges, so I would still pay the full traversal cost before getting to the interesting stuff I want to analyse.
I can foresee some queries on real data getting closer to having to traverse 3,000,000+ edges ...
If graph databases aren't the answer, is a CTE as good as it gets?
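For reference, the mergeCTE step in the SQL above boils down to "sum pt per id and keep one row per id". A rough Python equivalent of those window functions (the row shape is taken from the CTE columns; everything else is illustrative):

```python
from collections import defaultdict

def merge_rows(rows):
    """rows: iterable of (id, pt, ct) tuples from the recursive CTE.
    Returns one total per id, like SUM(pt) OVER (PARTITION BY id)
    with ROW_NUMBER() used to keep a single row per id."""
    totals = defaultdict(float)
    for id_, pt, ct in rows:
        totals[id_] += pt
    return dict(totals)

print(merge_rows([(1, 50.0, 1), (1, 25.0, 2), (2, 12.5, 2)]))
# {1: 75.0, 2: 12.5}
```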
Answer 1:
[UPDATED]
When using neo4j, here is a roughly equivalent Cypher query, which uses a variable-length relationship pattern:
MATCH p=(f:Foo)-[:edge_label*]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC
A variable-length relationship has exponential complexity. If the average degree is D and the maximum depth is N, then you should expect a complexity of about O(D^N). With your use case, that is on the order of about 4^30 operations.
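To put O(D^N) in perspective, with D and N as above (the depth cap of 8 anticipates the approximation discussed next):

```python
D, N = 4, 30
full = D ** N     # unbounded traversal: 4^30 = 2^60
capped = D ** 8   # traversal capped at depth 8
print(full)    # 1152921504606846976 (~1.15e18)
print(capped)  # 65536
```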
However, since in your use case a node's contribution to its total weight decreases exponentially by its depth in a given path, you could get a close approximation to the actual result by simply ignoring nodes that are beyond a threshold depth.
For example, a node at a depth of 8 would only contribute 0.5^7 ≈ 0.0078 of the initial weight (about 0.78 out of 100) along a given path. And the complexity at that depth would only be 4^8 (about 65K) operations, which should be reasonably fast. The Cypher query for a max depth of 8 would be only slightly different:
MATCH p=(f:Foo)-[:edge_label*..8]->(g)
WHERE f.id = 123
RETURN g, SUM(100*0.5^(LENGTH(p)-1)) AS weight
ORDER BY weight DESC
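Assuming the same halving weights, the truncation error is easy to bound: along any single path, everything beyond depth 8 sums (by the geometric series) to less than the depth-8 contribution itself. A quick check:

```python
def contribution(depth, initial=100.0):
    # weight added by a node at the given depth along one path
    return initial * 0.5 ** (depth - 1)

print(contribution(1))  # 100.0
print(contribution(8))  # 0.78125
# everything past depth 8 on one path sums to less than contribution(8)
tail = sum(contribution(d) for d in range(9, 60))
print(tail < contribution(8))  # True
```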
Answer 2:
I tried JanusGraph 0.5.2 with BerkeleyDB Java Edition. My sample data set has 580,832 vertices and 2,325,896 edges, loaded from a roughly 1 GB GraphML file. The network's average degree is 4, diameter 30, average path length 1124, modularity 0.7, average clustering coefficient 0.013, and eigenvector centrality (100 iterations) 4.5.
No doubt I am writing my query rather amateurishly, but after waiting 10 hours only to receive a Java out-of-memory error, it is clear that my CTE is at least 20 times faster!
My conf/janusgraph-berkeleyje.properties file included the following settings:
gremlin.graph = org.janusgraph.core.JanusGraphFactory
storage.backend = berkeleyje
storage.directory = ../db/berkeley
cache.db-cache = true
cache.db-cache-size = 0.5
cache.db-cache-time = 0
cache.tx-cache-size = 20000
cache.db-cache-clean-wait = 0
storage.transactions = false
storage.berkeleyje.cache-percentage = 65
At this stage in my investigation, it would appear that CTEs are at least an order of magnitude more performant than graph databases on heavily recursive queries. I would love to be wrong...
Source: https://stackoverflow.com/questions/63112168/graph-database-or-relational-database-common-table-extensions-comparing-acyclic