I\'m interested in calculating the Jaccard similarity metric for all pairs of vertices in a graph that are not directly connected. The Jaccard metric is defined as the norm of
Let's do it step by step:
Find pairs of vertices and also collect their respective neighbors:
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', neq('v2'))
Make sure that v1
is not a neighbor of v2
and vice versa:
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', neq('v2').and(without('v2n'))).
where('v2', without('v1n'))
Next, compute the number of intersecting neighbors and the total number of neighbors:
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', neq('v2').and(without('v2n'))).
where('v2', without('v1n')).as('m').
project('v1','v2','i','u').
by(select('v1')).
by(select('v2')).
by(select('v1n').as('n').
select('m').
select('v2n').unfold().
where(within('n')).
count()).
by(union(select('v1n'),
select('v2n')).unfold().
dedup().count())
And finally, compute the Jaccard similarity by dividing i
by u
(also make sure that vertices without neighbors get filtered out to prevent divisions by 0):
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', neq('v2').and(without('v2n'))).
where('v2', without('v1n')).as('m').
project('v1','v2','i','u').
by(select('v1')).
by(select('v2')).
by(select('v1n').as('n').
select('m').
select('v2n').unfold().
where(within('n')).
count()).
by(union(select('v1n'),
select('v2n')).unfold().
dedup().count()).
filter(select('u').is(gt(0))).
project('v1','v2','j').
by(select('v1')).
by(select('v2')).
by(math('i/u'))
One last thing: Since comparing vertex v1
and v2
is the same as comparing v2
and v1
, the query only needs to consider one case. One way to do that is by making sure that v1
's id is smaller than v2
's id:
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', lt('v2')).
by(id).
where('v1', without('v2n')).
where('v2', without('v1n')).as('m').
project('v1','v2','i','u').
by(select('v1')).
by(select('v2')).
by(select('v1n').as('n').
select('m').
select('v2n').unfold().
where(within('n')).
count()).
by(union(select('v1n'),
select('v2n')).unfold().
dedup().count()).
filter(select('u').is(gt(0))).
project('v1','v2','j').
by(select('v1')).
by(select('v2')).
by(math('i/u'))
Executing this traversal over the modern toy graph yields the following result:
gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().match(
......1> __.as('v1').out().dedup().fold().as('v1n'),
......2> __.as('v1').V().as('v2'),
......3> __.as('v2').out().dedup().fold().as('v2n')).
......4> where('v1', lt('v2')).
......5> by(id).
......6> where('v1', without('v2n')).
......7> where('v2', without('v1n')).as('m').
......8> project('v1','v2','i','u').
......9> by(select('v1')).
.....10> by(select('v2')).
.....11> by(select('v1n').as('n').
.....12> select('m').
.....13> select('v2n').unfold().
.....14> where(within('n')).
.....15> count()).
.....16> by(union(select('v1n'),
.....17> select('v2n')).unfold().
.....18> dedup().count()).
.....19> filter(select('u').is(gt(0))).
.....20> project('v1','v2','j').
.....21> by(select('v1')).
.....22> by(select('v2')).
.....23> by(math('i/u'))
==>[v1:v[1],v2:v[5],j:0.0]
==>[v1:v[1],v2:v[6],j:0.3333333333333333]
==>[v1:v[2],v2:v[4],j:0.0]
==>[v1:v[2],v2:v[6],j:0.0]
==>[v1:v[4],v2:v[6],j:0.5]
==>[v1:v[5],v2:v[6],j:0.0]