Gremlin syntax to calculate the Jaccard similarity metric

生来不讨喜 2021-01-28 19:04

I'm interested in calculating the Jaccard similarity metric for all pairs of vertices in a graph that are not directly connected. The Jaccard metric is defined as the norm of the intersection of the two vertices' neighbor sets divided by the norm of the union of those sets.

1 Answer
  • 2021-01-28 19:34

    Let's do it step by step:

    Find pairs of vertices and collect each vertex's neighbors. Note that out() follows outgoing edges only, so "neighbors" here means out-neighbors:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2'))
    

    Make sure that v1 is not a neighbor of v2 and vice versa:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n'))
    

    Next, compute the number of intersecting neighbors and the total number of neighbors:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count())
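
    To make the two counts concrete, here is a small sketch in plain Python (not Gremlin) for a single pair. It assumes the out-neighbor sets of v[1] and v[6] from TinkerPop's modern toy graph; the variable names simply mirror the labels used in the traversal.

```python
# Out-neighbor ids of v[1] and v[6] in TinkerPop's modern toy graph
# (hard-coded here as an assumption, not read from a live graph).
v1n = {2, 3, 4}
v2n = {3}

i = len(v1n & v2n)  # neighbors the two vertices share
u = len(v1n | v2n)  # distinct neighbors across both vertices

print(i, u)  # 1 3
```

    The 'i' projection in the traversal plays the role of the set intersection, and the union()/dedup() projection plays the role of the set union.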
    

    Finally, compute the Jaccard similarity by dividing i by u. Pairs whose combined neighbor set is empty (u = 0, i.e. neither vertex has any neighbors) are filtered out first to prevent division by zero:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count()).
      filter(select('u').is(gt(0))).
      project('v1','v2','j').
        by(select('v1')).
        by(select('v2')).
        by(math('i/u'))
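
    For reference, the quantity computed by math('i/u') can be written in set notation. The symbol N(v), denoting the set of out-neighbors of v, is introduced here only for illustration:

```latex
J(v_1, v_2) = \frac{|N(v_1) \cap N(v_2)|}{|N(v_1) \cup N(v_2)|} = \frac{i}{u}
```

    The ratio is undefined when both neighbor sets are empty, which is the case the filter on u excludes.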
    

    One last thing: Since comparing vertex v1 and v2 is the same as comparing v2 and v1, the query only needs to consider one case. One way to do that is by making sure that v1's id is smaller than v2's id:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', lt('v2')).
          by(id).
        where('v1', without('v2n')).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count()).
      filter(select('u').is(gt(0))).
      project('v1','v2','j').
        by(select('v1')).
        by(select('v2')).
        by(math('i/u'))
    

    Executing this traversal over the modern toy graph yields the following result:

    gremlin> g = TinkerFactory.createModern().traversal()
    ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
    gremlin> g.V().match(
    ......1>       __.as('v1').out().dedup().fold().as('v1n'),
    ......2>       __.as('v1').V().as('v2'),
    ......3>       __.as('v2').out().dedup().fold().as('v2n')).
    ......4>     where('v1', lt('v2')).
    ......5>       by(id).
    ......6>     where('v1', without('v2n')).
    ......7>     where('v2', without('v1n')).as('m').
    ......8>   project('v1','v2','i','u').
    ......9>     by(select('v1')).
    .....10>     by(select('v2')).
    .....11>     by(select('v1n').as('n').
    .....12>        select('m').
    .....13>        select('v2n').unfold().
    .....14>          where(within('n')).
    .....15>        count()).
    .....16>     by(union(select('v1n'),
    .....17>              select('v2n')).unfold().
    .....18>        dedup().count()).
    .....19>   filter(select('u').is(gt(0))).
    .....20>   project('v1','v2','j').
    .....21>     by(select('v1')).
    .....22>     by(select('v2')).
    .....23>     by(math('i/u'))
    ==>[v1:v[1],v2:v[5],j:0.0]
    ==>[v1:v[1],v2:v[6],j:0.3333333333333333]
    ==>[v1:v[2],v2:v[4],j:0.0]
    ==>[v1:v[2],v2:v[6],j:0.0]
    ==>[v1:v[4],v2:v[6],j:0.5]
    ==>[v1:v[5],v2:v[6],j:0.0]
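
    As a sanity check, the same logic can be sketched in plain Python (not Gremlin). This is only an illustrative cross-check: the adjacency map below hard-codes the out-neighbors of the modern toy graph, and the function name jaccard_pairs is hypothetical.

```python
from itertools import combinations

# Out-neighbors in TinkerPop's modern toy graph, keyed by vertex id
# (hard-coded as an assumption, not read from a live graph).
out_neighbors = {
    1: {2, 3, 4},
    2: set(),
    3: set(),
    4: {3, 5},
    5: set(),
    6: {3},
}

def jaccard_pairs(neighbors):
    """Return (v1, v2, j) for non-adjacent pairs with a non-empty neighbor union."""
    results = []
    # combinations over sorted ids yields each pair once, with v1 < v2
    for v1, v2 in combinations(sorted(neighbors), 2):
        n1, n2 = neighbors[v1], neighbors[v2]
        if v1 in n2 or v2 in n1:
            continue  # directly connected, skip
        union = n1 | n2
        if not union:
            continue  # neither vertex has neighbors, avoid division by zero
        results.append((v1, v2, len(n1 & n2) / len(union)))
    return results

for row in jaccard_pairs(out_neighbors):
    print(row)
```

    With this adjacency map, the sketch produces the same six pairs and the same values as the console output above.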
    