MySQL / PHP: Find similar / related items by tag / taxonomy

太阳男子 2020-12-29 08:37

I have a cities table which looks like this.

|id| Name    |
|1 | Paris   |
|2 | London  |
|3 | New York|

I have a tags table which looks li

5 answers
  •  小蘑菇 (original poster)
    2020-12-29 09:12

    This is late, but I don't think any of the existing answers is fully correct. I took the best part of each one and combined them into my own answer:

    • The Jaccard Index explanation by @m-khalid-junaid is very interesting and correct, but its implementation of (q.sets + q.parisset) AS union and (q.sets - q.parisset) AS intersect is wrong.
    • The version by @n-lx is the right approach, but it needs the Jaccard Index. This is important: if a city has 2 tags and both match a city that has 3 tags, a plain match count gives the same result as for another city that has only those same two tags. I think the full match should rank as the most related.

    My answer:

    The cities table looks like this.

    | id | Name      |
    | 1  | Paris     |
    | 2  | Florence  |
    | 3  | New York  |
    | 4  | São Paulo |
    | 5  | London    |
    

    The cities_tags table looks like this.

    | city_id | tag_id |
    | 1       | 1      | 
    | 1       | 3      | 
    | 2       | 1      |
    | 2       | 3      | 
    | 3       | 1      |     
    | 3       | 2      |
    | 4       | 2      |     
    | 5       | 1      |
    | 5       | 2      |
    | 5       | 3      |
    

    With this sample data, and Paris as the reference city, Florence is a full match, New York matches one tag, São Paulo matches no tags, and London matches two tags but has one extra. I think the Jaccard Index values for this sample are:

    Florence: 1.000 (2/2)

    London: 0.666 (2/3)

    New York: 0.333 (1/3)

    São Paulo: 0.000 (0/3)
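
The figures above can be checked outside the database. The following is only a minimal sketch in Python, assuming the sample rows from the two tables have been loaded into a dictionary; the names `city_tags`, `jaccard` and `rank_similar` are made up for this illustration and are not part of the answer's query.

```python
# Sample data from the answer: city name -> set of tag ids (from cities_tags).
city_tags = {
    "Paris": {1, 3},
    "Florence": {1, 3},
    "New York": {1, 2},
    "São Paulo": {2},
    "London": {1, 2, 3},
}


def jaccard(a, b):
    """Return (intersection size, union size, Jaccard index) for two tag sets."""
    intersect = len(a & b)
    union = len(a | b)
    # Guard against two cities that both have no tags (empty union).
    index = intersect / union if union else 0.0
    return intersect, union, index


def rank_similar(target, tags_by_city):
    """Rank every other city by its Jaccard index against `target`, best first."""
    rows = [
        (name, *jaccard(tags_by_city[target], tags))
        for name, tags in tags_by_city.items()
        if name != target
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


for name, intersect, union, index in rank_similar("Paris", city_tags):
    print(f"{name}: {index:.3f} ({intersect}/{union})")
```

The union is counted from the combined set of distinct tags, not by adding the two tag counts together, which is the point of the first bullet above.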

    My query is like this:

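    -- Rank every other city by its Jaccard index against the city with id = 1 (Paris).
    -- `union` is a reserved word in MySQL (and `intersect` is in recent versions),
    -- so both aliases are backtick-quoted.
    select jaccard.city,
           jaccard.`intersect`,
           jaccard.`union`,
           jaccard.`intersect` / jaccard.`union` as `jaccard index`
    from
    (select
        c2.name as city
        -- tags of c2 that c1 also has
        ,count(ct2.tag_id) as `intersect`
        -- distinct tags carried by either city
        ,(select count(distinct ct3.tag_id)
          from cities_tags ct3
          where ct3.city_id in (c1.id, c2.id)) as `union`
    from
        cities as c1
        inner join cities as c2 on c1.id != c2.id
        left join cities_tags as ct1 on ct1.city_id = c1.id
        left join cities_tags as ct2 on ct2.city_id = c2.id and ct1.tag_id = ct2.tag_id
    where c1.id = 1
    group by c1.id, c2.id) as jaccard
    order by jaccard.`intersect` / jaccard.`union` desc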
    

    SQL Fiddle
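
To try the query without a MySQL server, here is a rough sketch that loads the sample data into an in-memory SQLite database through Python's standard `sqlite3` module. It is an adaptation, not the original query: the aliases are renamed to `intersect_count` and `union_count` (my own names, chosen to avoid the reserved words), the city id is passed as a parameter, and the division is multiplied by `1.0` because SQLite, unlike MySQL, truncates integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE cities_tags (city_id INTEGER NOT NULL, tag_id INTEGER NOT NULL);
    INSERT INTO cities VALUES
        (1, 'Paris'), (2, 'Florence'), (3, 'New York'), (4, 'São Paulo'), (5, 'London');
    INSERT INTO cities_tags VALUES
        (1, 1), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (4, 2), (5, 1), (5, 2), (5, 3);
""")

# Same shape as the MySQL query above, adapted for SQLite (see the note before this block).
QUERY = """
SELECT j.city, j.intersect_count, j.union_count,
       1.0 * j.intersect_count / j.union_count AS jaccard_index
FROM (
    SELECT c2.name AS city,
           COUNT(ct2.tag_id) AS intersect_count,
           (SELECT COUNT(DISTINCT ct3.tag_id)
              FROM cities_tags ct3
             WHERE ct3.city_id IN (c1.id, c2.id)) AS union_count
      FROM cities AS c1
      JOIN cities AS c2 ON c1.id != c2.id
      LEFT JOIN cities_tags AS ct1 ON ct1.city_id = c1.id
      LEFT JOIN cities_tags AS ct2
             ON ct2.city_id = c2.id AND ct1.tag_id = ct2.tag_id
     WHERE c1.id = ?
     GROUP BY c1.id, c2.id
) AS j
ORDER BY jaccard_index DESC
"""

# 1 is the id of Paris in the sample data.
rows = conn.execute(QUERY, (1,)).fetchall()
for city, intersect_count, union_count, index in rows:
    print(f"{city}: {intersect_count}/{union_count} = {index:.3f}")
```

Note that a pair of cities with no tags at all would have a union of zero, so the index for that pair comes back as NULL (`None` in Python) rather than a number.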
