MySQL / PHP: Find similar / related items by tag / taxonomy

太阳男子 2020-12-29 08:37

I have a cities table which looks like this.

|id| Name    |
|1 | Paris   |
|2 | London  |
|3 | New York|

I have a tags table which looks li

5 Answers
  •  礼貌的吻别
    2020-12-29 09:01

    select c.name, cnt.val/(select count(*) from cities) as jaccard_index
    from cities c 
    inner join 
      (
      select city_id, count(*) as val 
      from cities_tags 
      where tag_id in (select tag_id from cities_tags where city_id=1) 
      and not city_id in (1)
      group by city_id
      ) as cnt 
    on c.id=cnt.city_id
    order by jaccard_index desc
    

    This query hard-codes city_id=1, so you'll have to replace it with a variable in two places: the "where tag_id in" subquery and the "not city_id in" condition.
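    One way to make the city a variable is a parameterized query. The sketch below uses Python with an in-memory SQLite database rather than PHP and MySQL, purely so that it runs on its own. The names build_demo_db and related_cities are hypothetical, and the tag assignments are invented for illustration (only the three city names come from the question). In PHP the same idea would be a prepared statement with bound parameters.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table cities (id integer primary key, name text);
        create table cities_tags (city_id integer, tag_id integer);
        insert into cities values (1, 'Paris'), (2, 'London'), (3, 'New York');
        -- Invented tag assignments, not taken from the question.
        insert into cities_tags values
            (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 3);
    """)
    return conn


def related_cities(conn, city_id):
    """Run the query above with the chosen city passed as a parameter."""
    # The two "?" placeholders replace the hard-coded 1.
    # "* 1.0" forces float division in SQLite; MySQL's "/" already
    # returns a decimal, so it is not needed there.
    sql = """
        select c.name,
               cnt.val * 1.0 / (select count(*) from cities) as score
        from cities c
        inner join (
            select city_id, count(*) as val
            from cities_tags
            where tag_id in (select tag_id from cities_tags where city_id = ?)
              and city_id != ?
            group by city_id
        ) as cnt on c.id = cnt.city_id
        order by score desc
    """
    return conn.execute(sql, (city_id, city_id)).fetchall()


if __name__ == "__main__":
    for name, score in related_cities(build_demo_db(), 1):
        print(name, round(score, 4))
```

    Calling related_cities(conn, 2) would rank the other cities against London instead, with no change to the SQL text.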

    If I understand the Jaccard index correctly, the query returns that value for each other city, ordered from most to least closely related. The results in our example look like this:

    |name      |jaccard_index  |
    |London    |0.6667         |
    |New York  |0.3333         |
    

    Edit

    With a better understanding of how to implement Jaccard Index:

    After reading a bit more on Wikipedia about the Jaccard index, I've come up with a better way to implement the query for our example dataset. Essentially, we compare the chosen city with each other city in the list independently, and divide the count of tags the two cities have in common by the count of distinct tags across both cities.
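    To make the definition concrete, here is a minimal sketch of the same calculation on plain Python sets, outside the database. The function name jaccard_index and the tag ids are hypothetical and chosen only for illustration.

```python
def jaccard_index(tags_a, tags_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        # Two empty tag sets: treat the index as 0 here (an assumption)
        # to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    chosen = {1, 2}            # hypothetical tags of the chosen city
    same_tags = {1, 2}         # shares every tag with the chosen city
    partly_related = {1, 3}    # shares one tag, adds one new tag
    print(jaccard_index(chosen, same_tags))
    print(jaccard_index(chosen, partly_related))
```

    The union size equals the chosen city's tag count plus the other city's tags that the chosen city lacks, which is the quantity the query below computes.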

    select c.name, 
      case -- when this city's tags are a subset of the chosen city's tags
        when not_in.cnt is null 
      then -- then the union count is the chosen city's tag count
        intersection.cnt/(select count(tag_id) from cities_tags where city_id=1) 
      else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
        intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1)) 
      end as jaccard_index
      -- The Jaccard index is the size of the intersection of two sets divided by the size of their union
    from cities c 
    inner join 
      (
        --  select the count of tags for each city that match our chosen city
        select city_id, count(*) as cnt 
        from cities_tags 
        where tag_id in (select tag_id from cities_tags where city_id=1) 
        and city_id!=1
        group by city_id
      ) as intersection
    on c.id=intersection.city_id
    left join
      (
        -- select the count of tags for each city that are not in our chosen city's tag list
        select city_id, count(tag_id) as cnt
        from cities_tags
        where city_id!=1
        and not tag_id in (select tag_id from cities_tags where city_id=1)
        group by city_id
      ) as not_in
    on c.id=not_in.city_id
    order by jaccard_index desc
    

    The query is a bit lengthy, and I don't know how well it will scale, but it does implement a true Jaccard Index, as requested in the question. Here are the results with the new query:
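    As a sanity check, the query can be run against a small in-memory SQLite database and compared with a direct set calculation. This is only a sketch: it uses Python and SQLite instead of PHP and MySQL, the tag assignments are invented, and build_demo_db and check_against_sets are hypothetical names. The "* 1.0" is added because SQLite would otherwise do integer division.

```python
import sqlite3

JACCARD_SQL = """
select c.name,
  case
    when not_in.cnt is null
    then intersection.cnt * 1.0
         / (select count(tag_id) from cities_tags where city_id = :city)
    else intersection.cnt * 1.0
         / (not_in.cnt
            + (select count(tag_id) from cities_tags where city_id = :city))
  end as jaccard_index
from cities c
inner join (
    select city_id, count(*) as cnt
    from cities_tags
    where tag_id in (select tag_id from cities_tags where city_id = :city)
      and city_id != :city
    group by city_id
) as intersection on c.id = intersection.city_id
left join (
    select city_id, count(tag_id) as cnt
    from cities_tags
    where city_id != :city
      and tag_id not in (select tag_id from cities_tags where city_id = :city)
    group by city_id
) as not_in on c.id = not_in.city_id
order by jaccard_index desc
"""


def build_demo_db():
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table cities (id integer primary key, name text);
        create table cities_tags (city_id integer, tag_id integer);
        insert into cities values (1, 'Paris'), (2, 'London'), (3, 'New York');
        insert into cities_tags values
            (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 3);
    """)
    return conn


def check_against_sets(conn, city_id):
    """Return the SQL scores after comparing them with set arithmetic."""
    tags = {}
    for cid, tid in conn.execute("select city_id, tag_id from cities_tags"):
        tags.setdefault(cid, set()).add(tid)
    names = dict(conn.execute("select id, name from cities"))
    sql_scores = dict(conn.execute(JACCARD_SQL, {"city": city_id}))
    chosen = tags[city_id]
    for cid, other in tags.items():
        if cid == city_id or not (chosen & other):
            continue  # the inner join drops cities with no common tag
        expected = len(chosen & other) / len(chosen | other)
        assert abs(sql_scores[names[cid]] - expected) < 1e-9
    return sql_scores


if __name__ == "__main__":
    print(check_against_sets(build_demo_db(), 1))
```

    Note that cities sharing no tag with the chosen city do not appear in the result at all, because of the inner join.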

    +----------+---------------+
    | name     | jaccard_index |
    +----------+---------------+
    | London   |        1.0000 |
    | New York |        0.3333 |
    +----------+---------------+
    

    Edited again to add comments to the query, and to handle the case where the current city's tags are a subset of the chosen city's tags.
