I have a cities table which looks like this.
|id| Name |
|1 | Paris |
|2 | London |
|3 | New York|
I have a tags table which looks li
select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c
inner join
(
    select city_id, count(*) as val
    from cities_tags
    where tag_id in (select tag_id from cities_tags where city_id=1)
    and not city_id in (1)
    group by city_id
) as cnt
on c.id=cnt.city_id
order by jaccard_index desc
This query refers statically to `city_id=1`, so you'll have to make that a variable in both the `where tag_id in` clause and the `not city_id in` clause.
If I understand the Jaccard index correctly, the query returns that value for each other city, ordered from most to least closely related. The results in our example look like this:
|name |jaccard_index |
|London |0.6667 |
|New York |0.3333 |
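Turning the hard-coded city id into a variable, as mentioned above, could look roughly like the sketch below. It runs the same query through Python's `sqlite3` module with a named placeholder. Everything outside the query is an assumption: the `cities_tags` rows are invented (chosen so the output matches the table above), `related_cities` is a hypothetical helper, and SQLite is used only to keep the snippet self-contained. SQLite divides integers as integers, so the count is multiplied by `1.0`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cities (id integer primary key, name text);
    create table cities_tags (city_id integer, tag_id integer);
    insert into cities values (1, 'Paris'), (2, 'London'), (3, 'New York');
    -- hypothetical tag rows, not taken from the question
    insert into cities_tags values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 3);
""")

# :city_id replaces the literal 1 in both places where the query used it.
QUERY = """
    select c.name, cnt.val * 1.0 / (select count(*) from cities) as jaccard_index
    from cities c
    inner join (
        select city_id, count(*) as val
        from cities_tags
        where tag_id in (select tag_id from cities_tags where city_id = :city_id)
          and city_id != :city_id
        group by city_id
    ) as cnt on c.id = cnt.city_id
    order by jaccard_index desc
"""

def related_cities(conn, city_id):
    """Run the query for the given city and return (name, score) rows."""
    return conn.execute(QUERY, {"city_id": city_id}).fetchall()

rows = related_cities(conn, 1)
for name, score in rows:
    print(name, round(score, 4))
```

In MySQL the same idea should work with a prepared statement or a user variable in place of the named placeholder.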
With a better understanding of how to implement the Jaccard index:
After reading a bit more on Wikipedia about the Jaccard index, I've come up with a better way to implement a query for our example dataset. Essentially, we compare the chosen city with each other city independently, dividing the count of tags the two cities have in common by the count of distinct tags across both cities.
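As a quick illustration of that definition, here is a minimal sketch using plain Python sets. The function name `jaccard` and the tag ids are made up for the example, and returning `0.0` for two empty sets is just a convention picked here. The assertion in the middle shows that the union can be counted as the chosen city's tags plus the other city's tags that the chosen city lacks.

```python
def jaccard(a, b):
    """Return |a & b| / |a | b| for two iterables of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets empty: the index is undefined, 0.0 is a choice made here.
        return 0.0
    return len(a & b) / len(union)

# Hypothetical tag ids, not taken from the question.
chosen_tags = {1, 2}
other_tags = {1, 3}

# The union equals the chosen set plus whatever the other set adds to it.
assert len(chosen_tags | other_tags) == len(chosen_tags) + len(other_tags - chosen_tags)

index = jaccard(chosen_tags, other_tags)  # 1 shared tag out of 3 distinct tags
print(round(index, 4))
```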
select c.name,
    case -- when this city's tags are a subset of the chosen city's tags
        when not_in.cnt is null
        then -- the union count is the chosen city's tag count
            intersection.cnt/(select count(tag_id) from cities_tags where city_id=1)
        else -- otherwise the union count is the chosen city's tag count plus this city's tags that are not in the chosen city's tag list
            intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1))
    end as jaccard_index
    -- the Jaccard index is defined as the size of the intersection of two sets divided by the size of their union
from cities c
inner join
(
    -- select the count of tags for each city that match our chosen city
    select city_id, count(*) as cnt
    from cities_tags
    where tag_id in (select tag_id from cities_tags where city_id=1)
    and city_id!=1
    group by city_id
) as intersection
on c.id=intersection.city_id
left join
(
    -- select the count of tags for each city that are not in our chosen city's tag list
    select city_id, count(tag_id) as cnt
    from cities_tags
    where city_id!=1
    and not tag_id in (select tag_id from cities_tags where city_id=1)
    group by city_id
) as not_in
on c.id=not_in.city_id
order by jaccard_index desc
The query is a bit lengthy, and I don't know how well it will scale, but it does implement a true Jaccard Index, as requested in the question. Here are the results with the new query:
+----------+---------------+
| name | jaccard_index |
+----------+---------------+
| London | 1.0000 |
| New York | 0.3333 |
+----------+---------------+
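One way to gain some confidence in the longer query is to compare it with the set definition directly. The sketch below is only a check under assumptions: it uses SQLite through Python's `sqlite3`, invented `cities_tags` rows (picked so the numbers match the table above), hypothetical helpers `tags_of` and `check`, and `coalesce` in place of the `case` expression. The count is multiplied by `1.0` because SQLite would otherwise do integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cities (id integer primary key, name text);
    create table cities_tags (city_id integer, tag_id integer);
    insert into cities values (1, 'Paris'), (2, 'London'), (3, 'New York');
    -- hypothetical tag rows, not taken from the question
    insert into cities_tags values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 3);
""")

JACCARD_QUERY = """
    select c.name,
           intersection.cnt * 1.0
             / (coalesce(not_in.cnt, 0)
                + (select count(tag_id) from cities_tags where city_id = :city_id))
             as jaccard_index
    from cities c
    inner join (
        select city_id, count(*) as cnt
        from cities_tags
        where tag_id in (select tag_id from cities_tags where city_id = :city_id)
          and city_id != :city_id
        group by city_id
    ) as intersection on c.id = intersection.city_id
    left join (
        select city_id, count(tag_id) as cnt
        from cities_tags
        where city_id != :city_id
          and tag_id not in (select tag_id from cities_tags where city_id = :city_id)
        group by city_id
    ) as not_in on c.id = not_in.city_id
    order by jaccard_index desc
"""

def tags_of(conn, city_id):
    """Return the set of tag ids attached to one city."""
    rows = conn.execute("select tag_id from cities_tags where city_id = ?", (city_id,))
    return {tag_id for (tag_id,) in rows}

def check(conn, city_id):
    """Run the query and compare each score with |A & B| / |A | B| computed from sets."""
    chosen = tags_of(conn, city_id)
    ids_by_name = dict(conn.execute("select name, id from cities"))
    results = conn.execute(JACCARD_QUERY, {"city_id": city_id}).fetchall()
    for name, index in results:
        other = tags_of(conn, ids_by_name[name])
        expected = len(chosen & other) / len(chosen | other)
        assert abs(index - expected) < 1e-9, (name, index, expected)
    return results

results = check(conn, 1)
for name, index in results:
    print(name, round(index, 4))
```

Because of the inner join, cities that share no tags with the chosen city do not appear in the result at all, so the check only covers cities with at least one common tag.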
Edited again to add comments to the query and to take into account the case where the current city's tags are a subset of the chosen city's tags.