SQL query to find rows with the most matching keywords

你说的曾经没有我的故事 提交于 2019-12-11 02:01:21

问题


I'm really bad at SQL and I would like to know what SQL I can run to solve the problem below which I suspect to be a NP-Complete problem but I'm ok with the query taking a long time to run over large datasets as this will be done as a background task. A standard sql statement is preferred but if a stored procedure is required then so be it. The SQL is required to run on Postgres 9.3.

Problem: Given a set of articles that contain a set of keywords, find the top n articles for each article that contains the most number of matching keywords.

A trimmed down version of the article table looks like this:

CREATE TABLE article (
  id character varying(36) NOT NULL,  -- primary key of article
  keywords character varying,         -- comma separated set of keywords

  CONSTRAINT pk_article PRIMARY KEY (id)
);

-- Test Data
INSERT INTO article(id, keywords) VALUES(0, 'red,green,blue');
INSERT INTO article(id, keywords) VALUES(1, 'red,green,yellow');
INSERT INTO article(id, keywords) VALUES(2, 'purple,orange,blue');
INSERT INTO article(id, keywords) VALUES(3, 'lime,violet,ruby,teal');
INSERT INTO article(id, keywords) VALUES(4, 'red,green,blue,yellow');
INSERT INTO article(id, keywords) VALUES(5, 'yellow,brown,black');
INSERT INTO article(id, keywords) VALUES(6, 'black,white,blue');

Which would result in this for a SELECT * FROM article; query:

Table: article
------------------------
id  keywords            
------------------------
0   red,green,blue      
1   red,green,yellow    
2   purple,orange,blue  
3   lime,violet,ruby,teal
4   red,green,blue,yellow
5   yellow,brown,black
6   black,white,blue

Assuming I want to find the top 3 articles for each article that contains the most number of matching keywords then the output should be this:

------------------------
id  related
------------------------
0   4,1,6
1   4,0,5
2   0,4,6
3   null
4   0,1,6
5   1,6
6   5,0,4

回答1:


Like @a_horse commented: This would be simpler with a normalized design (besides making other tasks simpler/ cleaner), but still not trivial.

Also, a PK column of data type character varying(36) is highly suspicious (and inefficient) and should most probably be an integer type or at least a uuid instead.

Here is one possible solution based on your design as is:

WITH cte AS (
   SELECT id, string_to_array(a.keywords, ',') AS keys
   FROM   article a
   )
SELECT id, string_agg(b_id, ',') AS best_matches
FROM  (
   SELECT a.id, b.id AS b_id
        , row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
   FROM   cte a
   LEFT   JOIN cte b ON a.id <> b.id AND a.keys && b.keys
   LEFT   JOIN LATERAL (
      SELECT count(*) AS ct
      FROM  (
         SELECT * FROM unnest(a.keys)
         INTERSECT ALL
         SELECT * FROM unnest(b.keys)
         ) i
      ) ct ON TRUE
   ORDER  BY a.id, ct.ct DESC, b.id  -- b.id as tiebreaker
   ) sub
WHERE  rn < 4
GROUP  BY 1;

SQL Fiddle (using an integer id instead).

The CTE cte converts the string into an array. You could even have a functional GIN index like that ...

If multiple rows tie for the top 3 picks, you need to define a tiebreaker. In my example, rows with smaller id come first.

Detailed explanation in this recent related answer:

  • Query and order by number of matches in JSON array

    The comparison is between a JSON array and an SQL array, but it's basically the same problem, burns down to the same solution(s). Also comparing a couple of similar alternatives.

To make this fast, you should at least have a GIN index on the array column (instead of the comma-separated string) and the query wouldn't need the CTE step. A completely normalized design has other advantages, but won't necessarily be faster than an array with GIN index.




回答2:


You can store lists in comma-separated strings. No problem, as long as this is just a string for you and you are not interested in its separate values. As soon as you are interested in the separate values, as in your example, store them separately.

This said, correct your database design and only then think about the query.

The following query selects all ID pairs first and counts common keywords. It then ranks the pairs by giving the other ID with the most keywords in common rank #1, etc. Then you keep only the three best matching IDs. STRING_AGG lists the best matching IDs in a string ordered by the number of keywords in common.

select 
  this_article as id,
  string_agg(other_article, ',' order by rn) as related
from
(
  select 
    this_article, 
    other_article, 
    row_number() over (partition by this_article order by cnt_common desc) as rn
  from
  (
    select 
      this.id as this_article, 
      other.id as other_article, 
      count(other.id) as cnt_common
    from keywords this
    left join keywords other on other.keyword = this.keyword and other.id <> this.id
    group by this.id, other.id
  ) pairs
) ranked
where rn <= 3
group by this_article
order by this_article;

Here is the SQL fiddle: http://sqlfiddle.com/#!15/1d20c/9.



来源:https://stackoverflow.com/questions/31512506/sql-query-to-find-rows-with-the-most-matching-keywords

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!