Need help with SQL for ranking search results

心不动则不痛 submitted on 2019-12-22 00:26:29

Question


I am trying to build a tiny exercise search engine using MySQL.

Each exercise can have an arbitrary number of search tags.

Here is my data structure:

TABLE exercises
  ID
  title

TABLE searchtags
  ID
  title

TABLE exerciseSearchtags
  exerciseID -> exercises.ID
  searchtagID -> searchtags.ID

...where exerciseSearchtags is a many-to-many join table expressing the relationship between exercises and searchtags.

The search engine accepts an unknown number of user-entered keywords.
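For anyone who wants to try the queries in this thread, here is a minimal sketch of the structure above. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the column types, the sample rows and the `tags_for` helper are all assumptions made up for illustration, not part of the original schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (
        ID    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE searchtags (
        ID    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    -- Many-to-many join table: one row per (exercise, tag) pair.
    CREATE TABLE exerciseSearchtags (
        exerciseID  INTEGER NOT NULL REFERENCES exercises(ID),
        searchtagID INTEGER NOT NULL REFERENCES searchtags(ID),
        PRIMARY KEY (exerciseID, searchtagID)
    );
    -- Hypothetical sample rows, invented for illustration.
    INSERT INTO exercises VALUES (1, 'Push-up'), (2, 'Squat');
    INSERT INTO searchtags VALUES (1, 'chest'), (2, 'arms'), (3, 'legs');
    INSERT INTO exerciseSearchtags VALUES (1, 1), (1, 2), (2, 3);
""")


def tags_for(exercise_id):
    """Return the tag titles attached to one exercise, sorted alphabetically."""
    rows = conn.execute(
        """
        SELECT searchtags.title
        FROM exerciseSearchtags
        JOIN searchtags ON searchtags.ID = exerciseSearchtags.searchtagID
        WHERE exerciseSearchtags.exerciseID = ?
        ORDER BY searchtags.title
        """,
        (exercise_id,),
    ).fetchall()
    return [title for (title,) in rows]
```

Calling `tags_for(1)` walks the join table from an exercise to its tags, which is the same path the search query below follows.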

I would like to rank search results based on the number of keyword / searchtag matches.

Here is the SQL I am currently using to select exercises. Both the CASE rules and the WHERE rules are dynamically generated, one for each keyword. So, for example, if a user enters 3 keywords, there will be 3 CASE rules and 3 WHERE rules.

    SELECT 
        exercises.ID AS ID,
        exercises.title AS title, 
        (
            (CASE WHEN searchtags.title LIKE CONCAT('%',?,'%') THEN 1 ELSE 0 END)+
            (CASE WHEN searchtags.title LIKE CONCAT('%',?,'%') THEN 1 ELSE 0 END)+
            ...etc...
            (CASE WHEN searchtags.title LIKE CONCAT('%',?,'%') THEN 1 ELSE 0 END)
        ) AS relevance

    FROM 
        exercises

    LEFT JOIN exerciseSearchtags
        ON exerciseSearchtags.exerciseID = exercises.ID 

    LEFT JOIN searchtags
        ON searchtags.ID = exerciseSearchtags.searchtagID

    WHERE
        searchtags.title LIKE CONCAT('%',?,'%') OR
        searchtags.title LIKE CONCAT('%',?,'%') OR
        ...etc...
        searchtags.title LIKE CONCAT('%',?,'%') 

    GROUP BY 
        exercises.ID                

    ORDER BY 
        relevance DESC

This almost works. However, the results are not being ranked in the order I would expect.

My best guess as to why this is happening is that the relevance score is being calculated BEFORE the rows are grouped by exercises.ID. So if the left join causes a particular exercise to appear 10 times in the result set, and another exercise to appear 4 times, then the first exercise may get a higher relevance score, even though it may not have more keyword / searchtag matches.
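To see what the CASE arithmetic produces before GROUP BY collapses the rows, here is a small sketch that runs the query above without the GROUP BY. It uses Python's sqlite3 module as a stand-in for MySQL, so `CONCAT('%',?,'%')` is written as `'%' || ? || '%'`. The sample rows and the two keywords are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE searchtags (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE exerciseSearchtags (exerciseID INTEGER, searchtagID INTEGER);
    -- Hypothetical sample data: Push-up carries two tags, Squat carries one.
    INSERT INTO exercises VALUES (1, 'Push-up'), (2, 'Squat');
    INSERT INTO searchtags VALUES (1, 'chest'), (2, 'arms'), (3, 'legs');
    INSERT INTO exerciseSearchtags VALUES (1, 1), (1, 2), (2, 3);
""")

keywords = ["chest", "arms"]  # hypothetical user input

# One CASE rule and one WHERE rule per keyword, as in the question.
case_rule = "(CASE WHEN searchtags.title LIKE '%' || ? || '%' THEN 1 ELSE 0 END)"
where_rule = "searchtags.title LIKE '%' || ? || '%'"
score_expr = " + ".join([case_rule] * len(keywords))
where_expr = " OR ".join([where_rule] * len(keywords))

# Same joins and filters as the question, but with no GROUP BY, so every
# joined (exercise, tag) row stays visible together with its own score.
sql = (
    "SELECT exercises.ID, searchtags.title, " + score_expr + " AS relevance "
    "FROM exercises "
    "LEFT JOIN exerciseSearchtags ON exerciseSearchtags.exerciseID = exercises.ID "
    "LEFT JOIN searchtags ON searchtags.ID = exerciseSearchtags.searchtagID "
    "WHERE " + where_expr + " "
    "ORDER BY exercises.ID, searchtags.title"
)

# Placeholders appear first in the SELECT list, then in the WHERE clause.
per_row = conn.execute(sql, keywords + keywords).fetchall()
```

Each joined row is scored on its own, so the exercise that matches both keywords comes back as two rows with relevance 1 each rather than one row with relevance 2. As far as I know, when MySQL allows a non-aggregated expression alongside GROUP BY, it is free to take the value from any one row of the group, which would fit the unexpected ranking described here.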

Does anyone have any suggestions / advice on how I can prevent this from happening / fix this?

Thanks (in advance) for your help.


Answer 1:


I have found a working solution to the above problem, and am posting it here, in case anyone else experiences a similar problem.

The solution is to use a sub-select instead of a CASE expression. Here is the above snippet of code, corrected. (I do not know if this is the best or most efficient solution, but it has fixed the problem for me for the time being, and it seems to return search results reasonably quickly.)

SELECT 
    exercises.ID AS ID,
    exercises.title AS title, 
    (
        (
            SELECT COUNT(1) 
            FROM searchtags 
            LEFT JOIN exerciseSearchtags 
            ON exerciseSearchtags.searchtagID = searchtags.ID 
            WHERE searchtags.title LIKE CONCAT('%',?,'%') 
            AND exerciseSearchtags.exerciseID = exercises.ID
        )+
        (
            SELECT COUNT(1) 
            FROM searchtags 
            LEFT JOIN exerciseSearchtags 
            ON exerciseSearchtags.searchtagID = searchtags.ID 
            WHERE searchtags.title LIKE CONCAT('%',?,'%') 
            AND exerciseSearchtags.exerciseID = exercises.ID
        )+
        ...etc...
        (
            SELECT COUNT(1) 
            FROM searchtags 
            LEFT JOIN exerciseSearchtags 
            ON exerciseSearchtags.searchtagID = searchtags.ID 
            WHERE searchtags.title LIKE CONCAT('%',?,'%') 
            AND exerciseSearchtags.exerciseID = exercises.ID
        )
    ) AS relevance

FROM 
    exercises

LEFT JOIN exerciseSearchtags
    ON exerciseSearchtags.exerciseID = exercises.ID 

LEFT JOIN searchtags
    ON searchtags.ID = exerciseSearchtags.searchtagID

WHERE
    searchtags.title LIKE CONCAT('%',?,'%') OR
    searchtags.title LIKE CONCAT('%',?,'%') OR
    ...etc...
    searchtags.title LIKE CONCAT('%',?,'%') 

GROUP BY 
    exercises.ID                

ORDER BY 
    relevance DESC
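As a quick check of the sub-select version, here is a sketch that builds the same query for a variable number of keywords and runs it against a few invented rows. It uses Python's sqlite3 module in place of MySQL (so the LIKE pattern is built with `||` rather than `CONCAT`), and the `search` helper and sample data are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE searchtags (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE exerciseSearchtags (exerciseID INTEGER, searchtagID INTEGER);
    -- Hypothetical sample data.
    INSERT INTO exercises VALUES (1, 'Push-up'), (2, 'Squat'), (3, 'Bench press');
    INSERT INTO searchtags VALUES (1, 'chest'), (2, 'arms'), (3, 'legs');
    INSERT INTO exerciseSearchtags VALUES (1, 1), (1, 2), (2, 3), (3, 1);
""")

# One correlated sub-select per keyword: counts this exercise's tags that
# match the keyword, independent of how many rows the outer join produced.
SUBSELECT = """(
    SELECT COUNT(1)
    FROM searchtags
    LEFT JOIN exerciseSearchtags
        ON exerciseSearchtags.searchtagID = searchtags.ID
    WHERE searchtags.title LIKE '%' || ? || '%'
    AND exerciseSearchtags.exerciseID = exercises.ID
)"""
WHERE_RULE = "searchtags.title LIKE '%' || ? || '%'"


def search(keywords):
    """Return (ID, title, relevance) rows, most relevant first."""
    if not keywords:
        return []
    sql = (
        "SELECT exercises.ID AS ID, exercises.title AS title, "
        + " + ".join([SUBSELECT] * len(keywords)) + " AS relevance "
        "FROM exercises "
        "LEFT JOIN exerciseSearchtags ON exerciseSearchtags.exerciseID = exercises.ID "
        "LEFT JOIN searchtags ON searchtags.ID = exerciseSearchtags.searchtagID "
        "WHERE " + " OR ".join([WHERE_RULE] * len(keywords)) + " "
        "GROUP BY exercises.ID "
        "ORDER BY relevance DESC, exercises.ID"
    )
    # Placeholders: the sub-selects come first, then the WHERE rules.
    return conn.execute(sql, list(keywords) + list(keywords)).fetchall()
```

With the sample rows, `search(["chest", "arms"])` ranks the exercise tagged with both keywords ahead of the one tagged with only one. The extra `exercises.ID` in the ORDER BY is my own addition, there only to make ties come out in a stable order.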



Answer 2:


Divide and conquer. Instead of trying to do it all in one statement, try decomposing the problem into smaller pieces. For instance, first create a temporary table with all the exercises that have at least one of the search tags. Then make a second pass to rank each exercise in the temp table. Finally, select the result ordered by ranking.
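One way those three passes might look, sketched with Python's sqlite3 module standing in for MySQL. The temp table name `candidates`, the `ranked_search` helper and the sample rows are all hypothetical; the answer itself does not prescribe any of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE searchtags (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE exerciseSearchtags (exerciseID INTEGER, searchtagID INTEGER);
    -- Hypothetical sample data.
    INSERT INTO exercises VALUES (1, 'Push-up'), (2, 'Squat'), (3, 'Bench press');
    INSERT INTO searchtags VALUES (1, 'chest'), (2, 'arms'), (3, 'legs');
    INSERT INTO exerciseSearchtags VALUES (1, 1), (1, 2), (2, 3), (3, 1);
""")

LIKE_RULE = "searchtags.title LIKE '%' || ? || '%'"


def ranked_search(keywords):
    """Three passes: collect candidates, score them, read them back ranked."""
    if not keywords:
        return []

    # Pass 1: temp table holding every exercise with at least one matching tag.
    conn.execute("DROP TABLE IF EXISTS temp.candidates")
    conn.execute(
        "CREATE TEMP TABLE candidates ("
        "ID INTEGER PRIMARY KEY, relevance INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "INSERT INTO candidates (ID) "
        "SELECT DISTINCT exerciseSearchtags.exerciseID "
        "FROM exerciseSearchtags "
        "JOIN searchtags ON searchtags.ID = exerciseSearchtags.searchtagID "
        "WHERE " + " OR ".join([LIKE_RULE] * len(keywords)),
        list(keywords),
    )

    # Pass 2: for each keyword, add the number of matching tags to the score.
    for keyword in keywords:
        conn.execute(
            "UPDATE candidates SET relevance = relevance + ("
            "SELECT COUNT(*) FROM exerciseSearchtags "
            "JOIN searchtags ON searchtags.ID = exerciseSearchtags.searchtagID "
            "WHERE exerciseSearchtags.exerciseID = candidates.ID "
            "AND " + LIKE_RULE + ")",
            (keyword,),
        )

    # Pass 3: read the candidates back, highest score first.
    return conn.execute(
        "SELECT exercises.ID, exercises.title, candidates.relevance "
        "FROM candidates JOIN exercises ON exercises.ID = candidates.ID "
        "ORDER BY candidates.relevance DESC, exercises.ID"
    ).fetchall()
```

Because each pass is a simple statement, the intermediate table can be inspected between steps, which makes a wrong score easier to track down than in a single large query.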




Answer 3:


I have only done something similar for MSSQL, not MySQL, so this might not be relevant at all, but it's worth a shot :)

I had to put the CASE expressions in the ORDER BY clause to get it to pick them up correctly, e.g.:

ORDER BY
    CASE WHEN searchtags.title LIKE CONCAT('%',?,'%') THEN 1 ELSE 0 END +
    CASE WHEN searchtags.title LIKE CONCAT('%',?,'%') THEN 1 ELSE 0 END +
    ...etc...
    CASE WHEN searchtags.title LIKE CONCAT('%',?,'%') THEN 1 ELSE 0 END DESC

I also left them in the SELECT so I could output the relevance on the page (as requested).
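A variation on this idea that is not part of the answer as posted: if the CASE expressions are kept together with GROUP BY exercises.ID, wrapping each one in SUM() turns the score into an aggregate over all of an exercise's joined rows. The sketch below tries that with Python's sqlite3 module as a stand-in for MySQL; the `search` helper and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE searchtags (ID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE exerciseSearchtags (exerciseID INTEGER, searchtagID INTEGER);
    -- Hypothetical sample data.
    INSERT INTO exercises VALUES (1, 'Push-up'), (2, 'Squat'), (3, 'Bench press');
    INSERT INTO searchtags VALUES (1, 'chest'), (2, 'arms'), (3, 'legs');
    INSERT INTO exerciseSearchtags VALUES (1, 1), (1, 2), (2, 3), (3, 1);
""")

# SUM(...) adds up the per-row 0/1 flags across each exercise's group.
SUM_RULE = "SUM(CASE WHEN searchtags.title LIKE '%' || ? || '%' THEN 1 ELSE 0 END)"
WHERE_RULE = "searchtags.title LIKE '%' || ? || '%'"


def search(keywords):
    """Return (ID, title, relevance) rows ordered by the summed CASE score."""
    if not keywords:
        return []
    score = " + ".join([SUM_RULE] * len(keywords))
    sql = (
        "SELECT exercises.ID, exercises.title, " + score + " AS relevance "
        "FROM exercises "
        "LEFT JOIN exerciseSearchtags ON exerciseSearchtags.exerciseID = exercises.ID "
        "LEFT JOIN searchtags ON searchtags.ID = exerciseSearchtags.searchtagID "
        "WHERE " + " OR ".join([WHERE_RULE] * len(keywords)) + " "
        "GROUP BY exercises.ID "
        # The score expression is repeated in ORDER BY, as in the answer above.
        "ORDER BY " + score + " DESC, exercises.ID"
    )
    # Placeholders appear in the SELECT list, the WHERE clause, then ORDER BY.
    return conn.execute(sql, list(keywords) * 3).fetchall()
```

With the sample rows, `search(["chest", "arms"])` puts the exercise that matches both keywords first. I have only tried this shape of query on the toy data here, so treat it as a starting point rather than a tested MySQL solution.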

Either way, good luck with it!



Source: https://stackoverflow.com/questions/4075026/need-help-with-sql-for-ranking-search-results
