In my PHP application, I have a MySQL table of articles with the following columns:
article_id, articletext, category_id, score
Just for learning purposes, I made a test with 3 categories. I have no idea how this query would perform on a large recordset.
-- Rank the articles within each category by descending score,
-- then interleave the categories rank by rank.
select * from (
    (select @r:=@r+1 as rownum, article_id, articletext, category_id, score
     from articles, (select @r:=0) as r
     where category_id = 1
     order by score desc limit 100000000)   -- the huge LIMIT forces MySQL to apply the ORDER BY inside the UNION branch
    union all
    (select @r1:=@r1+1, article_id, articletext, category_id, score
     from articles, (select @r1:=0) as r
     where category_id = 2
     order by score desc limit 100000000)
    union all
    (select @r2:=@r2+1, article_id, articletext, category_id, score
     from articles, (select @r2:=0) as r
     where category_id = 3
     order by score desc limit 100000000)
) as t
order by rownum, score desc
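For the large-recordset concern, each UNION branch filters on category_id and sorts on score, so a composite index on both columns should let MySQL read each branch in index order instead of sorting the whole table. A minimal sketch (the index name is just an example):

    alter table articles add index idx_category_score (category_id, score);

    -- check the plan for a single branch
    explain select article_id, score
    from articles
    where category_id = 1
    order by score desc limit 20;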
Your naive solution is exactly what I would do.
Go get the top 20. If they don't satisfy the requirements, do an additional query to get the missing pieces. You should be able to come up with some balance between number of queries and number of rows each returns.
If you got the top 100, it might satisfy the requirements 90% of the time, and it would be cheaper and faster than 10 separate queries.
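A minimal sketch of that approach, assuming the rule is roughly "top 20 overall, but with something from every category"; the LIMIT sizes and the placeholder for the under-represented category are illustrative only:

    -- Step 1: one cheap query for a generous slice; check it in PHP
    select article_id, articletext, category_id, score
    from articles
    order by score desc
    limit 100;

    -- Step 2: only if a category is missing from that slice,
    -- fetch its best rows separately and merge them in PHP
    select article_id, articletext, category_id, score
    from articles
    where category_id = ?          -- the category that came up short
    order by score desc
    limit 20;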
If it were SQL Server I could help more...
Actually, I have another idea. Run a process every 5 minutes that calculates the list and caches it in a table. Make DML against the related tables invalidate the cache so it is not used until repopulated (perhaps an article was deleted). If the cache is invalid, you would fall back to calculating the list on the fly, and you could use that result to repopulate the cache anyway.
It might be possible to strategically update the cached list rather than recalculate it. But that could be a real challenge.
This should help both with query speed and reducing load on your database. It shouldn't matter much if your article list is 5 minutes out of date. Heck, even 1 minute might work.
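A rough sketch of the cached-list idea; the table and trigger names here are invented, and the repopulation step would run your interleaving query from the question (via cron, or a MySQL event if the event scheduler is enabled):

    -- Cache table holding the precomputed list
    create table cached_article_list (
        position    int not null primary key,   -- display order, 1..20
        article_id  int not null,
        category_id int not null,
        score       int not null,
        built_at    timestamp not null default current_timestamp
    );

    -- Every 5 minutes: delete from cached_article_list, then insert the
    -- top 20 rows produced by the interleaving query from the question.

    -- Invalidation: writes against articles empty the cache, and the
    -- application treats an empty cache as "compute on the fly instead"
    create trigger articles_delete_invalidates_cache
    after delete on articles
    for each row
        delete from cached_article_list;

Similar triggers on insert and update could cover the other kinds of DML you want to invalidate on.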