Get ranking of words over date based on frequency in PostgreSQL

Question

I have a database that stores twitter data:

        CREATE TABLE tweet(
            ID BIGINT UNIQUE,
            user_ID BIGINT,
            created_at TIMESTAMPTZ,
            tweet TEXT);

I'm trying to write a query that goes through the words in tweet for all rows, gets the frequency of each word, and returns the ten most frequent words along with each word's ranking on each date.

Example:

{"word1": [1,20,22,23,24,25,26,27,28,29,30,29,28,27,26,25,26,27,28,29,30,29,28,29,28,27,28,29,30,30,...],
 "word2": [...]}

My current query gets the top ten words, but I am having some trouble getting the rankings of those words for each day.

Current query:

    SELECT word, count(*)
    FROM (
        SELECT regexp_split_to_table(
            regexp_replace(tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'), '\s+')
        AS word
        FROM tweet
    ) t
    GROUP BY word
    ORDER BY count(*) DESC
    LIMIT 10;

Which returns:

[('vaccine', 286669),
 ('covid', 213857),
 ('yum', 141345),
 ('pfizer', 39532),
 ('people', 28960),
 ('beer', 27117),
 ('say', 24569),
 ('virus', 23682),
 ('want', 21988),
 ('foo', 19823)]

Answer 1:

If you want the top 10 per day, you can do:

    select *
    from (
        select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
            rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
        from tweet t
        cross join lateral regexp_split_to_table(
            regexp_replace(tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'),
            '\s+'
        ) w(word)
        group by created_day, word
    ) t
    where rn <= 10
    order by created_day, rn desc;
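One thing to watch with rank(): tied counts share a rank, so a day can return more than 10 rows. A minimal sketch on made-up sample rows compares rank() with row_number(), which numbers rows without ties or gaps once a tie-breaker is added. The inline tweet CTE and its values are hypothetical, and the stop-word regexp_replace step is left out for brevity.

```sql
-- Hypothetical sample data standing in for the real tweet table.
with tweet(created_at, tweet_clean) as (
    values
        (timestamptz '2020-12-01 12:00+00', 'vaccine covid vaccine'),
        (timestamptz '2020-12-01 12:05+00', 'covid pfizer'),
        (timestamptz '2020-12-05 12:00+00', 'vaccine pfizer vaccine')
)
select date_trunc('day', t.created_at) as created_day,
       w.word,
       count(*) as cnt,
       -- rank(): tied words share a rank and the next rank is skipped
       rank() over (
           partition by date_trunc('day', t.created_at)
           order by count(*) desc
       ) as rnk,
       -- row_number(): ties broken alphabetically, so numbering is 1, 2, 3, ...
       row_number() over (
           partition by date_trunc('day', t.created_at)
           order by count(*) desc, w.word
       ) as rn
from tweet t
cross join lateral regexp_split_to_table(t.tweet_clean, '\s+') w(word)
group by created_day, w.word
order by created_day, rn;
```

Assuming the session time zone puts the first two sample tweets on the same day, vaccine and covid tie there with two occurrences each: rank() gives both of them 1 and pfizer 3, while row_number() assigns 1, 2, 3. Filtering on rn <= 10 with row_number() would therefore return at most 10 rows per day.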

Answer 2:

If I understand correctly, you want 10 rows, one for each of the most common words, and for each word an array of its daily rankings. Assuming that each word is used on each day, this should do that:

    select wd.word,
           -- one array per word, with the daily ranks in date order
           array_agg(wd.day_rank order by wd.created_day) as ranks
    from (select date_trunc('day', t.created_at) as created_day, w.word,
                 -- overall frequency of the word across all days
                 sum(count(*)) over (partition by w.word) as total_cnt,
                 rank() over (partition by date_trunc('day', t.created_at) order by count(*) desc) as day_rank
          from tweet t cross join lateral
               regexp_split_to_table(regexp_replace(t.tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'
                                                   ), '\s+'
                                    ) w(word)
          group by created_day, w.word
         ) wd
    group by wd.word, wd.total_cnt
    order by wd.total_cnt desc
    limit 10;

The challenge here is that the arrays could be of different lengths, because a word may not appear on every day. In Postgres, you can pad the arrays with additional values, but it is not clear what value should stand in for a missing ranking.

The issue is that the ranking is per day. So, consider two days, one that has 100 words and one that has 10 words. In the first, a ranking of "10" is a very high ranking. A ranking of 10 in the second is very low.

I might suggest that you think about this issue and ask a new question if you need help resolving it.
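As one possible way to get arrays of equal length, here is a minimal sketch that puts NULL at the position of any day on which a word does not occur. It is an assumption about what the asker wants, not a settled answer to the question of which value belongs there. The inline tweet CTE and its values are hypothetical sample data, the CTE names (words, ranked, top_words, days) are illustrative, and the stop-word regexp_replace step is omitted for brevity.

```sql
-- Hypothetical sample data standing in for the real tweet table.
with tweet(created_at, tweet_clean) as (
    values
        (timestamptz '2020-12-01 12:00+00', 'vaccine covid vaccine'),
        (timestamptz '2020-12-05 12:00+00', 'covid pfizer')
),
words as (
    -- one row per (day, word) with that word's count on that day
    select date_trunc('day', t.created_at) as created_day, w.word, count(*) as cnt
    from tweet t
    cross join lateral regexp_split_to_table(t.tweet_clean, '\s+') w(word)
    group by 1, 2
),
ranked as (
    -- rank of each word within its day
    select created_day, word,
           rank() over (partition by created_day order by cnt desc) as day_rank
    from words
),
top_words as (
    -- the ten most frequent words overall
    select word, sum(cnt) as total_cnt
    from words
    group by word
    order by total_cnt desc
    limit 10
),
days as (
    -- every day that has at least one tweet
    select distinct created_day from words
)
select tw.word,
       -- array_agg keeps NULLs, so every array has one entry per day
       array_agg(r.day_rank order by d.created_day) as ranks
from top_words tw
cross join days d
left join ranked r
       on r.word = tw.word
      and r.created_day = d.created_day
group by tw.word, tw.total_cnt
order by tw.total_cnt desc;
```

Each array then has one element per day in days, in date order. Days with no tweets at all are still missing from every array; covering them would need a calendar built with generate_series instead of the days CTE. The sketch also does nothing about the comparability issue described above: a rank of 10 still means different things on days with different vocabulary sizes.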

Source: https://stackoverflow.com/questions/65354100/get-ranking-of-words-over-date-based-on-frequency-in-postgresql

Tags

sql

postgresql

count

greatest-n-per-group

lateral-join