What is the best way to compute trending topics or tags?

太阳男子 2020-12-04 04:34

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics

11 Answers
  • 2020-12-04 05:04

    I worked on a project whose aim was to find trending topics in the live Twitter stream and to run sentiment analysis on them (determining whether each trending topic was being discussed positively or negatively). I used Storm to handle the Twitter stream.

    I've published my report as a blog: http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html

    I've used Total Count and Z-Score for the ranking.
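
    For readers unfamiliar with the Z-Score, here is a minimal sketch of one common definition: how many standard deviations the current count sits above a topic's own historical mean. The class name `ZScoreTrend`, the use of the population standard deviation, and returning 0 for a flat history are assumptions made for illustration; the linked report may compute it differently.

```java
import java.util.List;

// Hypothetical helper: scores a current count against a topic's own history.
final class ZScoreTrend {
    /** Returns (current - mean) / stddev of the history, or 0 when the history is flat or empty. */
    static double zScore(List<Double> history, double current) {
        double mean = history.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = history.stream()
                .mapToDouble(x -> (x - mean) * (x - mean))
                .average().orElse(0.0);   // population variance (assumption)
        double stdDev = Math.sqrt(variance);
        return stdDev == 0.0 ? 0.0 : (current - mean) / stdDev;
    }

    public static void main(String[] args) {
        List<Double> history = List.of(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0);
        System.out.println(zScore(history, 9.0));
    }
}
```

    Topics can then be ordered by this score, so a small topic with an unusual jump can outrank a large topic having an ordinary day.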

    The approach I used is fairly generic, and in the discussion section I describe how the system can be extended to non-Twitter applications.

    Hope the information helps.

  • 2020-12-04 05:08

    I think the key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that'll require having enough data already, so that you know which days to exclude.

    From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.

    For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should consider Paris "hotter" than Britney, because her searches rose 20% above normal, while Britney's rose only 10%.

    God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
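
    The ranking by relative increase described above could be sketched like this. The class and method names are made up for the example, and skipping topics without any history is an assumption; the figures are the ones from the Britney/Paris example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks topics by how far they are above their normal rate.
final class RelativeIncrease {
    /** Fractional change versus the normal rate, e.g. 0.2 for a 20% rise. */
    static double relativeIncrease(double normal, double current) {
        if (normal <= 0) throw new IllegalArgumentException("normal rate must be positive");
        return (current - normal) / normal;
    }

    /** Topics ordered from the largest relative increase to the smallest. */
    static List<String> rank(Map<String, Double> normal, Map<String, Double> current) {
        List<String> topics = new ArrayList<>(current.keySet());
        topics.removeIf(t -> !normal.containsKey(t));  // no history, no baseline (assumption)
        topics.sort(Comparator.comparingDouble(
                (String t) -> relativeIncrease(normal.get(t), current.get(t))).reversed());
        return topics;
    }

    public static void main(String[] args) {
        Map<String, Double> normal = Map.of("Britney Spears", 100_000.0, "Paris Hilton", 50_000.0);
        Map<String, Double> today = Map.of("Britney Spears", 110_000.0, "Paris Hilton", 60_000.0);
        System.out.println(rank(normal, today));
    }
}
```

    With these numbers Paris Hilton (+20%) is ranked ahead of Britney Spears (+10%). A threshold such as "50% above normal" would simply be a filter on `relativeIncrease` before sorting.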

  • 2020-12-04 05:09

    Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.
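
    As a rough illustration of the decay idea, here is a generic half-life counter. To be clear, this is not the Hacker News or Reddit formula, just one simple form of exponential decay; the class name `DecayingCounter` and the half-life parameter are assumptions for the sketch.

```java
// Hypothetical counter whose value halves every halfLifeSeconds without new mentions.
final class DecayingCounter {
    private final double halfLifeSeconds;
    private double value = 0.0;
    private long lastUpdate = 0L;

    DecayingCounter(double halfLifeSeconds) {
        if (halfLifeSeconds <= 0) throw new IllegalArgumentException("half-life must be positive");
        this.halfLifeSeconds = halfLifeSeconds;
    }

    /** Records one mention at the given time (seconds on any non-decreasing clock). */
    void hit(long nowSeconds) {
        value = score(nowSeconds) + 1.0;                 // decay old mentions, then add the new one
        lastUpdate = Math.max(lastUpdate, nowSeconds);
    }

    /** Current score: each past mention counts half as much per elapsed half-life. */
    double score(long nowSeconds) {
        long elapsed = Math.max(0L, nowSeconds - lastUpdate);
        return value * Math.pow(0.5, elapsed / halfLifeSeconds);
    }
}
```

    One counter per topic is enough, and old activity fades on its own, so there is no need to store individual timestamps.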

    This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then filter out the ones that fall below some noise threshold.
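
    A minimal sketch of that ratio idea follows. It assumes the noise threshold is applied to the current raw count (it could just as well be applied to the ratio), and it skips topics with no history; both choices, like the names used, are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: current / historical count, ignoring topics that are too quiet to trust.
final class HotTrendRatio {
    /**
     * Returns current / historical for every topic whose current count is at least
     * noiseFloor. Topics without a positive historical count are skipped here; a real
     * system would need its own rule for brand-new topics.
     */
    static Map<String, Double> ratios(Map<String, Long> current,
                                      Map<String, Long> historical,
                                      long noiseFloor) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            long now = e.getValue();
            long before = historical.getOrDefault(e.getKey(), 0L);
            if (now < noiseFloor || before <= 0) continue;  // too small or no baseline
            result.put(e.getKey(), (double) now / before);
        }
        return result;
    }
}
```

    Without the floor, a topic going from 1 search to 3 would look three times "hotter", which is exactly the kind of noise the threshold is meant to remove.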

  • 2020-12-04 05:09

    You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).

    Just sort all your terms by logLR and pick the top ten.

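    public static void main(String... args) {
        TermBag today = ...     // bag of all of today's queries
        TermBag lastYear = ...  // bag of all of last year's queries
        for (String each : today.allTerms()) {
            System.out.println(logLikelihoodRatio(today, lastYear, each) + "\t" + each);
        }
    }

    // Signed log-likelihood ratio: positive when the term is relatively
    // more frequent in t1 than in t2, negative otherwise.
    public static double logLikelihoodRatio(TermBag t1, TermBag t2, String term) {
        double k1 = t1.occurrences(term);  // occurrences of the term in each bag
        double k2 = t2.occurrences(term);
        double n1 = t1.size();             // total number of words in each bag
        double n2 = t2.size();
        double p1 = k1 / n1;               // relative frequency in each bag
        double p2 = k2 / n2;
        double p = (k1 + k2) / (n1 + n2);  // pooled relative frequency
        double logLR = 2 * (logL(p1, k1, n1) + logL(p2, k2, n2) - logL(p, k1, n1) - logL(p, k2, n2));
        if (p1 < p2) logLR *= -1;
        return logLR;
    }

    // Binomial log-likelihood of k hits in n trials; the guards avoid evaluating 0 * log(0).
    private static double logL(double p, double k, double n) {
        return (k == 0 ? 0 : k * Math.log(p)) + ((n - k) == 0 ? 0 : (n - k) * Math.log(1 - p));
    }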
    

    P.S. A TermBag is an unordered collection of words. For each document you create one bag of terms by counting the occurrences of each word. The method occurrences then returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow; toLowerCase is typically good enough. In the example above, you would create one document from all of today's queries and one from all queries of the last year.
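
    A minimal TermBag along those lines might look like the sketch below. The methods `occurrences`, `size` and `allTerms` are the ones the code above relies on; `add` and `addAll`, the whitespace tokenization, and the `HashMap` backing are assumptions added to make the example complete.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch of a bag of words: per-term counts plus the total number of words added.
final class TermBag {
    private final Map<String, Integer> counts = new HashMap<>();
    private long total = 0;

    /** Adds one occurrence of a term, lower-cased so "Java" and "java" are counted together. */
    void add(String term) {
        counts.merge(term.toLowerCase(Locale.ROOT), 1, Integer::sum);
        total++;
    }

    /** Splits a query on whitespace (a simplifying assumption) and adds every word. */
    void addAll(String text) {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) add(word);
        }
    }

    /** Number of times the term was added; 0 if it was never seen. */
    int occurrences(String term) {
        return counts.getOrDefault(term.toLowerCase(Locale.ROOT), 0);
    }

    /** Total number of words added, counting repeats. */
    long size() { return total; }

    /** The distinct (normalized) terms in the bag. */
    Set<String> allTerms() { return counts.keySet(); }
}
```

    Calling `addAll` once per query on the bag for today, and on a second bag for the reference period, gives the two bags the main method expects.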

  • You need an algorithm that measures the velocity of a topic. In other words, if you graphed the topics, you would want to show the ones that are going up at an incredible rate.

    This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.

    Normalize

    One thing you'll need to do is normalize all your data. For each topic you are following, keep a very-low-pass filter that defines that topic's baseline. Every data point that comes in for that topic should then be normalized: subtract its baseline and ALL of your topics will sit near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which brings the signal to around 1.0. This not only brings all signals in line with each other (normalizes the baseline) but also normalizes the spikes. A Britney spike is going to be orders of magnitude larger than someone else's spike, but that doesn't mean you should pay attention to it: the spike may be very small relative to her baseline.
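
    One simple way to get such a baseline is an exponential moving average with a small smoothing factor, which acts as a low-pass filter. The sketch below uses the divide-by-baseline variant; the class name, the `alpha` parameter, and seeding the baseline with the first reading are assumptions, not part of any particular system.

```java
// Hypothetical per-topic normalizer: a slow exponential moving average as the baseline.
final class BaselineNormalizer {
    private final double alpha;   // smoothing factor in (0, 1]; smaller means a slower baseline
    private double baseline;
    private boolean seeded = false;

    BaselineNormalizer(double alpha) {
        if (alpha <= 0 || alpha > 1) throw new IllegalArgumentException("alpha must be in (0, 1]");
        this.alpha = alpha;
    }

    /**
     * Returns value / baseline (about 1.0 for a typical reading), then folds the
     * value into the baseline. The first reading seeds the baseline.
     */
    double normalize(double value) {
        if (!seeded) {
            baseline = value;
            seeded = true;
        }
        double normalized = baseline == 0.0 ? 0.0 : value / baseline;
        baseline += alpha * (value - baseline);   // low-pass update
        return normalized;
    }

    double baseline() { return baseline; }
}
```

    In practice alpha would be very small (so that a single spike barely moves the baseline), with one normalizer instance kept per topic.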

    Derive

    Once you've normalized everything, figure out the slope of each topic. Take two consecutive points and measure the difference. A positive difference means the topic is trending up; a negative difference means it is trending down. Then you can compare the normalized differences and find out which topics are shooting upward in popularity compared to other topics, with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from that of other topics.
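
    As a sketch of this step, assuming each topic already has a series of normalized values (for instance, signal divided by baseline), the slope is just the difference between the last two points. The names below are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: ranks topics by the latest change in their normalized series.
final class TrendSlope {
    /** Difference between the last two normalized points; positive means trending up. */
    static double latestSlope(double[] normalized) {
        if (normalized.length < 2) return 0.0;   // not enough data to measure a slope
        return normalized[normalized.length - 1] - normalized[normalized.length - 2];
    }

    /** Topics ordered from the steepest rise to the steepest fall. */
    static List<String> rankBySlope(Map<String, double[]> series) {
        return series.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> latestSlope(e.getValue())).reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

    Because the inputs are normalized, a small topic that triples ranks above a huge topic that grows by 10%, regardless of their raw counts.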

    This is really a first pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs), but it should be enough to get you started.

    Regarding the article

    The article is about topic trending, but it's not about how to calculate what's hot and what's not; it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. This article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.

    As Nixuz points out, this is also referred to as a Z or Standard Score.
