What is the best way to compute trending topics or tags?

太阳男子 2020-12-04 04:34

Many sites offer statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its "News Trends" section. There, you can see the topics…

11 Answers
  • 2020-12-04 04:45

    If you simply look at tweets or status messages to get your topics, you're going to encounter a lot of noise, even if you remove all stop words. One way to get a better subset of topic candidates is to focus only on tweets/messages that share a URL, and get the keywords from the titles of those web pages. And make sure you apply POS tagging to extract nouns and noun phrases as well.

    Titles of web pages usually are more descriptive and contain words that describe what the page is about. In addition, sharing a web page is usually correlated with sharing breaking news (e.g. if a celebrity like Michael Jackson dies, you're going to get a lot of people sharing an article about his death).

    I've run experiments where I only take popular keywords from titles, and then get the total counts of those keywords across all status messages, and they definitely remove a lot of noise. If you do it this way, you don't need a complex algorithm; just do a simple ordering of the keyword frequencies and you're halfway there.
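
    As a rough sketch of that ordering step (assuming the titles of the shared pages have already been fetched; the stop-word list below is purely illustrative, and POS tagging, e.g. via NLTK, is left out):

    from collections import Counter
    import re

    # Illustrative stop-word list; a real one would be much larger.
    STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "at"}

    def title_keywords(titles):
        """Count keyword frequencies across the titles of shared pages."""
        counts = Counter()
        for title in titles:
            words = re.findall(r"[a-z']+", title.lower())
            counts.update(w for w in words if w not in STOP_WORDS)
        return counts

    titles = [
        "Michael Jackson dies at 50",
        "Michael Jackson: the king of pop is dead",
        "Heavy rain expected in the north",
    ]
    print(title_keywords(titles).most_common(3))
    # e.g. [('michael', 2), ('jackson', 2), ('dies', 1)]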

  • 2020-12-04 04:45

    The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.

    So, for queries that have more than a certain threshold of volume, track each one, and when the count jumps to some multiple (say, roughly double) of its historical value, flag it as a new hot trend.
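
    A minimal sketch of that rule, assuming you keep a per-query history of daily counts (the names and thresholds here are only illustrative):

    def is_hot(history, current, min_volume=100, jump=2.0):
        """Flag a query when today's count is at least `jump` times its
        historical average and also clears a minimum volume threshold."""
        if current < min_volume or not history:
            return False
        baseline = sum(history) / float(len(history))
        return current >= jump * baseline

    print(is_hot([120, 130, 110, 125], 260))  # True: roughly double the baseline
    print(is_hot([120, 130, 110, 125], 140))  # False: a normal fluctuation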

  • 2020-12-04 04:46

    I was wondering whether it is possible to use the regular physics acceleration formula in such a case:

    (v2 - v1) / t, or dv/dt
    

    We can consider v1 to be the initial likes/votes/comment count per hour and v2 to be the current "velocity" per hour over the last 24 hours.

    This is more a question than an answer, but it seems it might just work. Any content with the highest acceleration would be the trending topic...
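
    For what it's worth, a minimal sketch of that idea (the rates and the 24-hour window are only assumptions):

    def acceleration(v1, v2, hours=24.0):
        """Change in 'velocity' (e.g. likes per hour) over the time window."""
        return (v2 - v1) / hours

    # A topic jumping from 10 to 90 interactions/hour accelerates far more than
    # one drifting from 500 to 520, even though the latter has more raw volume.
    print(acceleration(10, 90))    # ~3.33
    print(acceleration(500, 520))  # ~0.83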

    I am sure this may not solve the Britney Spears problem :-)

  • 2020-12-04 04:48

    Probably a simple gradient of topic frequency would work -- a large positive gradient means a topic is growing quickly in popularity.

    The easiest way would be to bin the number of searches each day, so you have something like

    searches = [ 10, 7, 14, 8, 9, 12, 55, 104, 100 ]
    

    and then find out how much it changed from day to day:

    hot_factor = [ b-a for a, b in zip(searches[:-1], searches[1:]) ]
    # hot_factor is [ -3, 7, -6, 1, 3, 43, 49, -4 ]
    

    Then just apply some sort of threshold so that days where the increase was more than, say, 40 are considered 'hot'. You could make this far more complicated if you'd like, too. Rather than the absolute difference you can take the relative difference, so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't. Or use a more complicated gradient that takes into account trends over more than just one day to the next.
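
    For example, a relative version of the same idea might look like this (a sketch; the 50% threshold is arbitrary):

    searches = [10, 7, 14, 8, 9, 12, 55, 104, 100]

    # Relative day-over-day change; guard against division by zero.
    rel_change = [(b - a) / float(a) if a else float("inf")
                  for a, b in zip(searches[:-1], searches[1:])]
    # rel_change is roughly [-0.30, 1.00, -0.43, 0.13, 0.33, 3.58, 0.89, -0.04]

    hot_days = [i + 1 for i, r in enumerate(rel_change) if r > 0.5]
    print(hot_days)  # [2, 6, 7]: days with more than a 50% jump over the previous day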

  • 2020-12-04 04:49

    This problem calls for a z-score or standard score, which takes into account the historical average, as other people have mentioned, but also the standard deviation of the historical data, making it more robust than just using the average.

    In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.

    z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
    

    When a z-score is used, the higher or lower the z-score, the more abnormal the trend; for example, if the z-score is highly positive then the trend is rising abnormally, while if it is highly negative it is falling abnormally. So once you calculate the z-score for all the candidate trends, the 10 highest z-scores correspond to the 10 most abnormally increasing trends.

    Please see Wikipedia for more information about z-scores.

    Code

    from math import sqrt
    
    def zscore(obs, pop):
        # Size of population.
        number = float(len(pop))
        # Average population value.
        avg = sum(pop) / number
        # Standard deviation of population.
        std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
        # Zscore Calculation.
        return (obs - avg) / std
    

    Sample Output

    >>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
    3.5
    >>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
    0.0739221270955
    >>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
    1.00303599234
    >>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
    -0.922793112954
    >>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
    1.65291949506
    

    Notes

    • You can use this method with a sliding window (e.g. the last 30 days) if you wish not to take too much history into account, which will make short-term trends more pronounced and can cut down on the processing time.

    • You could also use a z-score for values such as the change in views from one day to the next to locate abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views-per-day graph.

    • If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them, and hence you only need to keep these three values as history, not each data value. The following code demonstrates this.

      from math import sqrt
      
      class zscore:
          def __init__(self, pop = []):
              self.number = float(len(pop))
              self.total = sum(pop)
              self.sqrTotal = sum(x ** 2 for x in pop)
          def update(self, value):
              self.number += 1.0
              self.total += value
              self.sqrTotal += value ** 2
          def avg(self):
              return self.total / self.number
          def std(self):
              return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
          def score(self, obs):
              return (obs - self.avg()) / self.std()
      
    • Using this method your workflow would be as follows. For each topic, tag, or page, create floating-point fields in your database for the total number of days, the sum of views, and the sum of views squared. If you have historic data, initialize these fields from it; otherwise initialize them to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally, update each of the three fields with the day's value and repeat the process tomorrow. (A sketch of this daily loop follows.)
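
      A sketch of that daily loop using the incremental zscore class above (the function and variable names are illustrative and not tied to any particular database; it also assumes the history already has some variance, since the simple class divides by the standard deviation):

      def end_of_day(trackers, todays_views, top_x=10):
          """trackers maps topic -> zscore instance; todays_views maps topic -> today's count."""
          scores = {}
          for topic, tracker in trackers.items():
              views = todays_views.get(topic, 0)
              scores[topic] = tracker.score(views)  # score against the history first...
              tracker.update(views)                 # ...then fold today into the history
          # The top_x highest z-scores are the day's "hottest trends".
          return sorted(scores, key=scores.get, reverse=True)[:top_x]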

    New Addition

    Normal z-scores as discussed above do not take the order of the data into account, and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most recent data should carry more weight than older data, and hence we want the '1' observation to have a larger-magnitude score than the '9' observation. To achieve this I propose a floating-average z-score. It should be clear that this method is NOT guaranteed to be statistically sound, but it should be useful for trend finding or similar. The main difference between the standard z-score and the floating-average z-score is the use of a floating average to calculate the average population value and the average population value squared. See the code for details:

    Code

    from math import sqrt

    class fazscore:
        def __init__(self, decay, pop = []):
            self.sqrAvg = self.avg = 0
            # The rate at which the historic data's effect will diminish.
            self.decay = decay
            for x in pop: self.update(x)
        def update(self, value):
            # Set initial averages to the first value in the sequence.
            if self.avg == 0 and self.sqrAvg == 0:
                self.avg = float(value)
                self.sqrAvg = float((value ** 2))
            # Calculate the average of the rest of the values using a 
            # floating average.
            else:
                self.avg = self.avg * self.decay + value * (1 - self.decay)
                self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
            return self
        def std(self):
            # Somewhat ad-hoc standard deviation calculation.
            return sqrt(self.sqrAvg - self.avg ** 2)
        def score(self, obs):
            if self.std() == 0: return (obs - self.avg) * float("infinity")
            else: return (obs - self.avg) / self.std()
    

    Sample IO

    >>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
    -1.67770595327
    >>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
    0.596052006642
    >>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
    3.46442230724
    >>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
    7.7773245459
    >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
    -0.24633160155
    >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
    1.1069362749
    >>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
    -0.786764452966
    >>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
    1.82262469243
    >>> fazscore(0.8, [40] * 200).score(1)
    -inf
    

    Update

    As David Kemp correctly pointed out, if a z-score is requested for an observed value that differs from a series of constant values seen so far, the result should probably be non-zero. In fact the value returned should be infinite. So I changed this line,

    if self.std() == 0: return 0
    

    to:

    if self.std() == 0: return (obs - self.avg) * float("infinity")
    

    This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:

    if self.std() == 0: return obs - self.avg
    
  • 2020-12-04 04:51

    Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.

    One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:

    a_n = a_(n-1)*b + c_n*(1-b)
    

    Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).

    The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.

    EDIT

    If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.

    Let's say the new values are 5,0,0,1,4:

    a_0 = 1
    c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
    c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
    c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
    c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
    c_5 = 4 : a_5 = .9*1.1206 + .1*4 = 1.40854
    

    Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our first input was 5. What's going on? If you expand out the math, what you get is:

    a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0
    

    What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinite and the ... could go on forever, then all the weights would sum to 1. But if n is relatively small, a good amount of weight is left on the initial value a_0 (the leftover weight works out to b^n).
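
    A quick numeric check of this expansion against the worked example above (a sketch; b = .9, a_0 = 1, inputs 5, 0, 0, 1, 4):

    b, a0 = 0.9, 1.0
    c = [5, 0, 0, 1, 4]
    # Sum of (1-b) * b^k * c_(n-k), newest value first, plus the leftover weight on a_0.
    a5 = sum((1 - b) * b**k * x for k, x in enumerate(reversed(c))) + b**len(c) * a0
    print(a5)  # ~1.40854, matching a_5 above; the leftover weight is b**5 = 0.59049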

    If you study the above formula, you should realize a few things about this usage:

    1. All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
    2. Recent values contribute more than older values.
    3. The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.

    I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a Python implementation (minus all the database interaction):

    >>> class EMA(object):
    ...  def __init__(self, base, decay):
    ...   self.val = base
    ...   self.decay = decay
    ...   print(self.val)
    ...  def update(self, value):
    ...   self.val = self.val*self.decay + (1-self.decay)*value
    ...   print(self.val)
    ... 
    >>> a = EMA(1, .9)
    1
    >>> a.update(10)
    1.9
    >>> a.update(10)
    2.71
    >>> a.update(10)
    3.439
    >>> a.update(10)
    4.0951
    >>> a.update(10)
    4.68559
    >>> a.update(10)
    5.217031
    >>> a.update(10)
    5.6953279
    >>> a.update(10)
    6.12579511
    >>> a.update(10)
    6.513215599
    >>> a.update(10)
    6.8618940391
    >>> a.update(10)
    7.17570463519
    