Understanding algorithms for measuring trends

前端 未结 4 1006
醉梦人生
醉梦人生 2021-01-30 02:17

What\'s the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial on calculating Wikipedia trends?

There are actuall

相关标签:
4条回答
  • 2021-01-30 02:20

    The reason for moderating the measure by the volume of clicks is not to penalise popular pages but to make sure that you can compare large and small changes with a single measure. If you just use y2 - y1 you will only ever see the click changes on large volume pages. What this is trying to express is "significant" change. 1000 clicks change if you attract 100 clicks is really significant. 1000 click change if you attract 100,000 is less so. What this formula is trying to do is make both of these visible.

    Try it out at a few different scales in Excel, you'll get a good view of how it operates.

    Hope that helps.

    0 讨论(0)
  • 2021-01-30 02:21

    As the in-line comment goes, this is a simple "baseline trend algorithm", which basically means before you compare the trends of two different pages, you have to establish a baseline. In many cases, the mean value is used, it's straightforward if you plot the pageviews against the time axis. This method is widely used in monitoring water quality, air pollutants, etc. to detect any significant changes w.r.t the baseline.

    In OP's case, the slope of pageviews is weighted by the log of totalpageviews. This sorta uses the totalpageviews as a baseline correction for the slope. As Simon put it, this puts a balance between two pages with very different totalpageviews. For exmaple, A has a slope 500 over 1000,000 total pageviews, B is 1000 over 1,000. A log basically means 1000,000 is ONLY twice more important than 1,000 (rather than 1000 times). If you only consider the slope, A is less popular than B. But with a weight, now the measure of popularity of A is the same as B. I think it is quite intuitive: though A's pageviews is only 500 pageviews, but that's because it's saturating, you still gotta give it enough credit.

    As for the error, I believe it comes from the (relative) standard error, which has a factor 1/sqrt(n), where n is the number of data points. In the code, the error is equal to (1/sqrt(n))*(1/sqrt(mean)). It roughly translates into : the more data points, the more accurate the trend. I don't see it is an exact math formula, just a brute trend analysis algorithm, anyway the relative value is more important in this context.

    In summary, I believe it's just an empirical formula. More advanced topics can be found in some biostatistics textbooks (very similar to monitoring the breakout of a flu or the like.)

    0 讨论(0)
  • 2021-01-30 02:40

    The code implements statistics (in this case the "baseline trend"), you should educate yourself on that and everything becomes clearer. Wikibooks has a good instroduction.

    The algorithm takes into account that new pages are by definition more unpopular than existing ones (because - for example - they are linked from relatively few other places) and suggests that those new pages will grow in popularity over time.

    error is the error margin the system expects for its prognoses. The higher error is, the more unlikely the trend will continue as expected.

    0 讨论(0)
  • 2021-01-30 02:43

    another way to look at it is this:

    suppose your page and my page are made at same day, and ur page gets total views about ten million, and mine about 1 million till some point. then suppose the slope at some point is a million for me, and 0.5 million for you. if u just use slope, then i win, but ur page already had more views per day at that point, urs were having 5 million, and mine 1 million, so that a million on mine still makes it 2 million, and urs is 5.5 million for that day. so may be this scaling concept is to try to adjust the results to show that ur page is also good as a trend setter, and its slope is less but it already was more popular, but the scaling is only a log factor, so doesnt seem too problematic to me.

    0 讨论(0)
提交回复
热议问题