What is Lossy Counting?

暖寄归人 2021-02-01 07:26

Can anyone explain the Lossy Counting algorithm to me? It is a streaming algorithm for finding the frequency of items in a data stream. Thanks.

2 Answers
  • 2021-02-01 07:38

    Say you're looking at the traffic for Facebook profiles. You have billions of hits. You want to find which profiles are accessed most often. You could keep a count for each profile, but then you'd have a very large number of counts to keep track of, the vast majority of which would be meaningless.

    With lossy counting, you periodically remove very low-count elements from the table. The most-frequently accessed profiles would almost never have low counts anyway, and if they did, they wouldn't be likely to stay there for long.

    The algorithm basically involves grouping the inputs into blocks or chunks and counting within each chunk. At the end of each chunk, you reduce the count of every tracked element by one and drop any elements whose counts reach zero.
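
    A minimal Python sketch of that chunk-and-decrement idea (the function and parameter names are illustrative, not from any standard library):

        from collections import defaultdict

        def lossy_count(stream, chunk_size):
            """Simplified lossy counting: count items within fixed-size chunks;
            at each chunk boundary, decrement every tracked count by one and
            drop any count that reaches zero."""
            counts = defaultdict(int)
            for i, item in enumerate(stream, start=1):
                counts[item] += 1
                if i % chunk_size == 0:
                    # End of chunk: decay all counts and prune the zeros.
                    for key in list(counts):
                        counts[key] -= 1
                        if counts[key] <= 0:
                            del counts[key]
            return dict(counts)

        # Frequent items survive the periodic pruning; rare ones are dropped.
        hits = ["alice", "bob", "alice", "carol", "alice", "bob", "alice", "dave"]
        print(lossy_count(hits, chunk_size=4))  # {'alice': 2}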

    The most-frequently hit profiles will get into your table and stay there. Any profiles that aren't hit very often will drop to zero within a few blocks and you won't have to track them any more.

    Note that the final results are order-dependent, giving heavier weight to the counts processed last. In some cases, this makes perfect sense and is an upside rather than a downside. (If you want to know basically which profiles are the most popular now, you want to weigh accesses today more than accesses last month.)

    There are a large number of refinements to the algorithm, but the basic idea is this: to find the heavy hitters without having to track every element, periodically purge your counts of any elements that don't seem likely to be heavy hitters based on the data so far.

  • 2021-02-01 07:56

    You can find an explanation of how Lossy Counting (and Sticky Sampling) work in this blog post, and an open-source implementation here.

    The most frequently viewed items “survive”. Given a frequency threshold f, a frequency error e, and a total number of elements N, the output can be expressed as follows: every element whose estimated count exceeds fN − eN.

    In the worst case, we need (1/e) · log(eN) counters. For example, with e = 2% and N = 1,000,000, that is at most 50 · log(20,000) ≈ 50 · 9.9 ≈ 495 counters (natural log), versus one counter per distinct element for exact counting.

    For example, we may want to print the Facebook pages that account for more than 20% of all hits, with an error threshold of 2% (rule of thumb: set the error to 10% of the frequency threshold).

    For frequency threshold f = 20% and error e = 2%, all elements with true frequency exceeding 20% will be output; there are no false negatives. But we undercount: the reported frequency of an element can be less than its true frequency by at most 2%. False positives may appear with true frequency between 18% and 20%. Finally, no element with true frequency less than 18% will be output.
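
    As a quick worked example (assuming, hypothetically, N = 1,000,000 page views): fN = 200,000 and eN = 20,000, so the algorithm reports every page whose estimated count exceeds 180,000, and because counts are only ever underestimated, any reported page is guaranteed to have at least 180,000 true hits.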

    Given a window of size 1/e, the following guarantees hold (see the sketch after this list):

    • Counts are underestimated by at most e*N (frequencies by at most e)
    • No false negatives
    • False positives have a true count of at least fN − eN (true frequency of at least f − e)
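
    To make these guarantees concrete, here is a rough Python sketch of the classic Manku–Motwani formulation (the class name LossyCounter and its method names are my own, not a standard API):

        import math

        class LossyCounter:
            """Sketch of Manku-Motwani-style lossy counting with error bound e."""

            def __init__(self, e):
                self.e = e
                self.w = math.ceil(1 / e)   # bucket (window) width
                self.n = 0                  # total items seen so far
                self.entries = {}           # item -> (count, delta)

            def add(self, item):
                self.n += 1
                bucket = math.ceil(self.n / self.w)  # current bucket id
                if item in self.entries:
                    count, delta = self.entries[item]
                    self.entries[item] = (count + 1, delta)
                else:
                    # delta bounds how much this item may already be undercounted
                    self.entries[item] = (1, bucket - 1)
                if self.n % self.w == 0:
                    # Bucket boundary: prune entries that cannot be frequent.
                    self.entries = {k: cd for k, cd in self.entries.items()
                                    if cd[0] + cd[1] > bucket}

            def query(self, f):
                # No false negatives; counts are underestimated by at most e*n.
                threshold = (f - self.e) * self.n
                return {k: c for k, (c, d) in self.entries.items() if c >= threshold}

    Feeding a stream through add() and then calling query(f=0.20) with e = 0.02 reproduces the example above: everything with true frequency over 20% is returned, possibly along with some items whose true frequency falls in the 18%–20% band.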