Can anyone explain the Lossy Counting algorithm to me? It is a streaming algorithm for finding the frequencies of items in a stream. Thanks.
You can find an explanation of how Lossy Counting (and Sticky Sampling) work in this blog post, and an open-source version here.
The most frequently viewed items “survive”. Given a frequency threshold f, a frequency error e, and the total number of elements N seen so far, the output can be expressed as follows: all elements whose maintained count exceeds fN - eN, that is, (f - e)N.
In the worst case we need (1/e) * log(eN) counters.
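To make the mechanics concrete, here is a minimal Python sketch of Lossy Counting as it is usually described (Manku and Motwani's formulation). The function names lossy_count and frequent_items and the dictionary layout are my own choices for illustration, not taken from the blog post or the open-source version. The stream is cut into buckets of width ceil(1/e); each tracked item keeps a count plus the maximum amount it may have been undercounted (delta), and entries that are too rare are pruned at every bucket boundary.

```python
import math
from typing import Dict, Hashable, Iterable, List, Tuple


def lossy_count(stream: Iterable[Hashable], e: float) -> Tuple[Dict[Hashable, List[int]], int]:
    """Run Lossy Counting over `stream` with error bound `e`.

    Returns (entries, n): entries maps item -> [count, delta] and n is the
    number of elements seen. Names and layout are illustrative assumptions.
    """
    width = math.ceil(1 / e)  # bucket (window) width
    entries: Dict[Hashable, List[int]] = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets,
            # so it could be undercounted by at most bucket - 1.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            stale = [k for k, (count, delta) in entries.items() if count + delta <= bucket]
            for k in stale:
                del entries[k]
    return entries, n


def frequent_items(entries: Dict[Hashable, List[int]], n: int, f: float, e: float) -> Dict[Hashable, int]:
    """Report items whose maintained count is at least (f - e) * n."""
    return {item: count for item, (count, _delta) in entries.items() if count >= (f - e) * n}
```

You would call lossy_count(stream, e) once over the stream and then frequent_items(entries, n, f, e) whenever you want the current answer. The reported counts are lower bounds on the true counts.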
For example, we may want to print the Facebook pages of people who receive more than 20% of the hits, with an error threshold of 2% (rule of thumb: set the error to 10% of the frequency threshold).
For frequency f = 20% and e = 2%, all elements with true frequency exceeding f = 20% will be output, so there are no false negatives. Counts can be underestimated, though: the output frequency of an element can be less than its true frequency by at most 2%. False positives can appear, but only for elements with true frequency between 18% and 20%. Last, no element with true frequency less than 18% will be output.
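The three cases above can be written down as a tiny helper. This is only an illustration of the guarantees, not part of the algorithm itself; the name classify and the use of whole-number percentages (to avoid floating-point rounding at the boundaries) are my own assumptions.

```python
def classify(true_pct: int, f_pct: int, e_pct: int) -> str:
    """Say what Lossy Counting guarantees for an item, given its true
    frequency and the thresholds f and e, all as whole-number percentages."""
    if true_pct > f_pct:
        return "always reported"    # no false negatives
    if true_pct >= f_pct - e_pct:
        return "may be reported"    # possible false positive
    return "never reported"         # below f - e


# With f = 20% and e = 2%:
for pct in (25, 19, 10):
    print(pct, classify(pct, 20, 2))
```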
Given a window of size 1/e, the following guarantees hold: