Suppose we want a system that keeps the top k frequent words appearing in tweets over the last one hour. How would we design it?
I can come up with a hashmap, a heap, a log, or MapReduce, but I can't see how to combine them efficiently.
Here is an algorithm that is fairly efficient for your purpose (a rough code sketch follows the complexity summary below):
- First, use a dictionary rather than a plain hashmap to store the strings, because it gives better space efficiency.
- Map each word's index in the dictionary to its frequency with a hashmap.
- Maintain a min-heap holding the indices of the k most frequent words.
- Keep a pointer for each word giving its location in the heap (-1 if it is not present).
- When a word's frequency is updated, check whether it is already in the heap; if so, restore the heap property starting from its position, which you can find directly via the pointer maintained alongside the heap.
- If the word is not in the heap and its frequency is greater than that of the heap's top (minimum) element, delete the top, insert the word, and update the heap pointers of both words.
Time complexity:
- Updating the top k: O(log k) for heapify, insert, and delete
- Updating or searching a word: O(|W|), where |W| is the length of the word

Space complexity:
- Heap: O(k)
- Dictionary, hashmap, and heap pointers: O(N), where N is the total number of words
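A minimal sketch of this in Java, assuming a plain HashMap in place of the dictionary and ignoring the one-hour expiry (the class and method names are mine, not part of the answer):

```java
import java.util.*;

class TopK {
    private final int k;
    private final Map<String, Integer> id = new HashMap<>(); // word -> index
    private final List<Integer> freq = new ArrayList<>();    // index -> frequency
    private final List<Integer> heapPos = new ArrayList<>(); // index -> heap position, -1 if absent
    private final int[] heap;                                // min-heap of word indices, keyed by frequency
    private int heapSize = 0;

    TopK(int k) {
        this.k = k;
        this.heap = new int[k];
    }

    void add(String word) {
        int w = id.computeIfAbsent(word, x -> {
            freq.add(0);
            heapPos.add(-1);
            return freq.size() - 1;
        });
        freq.set(w, freq.get(w) + 1);

        int pos = heapPos.get(w);
        if (pos >= 0) {
            siftDown(pos);                            // already in heap: its key grew, push it down
        } else if (heapSize < k) {
            heap[heapSize] = w;                       // heap not yet full: plain insert
            heapPos.set(w, heapSize);
            siftUp(heapSize++);
        } else if (freq.get(w) > freq.get(heap[0])) {
            heapPos.set(heap[0], -1);                 // beats the current minimum: replace the root
            heap[0] = w;
            heapPos.set(w, 0);
            siftDown(0);
        }
    }

    List<String> topK() {                             // current top-k words, in no particular order
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : id.entrySet())
            if (heapPos.get(e.getValue()) >= 0) out.add(e.getKey());
        return out;
    }

    private void siftDown(int i) {
        while (true) {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < heapSize && freq.get(heap[l]) < freq.get(heap[m])) m = l;
            if (r < heapSize && freq.get(heap[r]) < freq.get(heap[m])) m = r;
            if (m == i) return;
            swap(i, m);
            i = m;
        }
    }

    private void siftUp(int i) {
        while (i > 0 && freq.get(heap[i]) < freq.get(heap[(i - 1) / 2])) {
            swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }

    private void swap(int a, int b) {
        int t = heap[a]; heap[a] = heap[b]; heap[b] = t;
        heapPos.set(heap[a], a);
        heapPos.set(heap[b], b);
    }
}
```

The position pointers are what make the O(log k) update possible: without them, finding a word inside the heap would cost O(k).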
You can use a TreeMap, which is essentially a sorted map. In Java you can make the TreeMap list out its entries in descending order (by supplying a custom Comparator, or by having the keys implement Comparable). In that case, the top k entries after the specified period of time would give you the result.
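One possible reading of this suggestion, sketched in Java: key a descending TreeMap by count and map each count to the words that have it. The word counts below are made up for illustration, and the grouping by count is my interpretation, not spelled out in the answer.

```java
import java.util.*;

public class TreeMapTopK {
    public static void main(String[] args) {
        // Illustrative counts only; in the real system these come from the last hour of tweets.
        Map<String, Integer> counts = Map.of("the", 17, "a", 17, "wilbur", 12, "feature", 11);

        // Descending TreeMap: count -> words having that count.
        TreeMap<Integer, List<String>> byCount = new TreeMap<>(Comparator.reverseOrder());
        counts.forEach((word, c) -> byCount.computeIfAbsent(c, x -> new ArrayList<>()).add(word));

        int k = 3;
        List<String> topK = new ArrayList<>();
        outer:
        for (List<String> words : byCount.values()) {
            for (String w : words) {
                topK.add(w);
                if (topK.size() == k) break outer;
            }
        }
        System.out.println(topK); // e.g. [the, a, wilbur]
    }
}
```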
If the words are not weighted (other than weights 0 and 1), then it is possible to derive a simple datastructure which maintains the word counts in order, using O(N) auxiliary storage, where N is the number of unique words encountered in the sliding window (one hour, in the example). All operations (add a word, expire a word, look up the most frequent word) can be performed in O(1) time. Since any accurate solution needs to retain all the unique words in the sliding window, this solution is not asymptotically worse, although the constant factor per word is not small.
The key to the solution is that the count for any given word can only be incremented or decremented by 1, and that all of the counts are integers. Consequently, it is possible to maintain a doubly-linked list of counts (in order) where each node in the list points to a double-linked list of words which have that count. In addition, each node in the word-list points back to the appropriate count node. Finally, we maintain a hashmap which allows us to find the node corresponding to a given word.
Finally, in order to decay the words at the end of their life, we need to retain the entire datastream from the sliding window, which has a size of O(N'), where N' is the total number of words encountered during the sliding window. This can be stored as a singly-linked list where each node has a timestamp and a pointer to the unique word in the word-list.
When a word is encountered or expired, its count needs to be adjusted. Since the count can only be incremented or decremented by 1, the adjustment always consists of moving the word to the adjacent count-node (which may or may not exist); since the count-nodes are stored in a sorted linked list, the adjacent node can be found or created in O(1) time. Furthermore, the most popular words (and counts) can always be traced in constant time by traversing the count list backwards from the maximum.
In case that was not obvious, here is a rough ascii art drawing of the datastructure at a given point in time:
Count list word lists (each node points back to the count node)
17 a <--> the <--> for
^
|
v
12 Wilbur <--> drawing
^
|
v
11 feature
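In code, the layout described above might look roughly like this (the class and field names are mine; the operations on it are sketched further below, after the worked examples):

```java
import java.util.*;

// Doubly-linked list of counts, kept in ascending order; each count node owns
// a doubly-linked list of the words that currently have that count.
class CountNode {
    int count;
    CountNode prev, next;   // adjacent (smaller / larger) counts
    WordNode words;         // head of this count's word-list
}

class WordNode {
    final String word;
    WordNode prev, next;    // neighbours within the same count's word-list
    CountNode countNode;    // back-pointer to the owning count node
    WordNode(String word) { this.word = word; }
}

// One timestamped occurrence, retained so the word can be decremented an hour later.
class Occurrence {
    final long timestampMillis;
    final WordNode word;
    Occurrence(long timestampMillis, WordNode word) {
        this.timestampMillis = timestampMillis;
        this.word = word;
    }
}

class SlidingWindowCounts {
    final Map<String, WordNode> index = new HashMap<>(); // word -> its word node
    final Deque<Occurrence> window = new ArrayDeque<>(); // raw stream, oldest occurrence first
    CountNode minCount;                                  // head of the count list (smallest count)
    CountNode maxCount;                                  // tail of the count list (largest count)
}
```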
Now, suppose we find a Wilbur. That will raise its count to 13; we can see from the fact that the successor of 12 is not 13 that the 13 count node needs to be created and inserted into the count-list. After we do that, we remove Wilbur from its current word-list, put it into the newly-created empty word-list associated with the new count node, and change the count-pointer in Wilbur to point to the new count node.
Then, suppose that a use of drawing expires, so its new count will be 11. We can see from the fact that the predecessor of 12 is 11 that no new count node needs to be created; we simply remove drawing from its word-list and attach it to the word-list associated with 11, fixing its count-pointer as we do so. Now we notice that the word-list associated with 12 is empty, so we can remove the 12 count node from the count-list and delete it.
When the count for a word reaches 0, rather than attaching it to the 0 count node, which doesn't exist, we just delete the word node. And if a new word is encountered, we just add the word to the 1 count node, creating that count node if it doesn't exist.
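Putting these cases together, the moves might be coded roughly as follows. This builds on the classes sketched after the drawing; the names, the helper structure, and the hard-coded one-hour constant are mine, and edge cases are only lightly handled.

```java
// Rough sketch only; uses the CountNode / WordNode / SlidingWindowCounts classes above.
class CountOps {

    // A word arrived at time nowMillis: record the occurrence and raise its count by 1.
    static void increment(SlidingWindowCounts s, String word, long nowMillis) {
        WordNode w = s.index.get(word);
        if (w == null) {                                      // new word: it belongs in the count-1 node
            w = new WordNode(word);
            s.index.put(word, w);
            CountNode one = (s.minCount != null && s.minCount.count == 1)
                    ? s.minCount
                    : insertAfter(s, null, 1);                // count 1 is always the minimum
            attach(one, w);
        } else {                                              // existing word: move to the adjacent count+1 node
            CountNode cur = w.countNode;
            CountNode next = (cur.next != null && cur.next.count == cur.count + 1)
                    ? cur.next
                    : insertAfter(s, cur, cur.count + 1);
            detach(s, w);                                     // may delete cur if its word-list empties
            attach(next, w);
        }
        s.window.addLast(new Occurrence(nowMillis, w));       // remember the occurrence for later expiry
    }

    // An occurrence expired: move the word to the adjacent count-1 node, or drop it at 0.
    static void decrement(SlidingWindowCounts s, WordNode w) {
        CountNode cur = w.countNode;
        if (cur.count == 1) {                                 // count reaches 0: delete the word node
            detach(s, w);
            s.index.remove(w.word);
            return;
        }
        CountNode lower = (cur.prev != null && cur.prev.count == cur.count - 1)
                ? cur.prev
                : insertAfter(s, cur.prev, cur.count - 1);
        detach(s, w);
        attach(lower, w);
    }

    // Drop occurrences older than one hour, decrementing the affected words.
    static void expire(SlidingWindowCounts s, long nowMillis) {
        while (!s.window.isEmpty()
                && nowMillis - s.window.peekFirst().timestampMillis > 3_600_000L) {
            decrement(s, s.window.pollFirst().word);
        }
    }

    // The k most frequent words: walk the count list backwards from the maximum.
    static List<String> topK(SlidingWindowCounts s, int k) {
        List<String> out = new java.util.ArrayList<>();
        for (CountNode c = s.maxCount; c != null && out.size() < k; c = c.prev)
            for (WordNode w = c.words; w != null && out.size() < k; w = w.next)
                out.add(w.word);
        return out;
    }

    // ---- linked-list plumbing ----

    static void attach(CountNode c, WordNode w) {             // push w onto c's word-list
        w.countNode = c;
        w.prev = null;
        w.next = c.words;
        if (c.words != null) c.words.prev = w;
        c.words = w;
    }

    static void detach(SlidingWindowCounts s, WordNode w) {   // unlink w; drop its count node if now empty
        CountNode c = w.countNode;
        if (w.prev != null) w.prev.next = w.next; else c.words = w.next;
        if (w.next != null) w.next.prev = w.prev;
        w.prev = w.next = null;
        if (c.words == null) {
            if (c.prev != null) c.prev.next = c.next; else s.minCount = c.next;
            if (c.next != null) c.next.prev = c.prev; else s.maxCount = c.prev;
        }
    }

    static CountNode insertAfter(SlidingWindowCounts s, CountNode c, int count) {
        CountNode n = new CountNode();                        // new count node just after c (at the head if c is null)
        n.count = count;
        n.prev = c;
        n.next = (c == null) ? s.minCount : c.next;
        if (n.next != null) n.next.prev = n; else s.maxCount = n;
        if (c != null) c.next = n; else s.minCount = n;
        return n;
    }
}
```

Note how every move touches only the word node, its current count node, and the adjacent count node, which is what keeps each update O(1).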
In the worst case, every word has a unique count, so the size of the count-list cannot be greater than the number of unique words. Also, the total size of the word-lists is exactly the number of unique words because every word is in exactly one word-list, and fully-expired words don't appear in the word-lists at all.
--- EDIT
This algorithm is a bit RAM-hungry, but it really shouldn't have any trouble holding an hour's worth of tweets, or even a day's worth. And the number of unique words is not going to change much after a few days, even considering abbreviations and misspellings. Even so, it's worth thinking about ways to reduce the memory footprint and/or make the algorithm parallel.
To reduce the memory footprint, the easiest thing is to just drop words which are still unique (i.e. have only been seen once) after a few minutes. This will dramatically cut down on the unique word count without altering the counts of popular words. Indeed, you could prune a lot more drastically without altering the final result.
To run the algorithm in parallel, individual words can be allocated to different machines by using a hash function to generate a machine number. (Not the same hash function as the one used to construct the hash tables.) Then the top k words can be found by merging the top k words from each machine; the allocation by hash guarantees that the set of words from each machine is distinct.
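A small sketch of that partition-and-merge step. The machine count, the per-machine results, and the merge by count are illustrative only; in practice each machine would run the data structure above and report its own top k.

```java
import java.util.*;
import java.util.stream.Collectors;

public class ShardedTopK {
    // Route each word to a machine by hashing the word itself
    // (a different hash function from the one used inside the hash tables).
    static int machineOf(String word, int machines) {
        return Math.floorMod(word.hashCode(), machines);
    }

    // Merge per-machine (word, count) results; the hash routing guarantees each
    // word is counted on exactly one machine, so no counts need to be summed.
    static List<String> mergeTopK(List<Map<String, Integer>> perMachine, int k) {
        return perMachine.stream()
                .flatMap(m -> m.entrySet().stream())
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("wilbur -> machine " + machineOf("wilbur", 2));

        // Illustrative per-machine results only.
        List<Map<String, Integer>> results = List.of(
                Map.of("the", 17, "wilbur", 12),
                Map.of("a", 17, "feature", 11));
        System.out.println(mergeTopK(results, 3)); // e.g. [the, a, wilbur]
    }
}
```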
This set of problems is called data stream algorithms. In your particular case there are two that fit: "Lossy Counting" and "Sticky Sampling". This is the paper that explains them, or this one, with pictures; this is a more simplified introduction.
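For concreteness, here is a compact sketch of Lossy Counting as it is usually described, counting over the whole stream (the sliding-window handling is discussed in the edit below; the class name and the way epsilon is passed in are mine):

```java
import java.util.*;
import java.util.stream.Collectors;

// A compact sketch of Lossy Counting: each stored count under-estimates the true
// frequency by at most epsilon * n, where n is the number of items seen so far.
class LossyCounter {
    private static class Entry { long count; long delta; }

    private final long bucketWidth;                      // ceil(1 / epsilon)
    private final Map<String, Entry> table = new HashMap<>();
    private long n = 0;                                  // items seen so far
    private long bucket = 1;                             // current bucket id

    LossyCounter(double epsilon) {
        this.bucketWidth = (long) Math.ceil(1.0 / epsilon);
    }

    void add(String word) {
        n++;
        Entry e = table.get(word);
        if (e != null) {
            e.count++;
        } else {
            e = new Entry();
            e.count = 1;
            e.delta = bucket - 1;                        // maximum possible undercount so far
            table.put(word, e);
        }
        if (n % bucketWidth == 0) {                      // bucket boundary: prune infrequent entries
            table.values().removeIf(x -> x.count + x.delta <= bucket);
            bucket++;
        }
    }

    // Approximate top-k by estimated count (estimates may be low by up to epsilon * n).
    List<String> topK(int k) {
        return table.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue().count, a.getValue().count))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```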
Edit (too long to fit into a comment):
Although these streaming algorithms do not discount expired data per se, one can, for instance, run 60 sliding windows, one for each minute of the hour, and then delete the oldest and create a new one every minute. The window on top is used for querying, the others for updates only. This gives you a 1-minute resolution.
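One way to sketch that rotation is below; the Summary interface is hypothetical, standing in for whatever per-window structure is used (e.g. an adapter around the Lossy Counting sketch above), and this is only my reading of the scheme.

```java
import java.util.*;
import java.util.function.Supplier;

// A sketch of the 60-rotating-summaries idea: every summary sees all updates,
// the oldest one covers roughly the last hour and answers queries, and once a
// minute the oldest is dropped and a fresh one is started.
class RotatingWindows {
    interface Summary {                    // hypothetical; e.g. wraps a LossyCounter
        void add(String word);
        List<String> topK(int k);
    }

    private final Deque<Summary> summaries = new ArrayDeque<>(); // oldest first
    private final Supplier<Summary> factory;

    RotatingWindows(Supplier<Summary> factory) {
        this.factory = factory;
        for (int i = 0; i < 60; i++) summaries.addLast(factory.get());
    }

    void add(String word) {                // every incoming word goes to all live summaries
        for (Summary s : summaries) s.add(word);
    }

    void rotate() {                        // call once a minute
        summaries.pollFirst();             // the oldest now covers more than an hour: drop it
        summaries.addLast(factory.get());  // and start a fresh one
    }

    List<String> topK(int k) {             // query the oldest summary, covering ~the last hour
        return summaries.peekFirst().topK(k);
    }
}
```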
Critics say that streaming algorithms are probabilistic and will not give you an exact count. While this is true (compare, for instance, with Rici's algorithm here), you do control the error rate and can make it very low if desired. As your stream grows, you would want to set it as a percentage of the stream size rather than as an absolute value.
Streaming algorithms are very memory efficient, which is the most important thing when crunching large streams in real time. Compare this with Rici's exact algorithm, which requires a single host to keep all data for the current sliding window in memory. It might not scale well: increase the rate from 100/s to 100k/s, or the window size from 1h to 7d, and you will run out of memory on a single host.
Hash tables, which are an essential part of Rici's algorithm, require one contiguous memory blob, which becomes more and more problematic as they grow.