Finding the most common three-item sequence in a very large file

耶瑟儿~ 2021-02-02 17:09

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page

5 answers
  •  无人共我
    2021-02-02 17:34

    For a quick approximate result, use hash tables as you intended, but attach a limited-size queue to each hash table so that the least recently used entries are dropped.
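One way to sketch the approximate approach is shown below. It is only an illustration under assumptions: the input is taken to be `(user_id, page_id)` pairs already in timestamp order, an `OrderedDict` stands in for "hash table plus LRU queue", and the names `approx_top_triple` and `max_entries` are made up for this example.

```python
from collections import OrderedDict, deque

def approx_top_triple(visits, max_entries=100_000):
    """visits: iterable of (user_id, page_id) pairs in timestamp order (assumed)."""
    recent = OrderedDict()   # user_id -> last 3 pages seen, LRU-bounded
    counts = OrderedDict()   # (p1, p2, p3) -> count, LRU-bounded
    for user, page in visits:
        window = recent.pop(user, None)
        if window is None:
            window = deque(maxlen=3)
        window.append(page)
        recent[user] = window              # re-insert as most recently used
        if len(recent) > max_entries:
            recent.popitem(last=False)     # drop the least recently used user
        if len(window) == 3:
            triple = tuple(window)
            counts[triple] = counts.pop(triple, 0) + 1
            if len(counts) > max_entries:
                counts.popitem(last=False) # drop the least recently used triple
    return max(counts.items(), key=lambda kv: kv[1]) if counts else None
```

Because entries are dropped once a table is full, a triple that reappears after being evicted starts counting from zero again, which is where the approximation error comes from.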

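    For an exact result, use an external sort to order the log entries by user ID, then combine every 3 consecutive entries of the same user into a triple and sort again, this time by page IDs. After the second sort, identical triples sit next to each other, so a single pass can count them.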
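The exact method can be sketched in memory as follows. This is a small-scale illustration only: the built-in `sorted` and `list.sort` stand in for the external sort that real log volumes would need, the record layout `(user_id, timestamp, page_id)` is an assumption, and `exact_top_triple` is a hypothetical name.

```python
from itertools import groupby

def exact_top_triple(records):
    """records: iterable of (user_id, timestamp, page_id) tuples (assumed layout)."""
    # Step 1: order by user, then by time (an external sort on real data).
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    # Step 2: emit every window of three consecutive pages per user.
    triples = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        pages = [r[2] for r in group]
        triples.extend(zip(pages, pages[1:], pages[2:]))
    # Step 3: sort the triples so equal ones are adjacent, then count runs.
    triples.sort()
    best, best_count = None, 0
    for triple, run in groupby(triples):
        n = sum(1 for _ in run)
        if n > best_count:
            best, best_count = triple, n
    return best, best_count
```

Only the sorted order is needed for the counting pass, so on large data the triples could be written to a file, sorted externally, and streamed back without ever holding a hash table of all distinct triples in memory.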

    Update (sort by timestamp)

    Some preprocessing may be needed to use the log files' timestamps properly:

    • If the log files are already sorted by timestamp, no preprocessing is needed.
    • If there are several log files (possibly coming from independent processes), and each file is already sorted by timestamp, open all of them and read them with a k-way merge (the merge step of merge sort).
    • If the files are almost sorted by timestamp (as when several independent processes write to a single log file), use a binary heap to get the data in the correct order.
    • If the files are not sorted by timestamp (which is unlikely in practice), use an external sort by timestamp.
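The second and third cases in the list above might look like this with Python's `heapq` module. The record layout `(timestamp, user_id, page_id)`, the function names, and the `max_disorder` bound are assumptions for the sake of the example. The heap-based reordering is only correct if no record is further than `max_disorder` positions from its sorted position.

```python
import heapq

def merge_sorted_logs(*streams):
    """Merge streams that are each already sorted by timestamp.
    Each stream yields (timestamp, user_id, page_id) tuples (assumed layout)."""
    return heapq.merge(*streams, key=lambda rec: rec[0])

def reorder_almost_sorted(stream, max_disorder):
    """Yield records in timestamp order, assuming no record is more than
    max_disorder positions away from where it belongs."""
    heap = []
    for seq, rec in enumerate(stream):
        # seq breaks ties between equal timestamps and keeps arrival order.
        heapq.heappush(heap, (rec[0], seq, rec))
        if len(heap) > max_disorder:
            yield heapq.heappop(heap)[2]
    while heap:
        yield heapq.heappop(heap)[2]
```

Both functions are lazy: they hold one record per input stream, or `max_disorder + 1` records at most, so memory use does not grow with file size.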

    Update 2 (improving the approximate method)

    The approximate method with an LRU queue should give quite good results for randomly distributed data. But webpage visits may follow different patterns at different times of day, or differ on weekends, and the original approach may give poor results for such data. To improve this, a hierarchical LRU queue may be used.

    Partition the LRU queue into log(N) smaller queues with sizes N/2, N/4, ... The largest one may contain any element, the next one only elements seen at least 2 times, the next one only elements seen at least 4 times, and so on. When an element is removed from a sub-queue, it is added to another one, so it lives in every sub-queue lower in the hierarchy before it is removed completely. Such a priority queue still has O(1) complexity, but gives a much better approximation of the most popular page.
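Below is one possible reading of the hierarchical queue, offered as a sketch and not as the answer's exact design. The assumptions are: level `i` holds at most `capacity / 2**(i+1)` keys and admits keys seen at least `2**i` times, a key pushed out of a level is demoted to the level directly below it, and a key pushed out of level 0 is forgotten. The class name `HierarchicalLRUCounter` and its parameters are hypothetical.

```python
from collections import OrderedDict

class HierarchicalLRUCounter:
    def __init__(self, capacity, levels=4):
        # Sub-queue sizes N/2, N/4, ... (each at least 1).
        self.caps = [max(1, capacity >> (i + 1)) for i in range(levels)]
        self.queues = [OrderedDict() for _ in range(levels)]
        self.level_of = {}   # key -> index of the sub-queue holding it

    def add(self, key):
        level = self.level_of.pop(key, None)
        count = 1 if level is None else self.queues[level].pop(key) + 1
        # A key seen at least 2**i times may live in level i.
        target = min(count.bit_length() - 1, len(self.queues) - 1)
        self.queues[target][key] = count   # inserted as most recently used
        self.level_of[key] = target
        # Cascade overflow downwards: at most one demotion per level.
        for i in range(target, -1, -1):
            while len(self.queues[i]) > self.caps[i]:
                old_key, old_count = self.queues[i].popitem(last=False)
                if i > 0:
                    self.queues[i - 1][old_key] = old_count
                    self.level_of[old_key] = i - 1
                else:
                    del self.level_of[old_key]   # evicted completely

    def most_common(self):
        items = [kv for q in self.queues for kv in q.items()]
        return max(items, key=lambda kv: kv[1]) if items else None
```

In this sketch a frequently seen key sits in a higher level, so a burst of one-off keys only churns level 0 and cannot push it out directly. Each `add` touches at most one key per level.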
