Parallel top ten algorithm for distributed data

难免孤独 2021-01-30 15:11

This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.

5 answers
  •  星月不相逢
    2021-01-30 15:40

    Pre-processing: Each computer processes its complete log file and builds a list of unique URLs with a visit count for each.
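A minimal sketch of this pre-processing step, assuming (hypothetically) that each log line holds exactly one visited URL. The function name `count_urls` and the sample log are illustrative and not part of the original answer.

```python
from collections import Counter
from typing import Iterable


def count_urls(lines: Iterable[str]) -> Counter:
    """Map each URL to its visit count (assumes one URL per log line)."""
    counts = Counter()
    for line in lines:
        url = line.strip()
        if url:  # skip blank lines
            counts[url] += 1
    return counts


# Hypothetical in-memory log; a real file object can be passed the same way,
# since the function only iterates over lines and never loads the whole file.
log = ["a.com/x", "b.com/y", "a.com/x", "", "c.com/z", "a.com/x"]
local_counts = count_urls(log)
ranked = local_counts.most_common()  # (url, count) pairs, highest count first
```

Sorting once with `most_common()` gives each machine the descending list it needs for the collation step.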

    Getting top URLs:

    1. Calculate URL counts on each computer
    2. Collate at a central (virtual) system
      • Send URLs with their counts to the central system one at a time in descending order of count (i.e. highest first)
      • At the central system, merge the incoming URL details into a master list, summing the counts for the same URL
      • Repeat until the sum of the counts from the latest round of incoming URLs (one per computer) is less than the count of the tenth URL in the master list. This check is vital for being certain that no unseen URL can still enter the top ten
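The collation loop can be sketched as follows. This is one reading of the steps above, similar in spirit to a threshold-style top-k merge, and not necessarily what the answer's author had in mind. It assumes each machine's list is already sorted by count in descending order, and that "the sum of all the counts from incoming URLs" means the sum over the latest round, one entry per machine. The name `collate` and the sample data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Ranked = List[Tuple[str, int]]  # (url, count) pairs sorted by count, descending


def collate(machines: List[Ranked], k: int = 10) -> Tuple[Counter, int]:
    """Pull one entry per machine per round until the stopping rule holds.

    Returns the master list of (possibly partial) counts and the rounds used.
    """
    master = Counter()
    rounds = 0
    depth = max((len(m) for m in machines), default=0)
    for i in range(depth):
        rounds += 1
        threshold = 0  # upper bound on the total count of any URL not yet seen
        for ranked in machines:
            if i < len(ranked):  # an exhausted machine contributes nothing
                url, count = ranked[i]
                master[url] += count
                threshold += count
        top = master.most_common(k)
        if len(top) == k and threshold < top[-1][1]:
            break  # no unseen URL can reach the k-th count any more
    return master, rounds


# Two hypothetical machines, top 2 instead of top 10 to keep the data small.
m1 = [("a", 50), ("b", 40), ("c", 5), ("d", 1)]
m2 = [("b", 45), ("a", 30), ("e", 4), ("c", 2)]
master, rounds = collate([m1, m2], k=2)
```

In this example the loop stops after two of the four possible rounds, so the long tail of rarely visited URLs never has to be sent to the central system.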

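    PS: At this point you have the top ten URLs across systems, though not necessarily in their final order, because the counts in the master list may still be partial. To get the actual order, reverse the collation: for each URL in the top ten, request its individual count from every computer, sum the counts, and rank by the totals.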
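A sketch of the reverse collation, assuming the central system can ask each machine for its exact count of a given URL. Here that lookup is modelled with plain dictionaries, and `final_ranking` is a hypothetical name. As a cautious design choice the example passes in every URL seen during collation rather than only the provisional top ten, since a URL's partial count can be lower than its true total.

```python
from typing import Dict, Iterable, List, Tuple


def final_ranking(candidates: Iterable[str],
                  local_counts: List[Dict[str, int]],
                  k: int = 10) -> List[Tuple[str, int]]:
    """Sum each candidate's exact count over all machines and return the top k."""
    totals = {
        url: sum(counts.get(url, 0) for counts in local_counts)
        for url in candidates
    }
    # Sort by total descending; break ties by URL so the order is deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical per-machine counts and candidate set, top 2 for brevity.
machine_counts = [
    {"a": 50, "b": 40, "c": 5},
    {"b": 45, "a": 30, "c": 2},
]
top = final_ranking(["a", "b", "c"], machine_counts, k=2)
```

Only the candidate URLs are looked up, so this step costs one small request per machine regardless of how large the log files are.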
