Parallel top ten algorithm for distributed data


难免孤独 2021-01-30 15:11

This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.

5 answers

星月不相逢 (original poster)

2021-01-30 15:40
Pre-processing: Each computer processes its complete log file and builds a list of unique URLs, each with its visit count.
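A minimal sketch of this pre-processing step, assuming each log line holds exactly one URL and that the set of unique URLs on one computer fits in memory. The function name `count_urls` is made up for illustration; a real log would need its own parsing.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_urls(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count URLs on one computer and return them in descending order of count.

    Assumption: one URL per line; blank lines are skipped.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    # Sort by count (highest first); break ties by URL so the order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Since a file object is itself an iterable of lines, `count_urls(open(path))` would stream the log rather than load it whole; only the per-URL counters stay in memory.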

Getting top URLs:
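1. Calculate the URL counts on each computer.
2. Collate at a central (virtual) system:
  - Each computer sends its URLs with their counts to the central system one at a time, in descending order of count (i.e. starting from the most visited).
  - The central system collates the incoming URL details into a master list, summing the counts reported for the same URL.
  - Repeat until the sum of the counts most recently sent by all computers is less than the count of the tenth URL in the master list. This stopping check is vital: no URL that has not yet been sent can have a total above that sum, so none of them can still reach the top ten.
PS: At this point the true top ten URLs are guaranteed to be in the master list, but the collated counts may still be partial, so the list is not necessarily in the final order. To get the actual order, reverse the collation: for each URL in the master list, request its count from every computer, sum the counts, and sort.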
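The collation and the reverse collation can be sketched as a single-process simulation. This is only an illustration under stated assumptions: `top_k_urls` and its parameters are hypothetical names, each computer is modelled as an in-memory list of `(url, count)` pairs already sorted by descending count, and no real network transfer takes place. The approach resembles the threshold-style top-k algorithms known from the database literature. Because a URL's count in the master list may be partial when the loop stops, the sketch fetches exact totals for every URL seen so far, not only for the current ten leaders.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def top_k_urls(machines: List[List[Tuple[str, int]]], k: int = 10) -> List[Tuple[str, int]]:
    """machines[i] is computer i's (url, count) list, sorted by count descending."""
    partial: Dict[str, int] = defaultdict(int)  # master list with collated (partial) counts
    rounds = max((len(m) for m in machines), default=0)

    for r in range(rounds):
        threshold = 0  # sum of the counts sent in this round
        for m in machines:
            if r < len(m):  # an exhausted computer contributes 0
                url, count = m[r]
                partial[url] += count
                threshold += count
        if len(partial) >= k:
            kth = sorted(partial.values(), reverse=True)[k - 1]
            # Any unseen URL totals at most `threshold`, so it cannot reach the top k.
            if threshold < kth:
                break

    # Reverse collation: ask every computer for its exact count of each candidate.
    lookup = [dict(m) for m in machines]
    exact = {url: sum(d.get(url, 0) for d in lookup) for url in partial}
    return sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

For example, with `[[("a", 10), ("x", 3), ("u", 3)], [("u", 9), ("y", 3)]]` and `k=1`, the loop stops after two rounds with `a` leading on partial counts (10 versus 9), but the reverse collation finds that `u` totals 12 and is the real winner.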