Find Top 10 Most Frequent visited URl, data is stored across network

后端 未结 2 1604
甜味超标
甜味超标 2021-01-31 12:18

Source: Google Interview Question

Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs

相关标签:
2条回答
  • 2021-01-31 13:07

    It says you can't use map-reduce directly which is a hint the author of the question wants you to think how map reduce works, so we will just mimic the actions of map-reduce:

    1. pre-processing: let R be the number of servers in cluster, give each server unique id from 0,1,2,...,R-1
    2. (map) For each (string,id) - send the tuple to the server which has the id hash(string) % R.
    3. (reduce) Once step 2 is done (simple control communication), produce the (string,count) of the top 10 strings per server. Note that the tuples where those sent in step2 to this particular server.
    4. (map) Each server will send all his top 10 to 1 server (let it be server 0). It should be fine, there are only 10*R of those records.
    5. (reduce) Server 0 will yield the top 10 across the network.

    Notes:

    • The problem with the algorithm, like most big-data algorithms that don't use frameworks is handling failing servers. MapReduce takes care of it for you.
    • The above algorithm can be translated to a 2 phases map-reduce algorithm pretty straight forward.
    0 讨论(0)
  • 2021-01-31 13:07

    In the worst case any algorithm, which does not require transmitting the whole frequency table, is going to fail. We can create a trivial case where the global top-10s are all at the bottom of every individual machines list.

    If we assume that the frequency of URIs follow Zipf's law, we can come up with effecive solutions. One such solution follows.

    Each machine sends top-K elements. K depends solely on the bandwidth available. One master machine aggregates the frequencies and finds the 10th maximum frequency value "V10" (note that this is a lower limit. Since the global top-10 may not be in top-K of every machine, the sum is incomplete).

    In the next step every machine sends a list of URIs whose frequency is V10/M (where M is the number of machines). The union of all such is sent back to every machine. Each machines, in turn, sends back the frequency for this particular list. A master aggregates this list into top-10 list.

    0 讨论(0)
提交回复
热议问题