Parallel top ten algorithm for distributed data

Asked by 难免孤独, 2021-01-30 15:11

This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.

5 Answers
  •  抹茶落季, answered 2021-01-30 15:18

    Given the scale of the log files and the generic nature of the question, this is quite a difficult problem to solve. I do not think there is one best algorithm for all situations. It depends on the nature of the contents of the log files. For example, take the corner case where every URL is unique across all log files. In that case, basically any solution will take a long time to reach that conclusion (if it ever gets that far...), and there is not even a meaningful answer to your question, because there is no top ten.

    I do not have a watertight algorithm that I can present, but I would explore a solution that uses histograms of hash values of the URLs as opposed to the URLs themselves. These histograms can be calculated with one-pass file reads, so the approach can deal with log files of arbitrary size. In pseudo-code, I would go for something like this:

    • Use a hash function with a limited target space (say 10,000 buckets; colliding hash values are expected) to calculate the hash value of each item in the log file, and count how many times each hash value occurs. Communicate the resulting histogram to a server (it is probably also possible to avoid a central server altogether by multicasting the result to every other node, but I will stick with the more obvious server approach here). A sketch of this step and the server-side merge follows the list.
    • The server should merge the histograms and communicate the result back. Depending on the distribution of the URLs, there might be a number of clearly visible peaks already, containing the top-visited URLs.
    • Each of the nodes should then focus on the peaks in the histogram. It should go through its log file again, use an additional hash function (again with a limited target space) to calculate a new hash histogram for those URLs whose first hash value falls in one of the peaks (the number of peaks to focus on is a parameter to be tuned, depending on the distribution of the URLs), and calculate a second histogram with the new hash values. The result should be communicated to the server.
    • The server should merge the results again and analyse the new histogram against the original one. Depending on how clearly the peaks stand out, it might already be able to draw conclusions about the two hash values of the top ten URLs. Otherwise it might have to instruct the machines to calculate more hash values with the second hash function, and probably after that run a third pass of hash calculations with yet another hash function. This continues until the collective set of histograms allows a conclusion about what the hash values of the peak URLs are, at which point the nodes can identify the actual URLs from those hash values (see the second sketch further below for one concrete refinement round).
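
    Below is a minimal sketch, in Python, of how the first two bullets could look: each node streams its log once and builds a bucket histogram, and the server merges the per-node histograms and picks the peaks. The choice of hash function, the bucket count of 10,000, and names such as node_histogram or top_buckets are illustrative assumptions, not a definitive implementation.

    ```python
    import hashlib

    NUM_BUCKETS = 10_000  # limited target space; hash collisions are expected


    def bucket(url: str, salt: str) -> int:
        """Salted hash into a small bucket space; each salt acts as a separate hash function."""
        digest = hashlib.sha1((salt + url).encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_BUCKETS


    def node_histogram(log_path: str, salt: str = "pass1") -> list[int]:
        """One streaming pass over a node's log file, counting hits per hash bucket."""
        counts = [0] * NUM_BUCKETS
        with open(log_path, encoding="utf-8") as f:
            for line in f:
                url = line.strip()
                if url:
                    counts[bucket(url, salt)] += 1
        return counts


    def merge_histograms(histograms: list[list[int]]) -> list[int]:
        """Server side: element-wise sum of the histograms received from all nodes."""
        merged = [0] * NUM_BUCKETS
        for hist in histograms:
            for i, count in enumerate(hist):
                merged[i] += count
        return merged


    def top_buckets(merged: list[int], k: int = 20) -> list[int]:
        """Pick the k most heavily hit buckets (the "peaks") to focus on in the next pass."""
        return sorted(range(NUM_BUCKETS), key=lambda i: merged[i], reverse=True)[:k]
    ```

    The reason the peaks are a safe place to start: a URL that is globally in the top ten must land in a globally heavy bucket, so the peaks form a superset of the candidates. The converse does not hold, which is exactly why the later refinement passes are needed.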

    Note that this mechanism will require tuning and optimization with regard to several aspects of the algorithm and the hash functions. It will also need orchestration by the server as to which calculations should be done at any time. It will probably also need some stopping criteria for the case where no conclusion can be drawn, in other words when the "spectrum" of URL hash values is too flat to make it worth continuing the calculations.
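
    To make the refinement rounds more concrete, here is a hedged sketch of one second-pass round and of a final identification step, assuming the server has narrowed things down to a handful of surviving (first hash, second hash) pairs. The salts "pass1"/"pass2" standing in for independent hash functions, and the exact-count final pass, are my own assumptions about how the nodes could "identify the different URLs"; the bucket() helper is the same as in the sketch above.

    ```python
    import hashlib
    from collections import Counter

    NUM_BUCKETS = 10_000  # same limited target space as in the first pass


    def bucket(url: str, salt: str) -> int:
        """Salted hash into a small bucket space; each salt plays the role of a separate hash function."""
        digest = hashlib.sha1((salt + url).encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_BUCKETS


    def node_refined_histogram(log_path: str, peak_buckets: set[int]) -> Counter:
        """Second pass on a node: for URLs whose first-pass bucket is a peak,
        histogram the (first hash, second hash) pair and report it to the server."""
        counts: Counter = Counter()
        with open(log_path, encoding="utf-8") as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                h1 = bucket(url, "pass1")
                if h1 in peak_buckets:
                    counts[(h1, bucket(url, "pass2"))] += 1
        return counts


    def node_candidate_counts(log_path: str, surviving_pairs: set[tuple[int, int]]) -> Counter:
        """Final pass on a node: exact per-URL counts, restricted to URLs whose
        (first hash, second hash) pair survived the server's peak analysis."""
        counts: Counter = Counter()
        with open(log_path, encoding="utf-8") as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                if (bucket(url, "pass1"), bucket(url, "pass2")) in surviving_pairs:
                    counts[url] += 1
        return counts


    def server_top_ten(per_node_counts: list[Counter]) -> list[tuple[str, int]]:
        """Server side: sum the exact counts from all nodes and read off the top ten."""
        total: Counter = Counter()
        for node_counts in per_node_counts:
            total.update(node_counts)
        return total.most_common(10)
    ```

    The final exact-count pass stays cheap because only the few URLs that hash into surviving pairs are counted, and summing exact per-node counts on the server gives the correct global ranking for those candidates.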

    This approach should work well as long as the distribution of URL visits has clear peaks, though. I suspect that, practically speaking, the question only makes sense in that case anyway.
