WordCount: how inefficient is McIlroy's solution?

Pieter21

The Unix script performs a few linear passes and two sorts, so it runs in O(n log n) time.

For taking only the top N there are the selection algorithms described by Knuth: http://en.wikipedia.org/wiki/Selection_algorithm. These offer several trade-offs in time and space complexity, and in theory they can be faster for typical inputs with a large number of (distinct) words.

So Knuth's approach could be faster, not least because the English vocabulary has limited size: it can turn the log(n) factor into a large constant, though perhaps at the cost of a lot of memory.
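To illustrate the idea (a sketch of heap-based selection, not Knuth's exact algorithm): once the counts sit in a hash table, a bounded heap extracts the top N in O(Q log N) time, without sorting all Q distinct words:

import heapq
from collections import Counter

def top_n(counts: Counter, n: int = 10):
    # Keeps a heap of only n items while scanning all Q entries:
    # O(Q log n) rather than the O(Q log Q) of a full sort.
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])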

But maybe this question is better suited for https://cstheory.stackexchange.com/

Doug McIlroy's solution has time complexity O(T log T), where T is the total number of words. This is due to the first sort.
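For reference, the pipeline in question (essentially as McIlroy published it in Jon Bentley's 1986 Programming Pearls column; the input redirection and the final sed 10q, which prints the top 10, are filled in here):

tr -cs A-Za-z '\n' < input.txt |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed 10q

The first sort orders all T words; the second sorts only the Q unique "count word" lines produced by uniq -c.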

For comparison, here are three faster solutions to the same problem:

First, a C++ implementation (a prefix tree plus a heap), with upper-bound time complexity O((T + N) log N), but in practice nearly linear, close to O(T + N log N).
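That implementation is not reproduced here, but the prefix-tree half of the approach is easy to sketch (in Python rather than C++, with made-up helper names): counting one occurrence costs time proportional to the word's length, so the counting phase is linear in the input size.

def make_node():
    # A trie node: per-letter child links, plus an occurrence count
    # for the word that ends at this node.
    return {"children": {}, "count": 0}

def trie_count(root, word):
    # Walk (and extend) the prefix tree one letter at a time.
    node = root
    for ch in word:
        node = node["children"].setdefault(ch, make_node())
    node["count"] += 1

Pairing such a counter with a bounded heap, like the one sketched above, gives the count-then-select structure behind the quoted bound.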

Below is a fast Python implementation. Internally, it uses a hash table (collections.Counter) and a heap, with time complexity O(T + N log Q), where Q is the number of unique words:

import collections, re, sys

filename = sys.argv[1]
k = int(sys.argv[2]) if len(sys.argv) > 2 else 10
reg = re.compile('[a-z]+')

# Count all T words with a hash-based Counter: O(T).
counts = collections.Counter()
with open(filename) as f:
    for line in f:
        counts.update(reg.findall(line.lower()))

# most_common(k) selects the top k entries via a heap.
for word, count in counts.most_common(k):
    print(word, count)
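To try it (assuming the script is saved as, say, topwords.py):

python3 topwords.py input.txt 10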

Finally, another Unix shell solution, using AWK; it reads the text from standard input and has time complexity O(T + Q log Q):

awk -v FS="[^a-zA-Z]+" '
{
    # Count every field (word), case-folded: O(T) in total.
    for (i = 1; i <= NF; i++)
        freq[tolower($i)]++;
}
END {
    # Emit "count word" pairs; only the Q unique words reach sort.
    for (word in freq)
        print(freq[word] " " word)
}
' | sort -rn | head -10

CPU time comparison (in seconds):

                                     bible32       bible256       Asymptotic complexity
C++ (prefix tree + heap)             5.659         44.730         O((T + N) log N)
Python (Counter)                     14.366        115.855        O(T + N log Q)
AWK + sort                           21.548        176.411        O(T + Q log Q)
McIlroy (tr + sort + uniq)           60.531        690.906        O(T log T)

Notes:

  • T >= Q, and typically Q >> N (N, the number of top words requested, is a small constant)
  • bible32 is the Bible concatenated 32 times (135 MB); bible256 is the same concatenated 256 times (1.1 GB)

As you can see, McIlroy's solution can easily be beaten in CPU time, even with standard Unix tools. Nevertheless, his solution is still very elegant and easy to debug, and its performance is not terrible either, unless you start using it on multi-gigabyte files. A bad implementation of a more complex algorithm in C/C++ or Haskell could easily run much slower than his pipeline (I have seen it happen!).
