Question
Long story short: in 1986 an interviewer asked Donald Knuth to write a program that takes a text and a number N as input and lists the N most-used words sorted by their frequencies. Knuth produced a 10-page Pascal program, to which Douglas McIlroy replied with the following 6-line shell script:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
Read the full story at http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/.
Of course they had very different goals: Knuth was showing his concept of literate programming and built everything from scratch, while McIlroy used a few common UNIX utilities to achieve the shortest source code.
My question is: how bad is that?
(Purely from a runtime-speed point of view, since I'm pretty sure we all agree that 6 lines of code is easier to understand/maintain than 10 pages, literate programming or not.)
I can understand that sort -rn | sed ${1}q may not be the most efficient way to extract the common words, but what's wrong with tr -cs A-Za-z '\n' | tr A-Z a-z? It looks pretty good to me. And about sort | uniq -c, is that a terribly slow way to determine the frequencies?
A few considerations:
- tr should be linear time (?)
- sort I'm not sure of, but I'm assuming it's not that bad
- uniq should be linear time too
- spawning processes should be linear time (in the number of processes)
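For intuition about the sort | uniq -c pair: the same frequency table can be built in a single hash-table pass, expected linear in the number of words, which is what the O(T log T) sort is paying extra for. A minimal Python sketch; the word list is a made-up stand-in for the output of the two tr stages:

```python
import collections

# Made-up token stream, as it would look after
# `tr -cs A-Za-z '\n' | tr A-Z a-z`: one lowercase word per item.
words = ["the", "quick", "fox", "the", "lazy", "dog", "the"]

# One pass over the stream, expected O(T):
# the same counts that `sort | uniq -c` produces in O(T log T).
counts = collections.Counter(words)

print(counts["the"])  # 3
print(counts["fox"])  # 1
```

The trade-off is memory: the hash table holds every distinct word at once, while sort can spill to disk.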
Answer 1:
The Unix script has a few linear operations and two sorts, so it is of order O(n log(n)) overall.
For Knuth's approach of taking only the top N, see http://en.wikipedia.org/wiki/Selection_algorithm — there are several options with different time and space complexity, and theoretically they can be faster for typical inputs with a large number of (distinct) words.
So Knuth could be faster, certainly because the English dictionary has limited size: that can turn the log(n) into some large constant, though maybe at the cost of a lot of memory.
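To make the selection idea concrete, here is a small Python sketch (not Knuth's actual program): a bounded heap picks the top N entries from an already-built frequency table in O(Q log N) instead of the O(Q log Q) of a full sort. The frequency table is a made-up example:

```python
import heapq

# Hypothetical frequency table; Q = 4 distinct words here.
counts = {"the": 3, "fox": 2, "quick": 1, "brown": 1}

# Bounded-heap selection of the top N = 2 by count: O(Q log N).
top2 = heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])
print(top2)  # [('the', 3), ('fox', 2)]
```

For small fixed N this is effectively linear in Q, which is the asymptotic edge over sorting the whole table.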
But maybe this question is better suited for https://cstheory.stackexchange.com/
Answer 2:
Doug McIlroy's solution has time complexity O(T log T), where T is the total number of words. This is due to the first sort.
For comparison, here are three faster solutions of the same problem:
Here is a C++ implementation with an upper-bound time complexity of O((T + N) log N), but practically nearly linear, close to O(T + N log N).
Below is a fast Python implementation. Internally, it uses a hash dictionary and a heap, with time complexity O(T + N log Q), where Q is the number of unique words:
import collections, re, sys

filename = sys.argv[1]
k = int(sys.argv[2]) if len(sys.argv) > 2 else 10

reg = re.compile('[a-z]+')
counts = collections.Counter()
for line in open(filename):
    # Lowercase, tokenize and count in one pass.
    counts.update(reg.findall(line.lower()))
for word, count in counts.most_common(k):
    print(word, count)
And another Unix shell solution using AWK. It has time complexity O(T + Q log Q):
awk -v FS="[^a-zA-Z]+" '
{
    for (i = 1; i <= NF; i++)
        freq[tolower($i)]++;
}
END {
    for (word in freq)
        print(freq[word] " " word)
}
' | sort -rn | head -10
CPU time comparison (in seconds):
                             bible32   bible256   Asymptotic
C++ (prefix tree + heap)       5.659     44.730   O((T + N) log N)
Python (Counter)              14.366    115.855   O(T + N log Q)
AWK + sort                    21.548    176.411   O(T + Q log Q)
McIlroy (tr + sort + uniq)    60.531    690.906   O(T log T)
Notes:
- T >= Q, and typically Q >> N (N is a small constant)
- bible32 is the Bible concatenated 32 times (135 MB); bible256 is 256 copies (1.1 GB)
As you can see, McIlroy's solution can easily be beaten in CPU time even using standard Unix tools. Still, his solution is very elegant, easy to debug and, after all, not terrible in performance either, unless you start using it on multi-gigabyte files. A bad implementation of a more complex algorithm in C/C++ or Haskell could easily run much slower than his pipeline (I've seen it!).
Source: https://stackoverflow.com/questions/25957835/wordcount-how-inefficient-is-mcilroys-solution