Edit: There are problems with this algorithm, specifically that recursively merging lists makes this a polynomial-runtime algorithm. But I'll leave it here as an example of a flawed algorithm.
You cannot discard any words from your chunks because there may be one word that exists 100 times in only one chunk, and another that exists one time in each of 100 different chunks.
But you can still work with chunks, in a way similar to a MapReduce algorithm. You map each chunk to a word list (including count), then you reduce by recursively merging the word lists into one.
In the map step, map each word to a count for each chunk. Sort alphabetically, not by count, and store the lists to disk. Now you can merge the lists pairwise linearly without keeping more than two words in memory (a Python sketch follows the steps):
1. Let A and B be the list files to merge, and let R be the result file.
2. Read one line with word+count from A; call the word `a`.
3. Read one line with word+count from B; call the word `b`.
4. Compare the words alphabetically:
   - If `a` = `b`:
     - Sum their counts.
     - Write the word and the new count to R.
     - Go to 2.
   - If `a` > `b`:
     - Write `b` and its count to R.
     - Read a new line `b` from B.
     - Go to 4.
   - If `a` < `b`:
     - Write `a` and its count to R.
     - Read a new line `a` from A.
     - Go to 4.
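A minimal sketch of this merge in Python (the one-pair-per-line `word count` file format and the function name are my assumptions; the tail loops handle the case where one file runs out before the other, which the steps above gloss over):

```python
def merge_counts(path_a, path_b, path_r):
    """Merge two alphabetically sorted 'word count' files into one.

    Only the current line of each input file is held in memory.
    """
    def read(f):
        line = f.readline()
        if not line:
            return None, 0
        word, count = line.split()
        return word, int(count)

    with open(path_a) as fa, open(path_b) as fb, open(path_r, "w") as fr:
        a, ca = read(fa)
        b, cb = read(fb)
        while a is not None and b is not None:
            if a == b:                    # same word: sum the counts
                fr.write(f"{a} {ca + cb}\n")
                a, ca = read(fa)
                b, cb = read(fb)
            elif a > b:                   # b comes first alphabetically
                fr.write(f"{b} {cb}\n")
                b, cb = read(fb)
            else:                         # a comes first alphabetically
                fr.write(f"{a} {ca}\n")
                a, ca = read(fa)
        # Copy through whatever is left in the longer file.
        while a is not None:
            fr.write(f"{a} {ca}\n")
            a, ca = read(fa)
        while b is not None:
            fr.write(f"{b} {cb}\n")
            b, cb = read(fb)
```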
Repeat this pairwise merge until all files are merged into a single list. Then you can scan the result list once and keep the ten most frequent words.
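Under the same assumptions, the map step and the final scan could look like the following sketch; the chunking, file naming, and whitespace tokenization are illustrative, not prescriptive:

```python
import heapq
from collections import Counter

def map_chunk(chunk_text, out_path):
    # Count the words of one chunk and write 'word count' lines
    # sorted alphabetically, so the file is ready for merging.
    counts = Counter(chunk_text.split())
    with open(out_path, "w") as f:
        for word in sorted(counts):
            f.write(f"{word} {counts[word]}\n")

def top_ten(merged_path):
    # One pass over the merged list, keeping only the ten largest counts.
    def pairs():
        with open(merged_path) as f:
            for line in f:
                word, count = line.split()
                yield int(count), word
    return heapq.nlargest(10, pairs())
```

Repeatedly applying `merge_counts` to pairs of the per-chunk files eventually leaves a single merged file, which `top_ten` then scans in one pass.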