Edit: There are problems with this algorithm, specifically that recursively merging lists makes this a polynomial-runtime algorithm. But I'll leave it here as an example of a flawed algorithm.
You cannot discard any words from your chunks because there may be one word that exists 100 times in only one chunk, and another that exists one time in each of 100 different chunks.
But you can still work with chunks, in a way similar to a MapReduce algorithm. You map each chunk to a word list (including count), then you reduce by recursively merging the word lists into one.
In the map step, map each word to a count for each chunk. Sort alphabetically, not by count, and store the lists to disk. Now you can merge the lists pairwise linearly without keeping more than two words in memory (a Python sketch follows the steps):
1. Let A and B be the list files to merge, and let R be the result file.
2. Read one line with word+count from A; call the word `a`.
3. Read one line with word+count from B; call the word `b`.
4. Compare the words alphabetically:
   - If `a` = `b`:
     - Sum their counts.
     - Write the word and the new count to R.
     - Go to 2.
   - If `a` > `b`:
     - Write `b` and its count to R.
     - Read a new line `b` from B.
     - Go to 4.
   - If `a` < `b`:
     - Write `a` and its count to R.
     - Read a new line `a` from A.
     - Go to 4.
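A minimal sketch of this merge in Python (the one-pair-per-line `word count` file format and the function name are my assumptions; the tail loops handle the case where one file runs out before the other, which the steps above gloss over):

```python
def merge_counts(path_a, path_b, path_r):
    """Merge two alphabetically sorted 'word count' files into one.

    Only the current line of each input file is held in memory.
    """
    def read(f):
        line = f.readline()
        if not line:
            return None, 0
        word, count = line.split()
        return word, int(count)

    with open(path_a) as fa, open(path_b) as fb, open(path_r, "w") as fr:
        a, ca = read(fa)
        b, cb = read(fb)
        while a is not None and b is not None:
            if a == b:                    # same word: sum the counts
                fr.write(f"{a} {ca + cb}\n")
                a, ca = read(fa)
                b, cb = read(fb)
            elif a > b:                   # b comes first alphabetically
                fr.write(f"{b} {cb}\n")
                b, cb = read(fb)
            else:                         # a comes first alphabetically
                fr.write(f"{a} {ca}\n")
                a, ca = read(fa)
        # Copy through whatever is left in the longer file.
        while a is not None:
            fr.write(f"{a} {ca}\n")
            a, ca = read(fa)
        while b is not None:
            fr.write(f"{b} {cb}\n")
            b, cb = read(fb)
```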
Repeat this pairwise merge until all files are merged into a single list. Then you can scan the result list once and keep the ten most frequent words.
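Under the same assumptions, the map step and the final scan could look like the following sketch; the chunking, file naming, and whitespace tokenization are illustrative, not prescriptive:

```python
import heapq
from collections import Counter

def map_chunk(chunk_text, out_path):
    # Count the words of one chunk and write 'word count' lines
    # sorted alphabetically, so the file is ready for merging.
    counts = Counter(chunk_text.split())
    with open(out_path, "w") as f:
        for word in sorted(counts):
            f.write(f"{word} {counts[word]}\n")

def top_ten(merged_path):
    # One pass over the merged list, keeping only the ten largest counts.
    def pairs():
        with open(merged_path) as f:
            for line in f:
                word, count = line.split()
                yield int(count), word
    return heapq.nlargest(10, pairs())
```

Repeatedly applying `merge_counts` to pairs of the per-chunk files eventually leaves a single merged file, which `top_ten` then scans in one pass.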