Recently I came across an interview question to create an algorithm in any language which should do the following:
You can try a map-reduce approach for this task. The advantage of map-reduce is scalability: the same approach works for 1TB, 10TB, or 1PB, and you will not need to do a lot of work to adapt your algorithm to the new scale. The framework will also take care of distributing the work among all the machines (and cores) in your cluster.
First - create the (word, occurrences) pairs.
The pseudo code for this will be something like this:

map(document):
    for each word w in document:
        EmitIntermediate(w, "1")

reduce(word, list<values>):
    Emit(word, size(list))
Second, you can find the ones with the top-K highest occurrences easily with a single iteration over the pairs. This thread explains the concept. The main idea is to hold a min-heap of the top K elements, and while iterating, make sure the heap always contains the top K elements seen so far. When you are done, the heap contains the top K elements.
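A minimal sketch of the min-heap idea in Java (class and method names are illustrative, not from the thread):

```java
import java.util.*;

public class TopK {
    // Keep a min-heap of at most k (word, count) entries; the root is the
    // smallest count among the current top k, so any larger count evicts it.
    public static List<String> topK(Map<String, Integer> counts, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Comparator.comparingInt(Map.Entry::getValue));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll(); // drop the current minimum
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // highest count first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 50); counts.put("quick", 3);
        counts.put("fox", 7);  counts.put("dog", 12);
        System.out.println(topK(counts, 2)); // [the, dog]
    }
}
```

This is O(n log k) over the pairs rather than O(n log n), and only ever holds k entries in the heap.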
A more scalable (though slower, if you have few machines) alternative is to use the map-reduce sorting functionality: sort the data according to the occurrences, and just grep the top K.
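In single-machine terms, the sort-then-take approach looks something like this (an illustrative sketch, not the distributed version):

```java
import java.util.*;
import java.util.stream.*;

public class SortTopK {
    // Sort all (word, count) pairs by count descending and keep the first k.
    // O(n log n) versus the heap's O(n log k), but it maps directly onto a
    // distributed sort when expressed as a map-reduce job.
    public static List<String> topK(Map<String, Integer> counts, int k) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topK(Map.of("the", 50, "fox", 7, "dog", 12), 2)); // [the, dog]
    }
}
```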
I'd be quite tempted to use a DAWG (wikipedia, and a C# writeup with more details). It's simple enough to add a count field on the leaf nodes, it's memory-efficient, and it performs very well for lookups.
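A DAWG is essentially a minimized trie; a plain (unminimized) trie with a count on terminal nodes shows the same "count field on the node" idea in a few lines. This is a sketch of that simpler structure, not a full DAWG with suffix sharing:

```java
import java.util.*;

public class CountingTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count = 0; // occurrences of the word ending at this node
    }

    private final Node root = new Node();

    // Walk (and build) the path for the word, then bump its terminal count.
    public void add(String word) {
        Node n = root;
        for (char c : word.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.count++;
    }

    // Follow the path; a missing edge means the word was never seen.
    public int count(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return 0;
        }
        return n.count;
    }

    public static void main(String[] args) {
        CountingTrie t = new CountingTrie();
        for (String w : "the cat and the dog and the bird".split(" ")) t.add(w);
        System.out.println(t.count("the")); // 3
        System.out.println(t.count("cat")); // 1
    }
}
```

The memory win of a real DAWG comes from merging shared suffixes, which a trie like this does not do.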
EDIT: Though have you tried simply using a Dictionary<string, int>, where string and int represent the word and its count? Perhaps you're trying to optimize too early?
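The same Dictionary<string, int> idea expressed in Java (a HashMap<String, Integer>; the tokenization here is a naive whitespace split, just for illustration):

```java
import java.util.*;

public class DictCount {
    // One pass over the text: word -> count.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.split("\\s+"))
            counts.merge(w, 1, Integer::sum); // insert 1 or add 1 to existing
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be").get("to")); // 2
    }
}
```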
editor's note: This post originally linked to a wikipedia article which appears to be about another meaning of the term DAWG: a way of storing all substrings of one word, for efficient approximate string matching.
MapReduce
WordCount can be achieved efficiently through MapReduce using Hadoop.
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
Large files can be parsed with it. It uses multiple nodes in the cluster to perform this operation.
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Three things of note for this. Specifically: the file is too large to hold in memory, the word list is (potentially) too large to hold in memory, and the word count can be too large for a 32-bit int.

Once you get past those caveats, it should be straightforward. The game is managing the potentially large word list.
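One way to manage a word list that won't fit in memory is to hash-partition the words into spill files, then count each partition on its own: every occurrence of a given word lands in the same partition, so per-partition counts are exact. This is a sketch under those assumptions (names are illustrative), using long counts so they can't overflow a 32-bit int:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class PartitionedCount {
    public static Map<String, Long> count(Path input, int partitions) throws IOException {
        Path dir = Files.createTempDirectory("wc");
        // Pass 1: stream the big file line by line, routing each word to
        // partition file hash(word) mod P.
        BufferedWriter[] out = new BufferedWriter[partitions];
        for (int i = 0; i < partitions; i++)
            out[i] = Files.newBufferedWriter(dir.resolve("part-" + i));
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null)
                for (String w : line.split("\\s+"))
                    if (!w.isEmpty()) {
                        int p = Math.floorMod(w.hashCode(), partitions);
                        out[p].write(w);
                        out[p].newLine();
                    }
        }
        for (BufferedWriter w : out) w.close();
        // Pass 2: count each (small) partition with an in-memory map.
        // Merged into one map here only for the demo; in practice you'd
        // emit each partition's counts (or its top-K candidates) separately.
        Map<String, Long> counts = new HashMap<>();
        for (int i = 0; i < partitions; i++)
            try (BufferedReader in = Files.newBufferedReader(dir.resolve("part-" + i))) {
                String w;
                while ((w = in.readLine()) != null)
                    counts.merge(w, 1L, Long::sum);
            }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("in", ".txt");
        Files.write(f, Arrays.asList("the cat and the dog", "the bird"));
        System.out.println(count(f, 4).get("the")); // 3
    }
}
```

Choose P so that the largest partition's distinct words fit in memory; this is essentially what the map-reduce shuffle does for you automatically.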
If it's any easier (to keep your head from spinning), imagine: "You're running a Z-80 8-bit machine with 65K of RAM and have a 1MB file..."

It's the exact same problem.
It depends on the requirements, but if you can afford some error, streaming algorithms and probabilistic data structures can be interesting because they are very time- and space-efficient and quite simple to implement; for instance, a count-min sketch for approximate per-word counts, or a heavy-hitters algorithm for the top K.
Those data structures require only very little constant space (exact amount depends on error you can tolerate).
See http://alex.smola.org/teaching/berkeley2012/streams.html for an excellent description of these algorithms.
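A minimal count-min sketch in Java, as an illustration of the constant-space idea (the hash mixing here is a simplification; real implementations use pairwise-independent hash families):

```java
import java.util.*;

public class CountMinSketch {
    // d hash rows of width w. add() increments one cell per row; estimate()
    // takes the minimum over rows, so counts are never underestimated, only
    // (rarely) overestimated due to collisions. Space is d*w cells, fixed.
    private final int depth, width;
    private final long[][] table;
    private final int[] seeds;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random r = new Random(42); // fixed seed keeps the demo deterministic
        for (int i = 0; i < depth; i++) seeds[i] = r.nextInt();
    }

    private int bucket(String s, int row) {
        int h = s.hashCode() ^ seeds[row];
        h *= 0x9E3779B1; // cheap avalanche mix so rows collide differently
        return Math.floorMod(h, width);
    }

    public void add(String s) {
        for (int i = 0; i < depth; i++) table[i][bucket(s, i)]++;
    }

    public long estimate(String s) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++)
            min = Math.min(min, table[i][bucket(s, i)]);
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(4, 1024);
        for (String w : "the cat and the dog and the bird".split(" ")) cms.add(w);
        System.out.println(cms.estimate("the")); // never less than the true count of 3
    }
}
```

Larger width lowers the overestimation error; larger depth lowers the probability of a bad estimate, independent of how many distinct words the stream contains.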
The method below will only read your data once and can be tuned for memory sizes. Theoretically you might miss words, although I think that chance is very small.