Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question asking for an algorithm, in any language, that does the following:

  1. Read 1 terabyte of content
  2. Make a co
16 Answers
  • 2020-11-30 17:45

    You can try a map-reduce approach for this task. The advantage of map-reduce is scalability: the same approach works for 1 TB, 10 TB, or 1 PB, and you will not need to do much work to adapt your algorithm to the new scale. The framework also takes care of distributing the work among all the machines (and cores) in your cluster.

    First, create the (word, occurrences) pairs.
    The pseudocode for this will be something like this:

    map(document):
      for each word w:
         EmitIntermediate(w,"1")
    
    reduce(word,list<val>):
       Emit(word,size(list))
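
    As a rough illustration only, the pseudocode above can be simulated on a single machine in Python. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and splitting on whitespace is an assumption about what counts as a word; a real framework would run the map and reduce steps on different machines and do the grouping itself.

```python
from collections import defaultdict


def map_phase(document):
    """Emit an intermediate (word, 1) pair for every word in the document."""
    for word in document.split():  # assumption: words are whitespace-separated
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for word, value in pairs:
        grouped[word].append(value)
    return grouped


def reduce_phase(word, values):
    """Collapse all values emitted for one word into a single (word, count) pair."""
    return word, len(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(w, vals) for w, vals in shuffle(pairs).items())
```

    Calling `word_count(["a b a", "b c"])` counts the words across both documents.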
    

    Second, you can easily find the K words with the highest occurrence counts in a single pass over the pairs. This thread explains this concept. The main idea is to hold a min-heap of K elements and, while iterating, make sure the heap always contains the top K elements seen so far. When you are done, the heap contains the top K elements.
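
    A minimal Python sketch of the min-heap idea, assuming the (word, count) pairs are available as an iterable; `top_k` is a hypothetical helper name, not something from the linked thread.

```python
import heapq


def top_k(pairs, k):
    """Return the k (word, count) pairs with the highest counts, largest first."""
    if k <= 0:
        return []
    heap = []  # min-heap of (count, word); heap[0] is the weakest of the current top k
    for word, count in pairs:
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif (count, word) > heap[0]:
            # The newcomer beats the weakest kept element, so swap them.
            heapq.heapreplace(heap, (count, word))
    return [(word, count) for count, word in sorted(heap, reverse=True)]
```

    The heap never holds more than K entries, so memory stays proportional to K no matter how many pairs are scanned.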

    A more scalable alternative (though slower if you have few machines) is to use map-reduce's sorting functionality: sort the data by occurrence count and take the top K.

  • 2020-11-30 17:47

    I'd be quite tempted to use a DAWG (Wikipedia, and a C# writeup with more details). It's simple enough to add a count field on the leaf nodes, it is memory-efficient, and it performs very well for lookups.

    EDIT: Though have you tried simply using a Dictionary<string, int>, where <string, int> represents word and count? Perhaps you're trying to optimize too early?
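
    For comparison, here is what the plain dictionary approach might look like in Python, where `collections.Counter` plays the role of `Dictionary<string, int>`. The regular expression defining a "word" and the name `count_words` are assumptions made for this sketch. Only the counts are held in memory; the text itself is streamed line by line.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits or underscores, compared case-insensitively.
WORD_RE = re.compile(r"\w+")


def count_words(lines):
    """Count words from any iterable of text lines without loading the whole input."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts
```

    A file object can be passed directly, for example `with open(path, encoding="utf-8") as f: counts = count_words(f)`, where `path` is whatever file you are reading. This only works as long as the set of distinct words fits in memory.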

    Editor's note: This post originally linked to this Wikipedia article, which appears to be about another meaning of the term DAWG: a way of storing all substrings of one word, for efficient approximate string-matching.

  • 2020-11-30 17:47

    MapReduce
    WordCount can be achieved efficiently through MapReduce using Hadoop: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
    Large files can be parsed with it; it uses multiple nodes in the cluster to perform the operation.

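    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // The job driver (main method) from the tutorial is omitted here.
    public class WordCount {
      // Mapper: emits (word, 1) for every whitespace-separated token in a line.
      public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          StringTokenizer tokenizer = new StringTokenizer(value.toString());
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reducer: sums the 1s emitted for each word.
      public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
    }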
    
  • 2020-11-30 17:48

    Three things of note for this.

    Specifically: the file is too large to hold in memory, the word list is (potentially) too large to hold in memory, and a word count can be too large for a 32-bit int.
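
    The third point can be checked with a little arithmetic. The sketch below assumes the decimal definition of a terabyte and a worst case in which the file is a single one-byte word repeated with a one-byte separator; both are assumptions made for illustration.

```python
INT32_MAX = 2**31 - 1   # largest signed 32-bit value: 2,147,483,647
FILE_BYTES = 10**12     # 1 TB, taking the decimal definition

# Worst case for a single counter: a 1-byte word followed by a 1-byte
# separator, repeated for the whole file (2 bytes per occurrence).
max_occurrences = FILE_BYTES // 2

overflows_int32 = max_occurrences > INT32_MAX   # True: a 32-bit counter is not enough
fits_int64 = max_occurrences <= 2**63 - 1       # True: a 64-bit counter is
```

    So a single very common short word can plausibly exceed a 32-bit counter, while a 64-bit counter has ample headroom.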

    Once you get through those caveats, it should be straightforward. The game is managing the potentially large word list.

    If it makes it any easier (to keep your head from spinning), think of it this way:

    "You're running a Z-80 8-bit machine, with 65K of RAM and have a 1MB file..."

    Same exact problem.

  • 2020-11-30 17:52

    It depends on the requirements, but if you can afford some error, streaming algorithms and probabilistic data structures can be interesting: they are very time- and space-efficient and quite simple to implement. For instance:

    • Heavy hitters (e.g., Space Saving), if you are interested only in the top n most frequent words
    • Count-min sketch, to get an estimated count for any word
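
    As an illustration of the first bullet, here is a simplified Python sketch of Space Saving. The function name and return shape are assumptions for this example, and the linear scan for the smallest counter is a simplification; efficient implementations keep the counters in a structure that finds the minimum in constant time.

```python
def space_saving(stream, capacity):
    """Approximate heavy hitters using at most `capacity` counters (Space Saving)."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    counts = {}  # word -> estimated count (never lower than the true count)
    errors = {}  # word -> how much the estimate may exceed the true count
    for word in stream:
        if word in counts:
            counts[word] += 1
        elif len(counts) < capacity:
            counts[word] = 1
            errors[word] = 0
        else:
            # Evict the word with the smallest count; the newcomer inherits that count.
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[word] = floor + 1
            errors[word] = floor
    return counts, errors
```

    For any word still monitored at the end, the true count lies between `counts[word] - errors[word]` and `counts[word]`.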

    Those data structures require only a small, constant amount of space (the exact amount depends on the error you can tolerate).
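
    A count-min sketch can be outlined in a few lines of Python. The class below is a minimal sketch, not a tuned implementation: the default `width` and `depth` are arbitrary illustrative values, and using salted BLAKE2b as the family of hash functions is an assumption made for convenience.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=2000, depth=5):
        self.width = width    # counters per row; more columns means fewer collisions
        self.depth = depth    # number of rows, each with its own hash function
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, word):
        # One column per row, from a hash salted with the row number.
        data = word.encode("utf-8")
        for row in range(self.depth):
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, word, count=1):
        for row, col in self._columns(word):
            self.table[row][col] += count

    def estimate(self, word):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._columns(word))
```

    Memory use is `width * depth` counters regardless of how many distinct words appear; the price is that `estimate` may overcount when words collide in every row.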

    See http://alex.smola.org/teaching/berkeley2012/streams.html for an excellent description of these algorithms.

  • 2020-11-30 17:53

    The method below will only read your data once and can be tuned for memory sizes.

    • Read the file in chunks of, say, 1 GB
    • For each chunk, make a list of, say, the 5000 most frequent words with their frequencies
    • Merge the lists based on frequency (1000 lists with 5000 words each)
    • Return the top 10 of the merged list

    Theoretically you might miss words, although I think the chance of that is very small.
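
    The steps above might be sketched in Python as follows. The helper names are hypothetical, chunks are measured in words rather than bytes to keep the example short, and merging is assumed to mean summing the per-chunk frequencies. As noted, the result is approximate: a word that narrowly misses every per-chunk list is never counted.

```python
from collections import Counter
from itertools import islice


def chunks(words, chunk_size):
    """Yield successive lists of up to chunk_size words from an iterable."""
    iterator = iter(words)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def approximate_top(words, chunk_size, keep_per_chunk, k):
    """Keep each chunk's most frequent words, sum those lists, return the top k."""
    merged = Counter()
    for chunk in chunks(words, chunk_size):
        # Only this chunk's keep_per_chunk most frequent words survive the merge.
        merged.update(dict(Counter(chunk).most_common(keep_per_chunk)))
    return merged.most_common(k)
```

    With the numbers from the list, this would be called with `keep_per_chunk=5000` and `k=10`, and a `chunk_size` chosen so that one chunk's counts fit comfortably in memory.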
