Parsing one terabyte of text and efficiently counting the number of occurrences of each word

野趣味 2020-11-30 17:21

Recently I came across an interview question to create an algorithm in any language which should do the following:

  1. Read 1 terabyte of content
  2. Make a count of each recurring word in that content
  3. List the top 10 most frequently occurring words
16 Answers
  • 2020-11-30 17:54

    A different solution could be to use an SQL table and let the database engine handle the data as well as it can. First create a table with a single field, word, and insert one row for each word in the collection.

    Then run the following query (apologies for any syntax issues, my SQL is rusty - this is really pseudo-code):

    SELECT word, COUNT(*) AS c FROM myTable GROUP BY word ORDER BY c DESC


    The general idea is to first generate a table (which is stored on disk) with all the words, and then use a single query to count and sort the (word, occurrence) pairs for you. You can then just take the top K from the retrieved list.


    To all: if there really are syntax or other issues in the SQL statement, feel free to edit.
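
    For what it's worth, here is a rough C# sketch of reading the top 10 back from such a table. It assumes SQL Server (TOP 10 is SQL Server syntax), a myTable(word) table that has already been populated, and a placeholder connection string:

    using System;
    using System.Data.SqlClient;

    class TopWordsFromSql
    {
        static void Main()
        {
            // placeholder connection string - replace with your own
            string connStr = "Server=.;Database=words;Integrated Security=true";
            string sql = "SELECT TOP 10 word, COUNT(*) AS c FROM myTable " +
                         "GROUP BY word ORDER BY c DESC";

            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        Console.WriteLine("{0} - {1} times",
                                          reader.GetString(0), reader.GetInt32(1));
                    }
                }
            }
        }
    }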

  • 2020-11-30 17:54

    Try to think of a special data structure to approach this kind of problem. In this case a special kind of tree, such as a trie, stores strings in a very efficient way. A second way is to build your own solution for plain word counting: I would guess this terabyte of data is in English, and there are only around 600,000 words in general use, so it is possible to store only those words and count how often each one repeats. This second solution will also need a regex to eliminate special characters (a rough sketch of it follows below). The first solution will be faster, I'm pretty sure.

    http://en.wikipedia.org/wiki/Trie

    Here is an implementation of a trie in Java:
    http://algs4.cs.princeton.edu/52trie/TrieST.java.html
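
    And a minimal C# sketch of the second idea (plain dictionary counting with a regex filter). The file name "input.txt" is just a placeholder, and it assumes a single streaming pass over the text is acceptable:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;

    class SimpleWordCount
    {
        static void Main()
        {
            // anything that is not a letter acts as a separator
            Regex separator = new Regex("[^a-zA-Z]+");
            var counts = new Dictionary<string, long>();

            foreach (string line in File.ReadLines("input.txt")) // placeholder path
            {
                foreach (string word in separator.Split(line.ToLowerInvariant()))
                {
                    if (word.Length == 0) continue;
                    long n;
                    counts.TryGetValue(word, out n);
                    counts[word] = n + 1;
                }
            }

            Console.WriteLine("{0} distinct words", counts.Count);
            foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
                Console.WriteLine("{0} - {1} times", pair.Key, pair.Value);
        }
    }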

  • 2020-11-30 17:57

    Storm is the technology to look at. It separates the role of data input (spouts) from that of the processors (bolts). Chapter 2 of the Storm book solves your exact problem and describes the system architecture very well - http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010

    Storm is more about real-time processing, as opposed to batch processing with Hadoop. If your data is pre-existing you can distribute the load to different spouts and spread it across different bolts for processing.

    This approach will also support data growing beyond terabytes, since the data will be analysed as it arrives, in real time.
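
    This is not Storm code, but a toy illustration of the spout/bolt split using plain .NET primitives: one "spout" task feeds lines into a bounded queue and several "bolt" tasks count words from it (the file name and the bolt count are arbitrary placeholders):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    class SpoutBoltSketch
    {
        static void Main()
        {
            var lines = new BlockingCollection<string>(boundedCapacity: 10000);

            // "spout": emits lines from the source file (placeholder path)
            Task spout = Task.Run(() =>
            {
                foreach (string line in File.ReadLines("input.txt"))
                    lines.Add(line);
                lines.CompleteAdding();
            });

            // "bolts": consume lines and keep their own local counts
            var bolts = new List<Task<Dictionary<string, long>>>();
            for (int i = 0; i < 4; i++)
            {
                bolts.Add(Task.Run(() =>
                {
                    var local = new Dictionary<string, long>();
                    foreach (string line in lines.GetConsumingEnumerable())
                        foreach (string w in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                        {
                            long n;
                            local.TryGetValue(w, out n);
                            local[w] = n + 1;
                        }
                    return local;
                }));
            }

            spout.Wait();
            Task.WaitAll(bolts.ToArray());

            // merge the per-bolt dictionaries and print the top 10
            var total = new Dictionary<string, long>();
            foreach (var local in bolts.Select(b => b.Result))
                foreach (var pair in local)
                {
                    long n;
                    total.TryGetValue(pair.Key, out n);
                    total[pair.Key] = n + pair.Value;
                }

            foreach (var pair in total.OrderByDescending(p => p.Value).Take(10))
                Console.WriteLine("{0} - {1} times", pair.Key, pair.Value);
        }
    }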

  • 2020-11-30 17:59

    Well, the first thought is to manage a database in the form of a hashtable/array or whatever to save each word's occurrence count, but given the data size I would rather:

    • Get the first 10 words found whose occurrence count is >= 2
    • Get how many times these words occur in the entire string, and delete them while counting
    • Start again; once you have two sets of 10 words each, take the 10 most frequent words of both sets
    • Do the same for the rest of the string (which doesn't contain these words anymore).

    You can even try to be more efficient and start with the first 10 words found whose occurrence count is >= 5, for example, or more; if none are found, reduce this value until 10 words are found. This way you have a good chance of avoiding the memory-intensive step of saving all word occurrences, which is a huge amount of data, and you can save scanning rounds (in a good case).

    But in the worst case you may have more rounds than in a conventional algorithm.

    By the way, it's a problem I would try to solve with a functional programming language rather than OOP.

  • 2020-11-30 18:05

    First, I only recently "discovered" the Trie data structure and zeFrenchy's answer was great for getting me up to speed on it.

    I did see in the comments several people making suggestions on how to improve its performance, but these were only minor tweaks, so I thought I'd share with you what I found to be the real bottleneck... the ConcurrentDictionary.

    I'd wanted to play around with thread-local storage, and your sample gave me a great opportunity to do that. After some minor changes to use a dictionary per thread, and then combining the dictionaries after the Join(), I saw performance improve by ~30% (processing 20MB 100 times across 8 threads went from ~48 sec to ~33 sec on my box).

    The code is pasted below and you'll notice not much changed from the approved answer.

    P.S. I don't have more than 50 reputation points so I could not put this in a comment.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Threading;
    
    namespace WordCount
    {
        class MainClass
        {
            public static void Main(string[] args)
            {
                Console.WriteLine("Counting words...");
                DateTime start_at = DateTime.Now;
                Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();
                if (args.Length == 0)
                {
                    args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
                                          "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
                }
    
                List<ThreadLocal<TrieNode>> roots;
                if (args.Length == 0)
                {
                    roots = new List<ThreadLocal<TrieNode>>(1);
                }
                else
                {
                    roots = new List<ThreadLocal<TrieNode>>(args.Length);
    
                    foreach (string path in args)
                    {
                        ThreadLocal<TrieNode> root = new  ThreadLocal<TrieNode>(() =>
                        {
                            return new TrieNode(null, '?');
                        });
    
                        roots.Add(root);
    
                        DataReader new_reader = new DataReader(path, root);
                        Thread new_thread = new Thread(new_reader.ThreadRun);
                        readers.Add(new_reader, new_thread);
                        new_thread.Start();
                    }
                }
    
                foreach (Thread t in readers.Values) t.Join();
    
                foreach(ThreadLocal<TrieNode> root in roots.Skip(1))
                {
                    roots[0].Value.CombineNode(root.Value);
                    root.Dispose();
                }
    
                DateTime stop_at = DateTime.Now;
                Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
                Console.WriteLine();
                Console.WriteLine("Most commonly found words:");
    
                List<TrieNode> top10_nodes = Enumerable.Repeat(roots[0].Value, 10).ToList();
                int distinct_word_count = 0;
                int total_word_count = 0;
                roots[0].Value.GetTopCounts(top10_nodes, ref distinct_word_count, ref total_word_count);
    
                top10_nodes.Reverse();
                foreach (TrieNode node in top10_nodes)
                {
                    Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
                }
    
                roots[0].Dispose();
    
                Console.WriteLine();
                Console.WriteLine("{0} words counted", total_word_count);
                Console.WriteLine("{0} distinct words found", distinct_word_count);
                Console.WriteLine();
                Console.WriteLine("done.");
                Console.ReadLine();
            }
        }
    
        #region Input data reader
    
        public class DataReader
        {
            static int LOOP_COUNT = 100;
            private TrieNode m_root;
            private string m_path;
    
            public DataReader(string path, ThreadLocal<TrieNode> root)
            {
                m_root = root.Value;
                m_path = path;
            }
    
            public void ThreadRun()
            {
                for (int i = 0; i < LOOP_COUNT; i++) // fake a large data set by parsing a smaller file multiple times
                {
                    using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
                    using (StreamReader sreader = new StreamReader(fstream))
                    {
                        string line;
                        while ((line = sreader.ReadLine()) != null)
                        {
                            string[] chunks = line.Split(null);
                            foreach (string chunk in chunks)
                            {
                                m_root.AddWord(chunk.Trim());
                            }
                        }
                    }
                }
            }
        }
    
        #endregion
    
        #region TRIE implementation
    
        public class TrieNode : IComparable<TrieNode>
        {
            private char m_char;
            public int m_word_count;
            private TrieNode m_parent = null;
            private Dictionary<char, TrieNode> m_children = null;
    
            public TrieNode(TrieNode parent, char c)
            {
                m_char = c;
                m_word_count = 0;
                m_parent = parent;
                m_children = new Dictionary<char, TrieNode>();            
            }
    
            public void CombineNode(TrieNode from)
            {
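                // Merge the counts and children of a trie built on another thread into this one.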
                foreach(KeyValuePair<char, TrieNode> fromChild in from.m_children)
                {
                    char keyChar = fromChild.Key;
                    if (!m_children.ContainsKey(keyChar))
                    {
                        m_children.Add(keyChar, new TrieNode(this, keyChar));
                    }
                    m_children[keyChar].m_word_count += fromChild.Value.m_word_count;
                    m_children[keyChar].CombineNode(fromChild.Value);
                }
            }
    
            public void AddWord(string word, int index = 0)
            {
                if (index < word.Length)
                {
                    char key = word[index];
                    if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
                    {
                        if (!m_children.ContainsKey(key))
                        {
                            m_children.Add(key, new TrieNode(this, key));
                        }
                        m_children[key].AddWord(word, index + 1);
                    }
                    else
                    {
                        // not a letter! retry with next char
                        AddWord(word, index + 1);
                    }
                }
                else
                {
                    if (m_parent != null) // empty words should never be counted
                    {
                        m_word_count++;                        
                    }
                }
            }
    
            public int GetCount(string word, int index = 0)
            {
                if (index < word.Length)
                {
                    char key = word[index];
                    if (!m_children.ContainsKey(key))
                    {
                        return -1;
                    }
                    return m_children[key].GetCount(word, index + 1);
                }
                else
                {
                    return m_word_count;
                }
            }
    
            public void GetTopCounts(List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
            {
                if (m_word_count > 0)
                {
                    distinct_word_count++;
                    total_word_count += m_word_count;
                }
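                // most_counted is kept sorted ascending, so index 0 always holds the smallest count in the current top 10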
                if (m_word_count > most_counted[0].m_word_count)
                {
                    most_counted[0] = this;
                    most_counted.Sort();
                }
                foreach (char key in m_children.Keys)
                {
                    m_children[key].GetTopCounts(most_counted, ref distinct_word_count, ref total_word_count);
                }
            }
    
            public override string ToString()
            {
                return BuildString(new StringBuilder()).ToString();
            }
    
            private StringBuilder BuildString(StringBuilder builder)
            {
                if (m_parent == null)
                {
                    return builder;
                }
                else
                {
                    return m_parent.BuildString(builder).Append(m_char);
                }
            }
    
            public int CompareTo(TrieNode other)
            {
                return this.m_word_count.CompareTo(other.m_word_count);
            }
        }
    
        #endregion
    }
    
  • 2020-11-30 18:05

    Very interesting question. It relates more to logic analysis than to coding. With the assumption of the English language and valid sentences it becomes easier.

    You don't have to count all words, just the ones with a length less than or equal to the average word length of the given language (for English it is about 5.1). Therefore you will not have problems with memory.

    As for reading the file, you should choose a parallel mode, reading chunks (of a size of your choice) and aligning the chunk boundaries to white space by manipulating the file positions. If you decide to read chunks of 1MB, for example, all chunks except the first one should be a bit wider (+22 bytes on the left and +22 bytes on the right, where 22 is the length of the longest English word - if I'm right). For parallel processing you will need either a concurrent dictionary or local collections that you merge afterwards. A sketch of this chunked approach follows below.

    Keep in mind that normally you will end up with a top ten that is part of a valid stop-word list (this suggests another, reverse approach, which is also valid as long as the content of the file is ordinary).
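
    Here is a minimal C# sketch of the chunked, boundary-aligned parallel read. The file name is a placeholder, it assumes ASCII-ish text, and instead of widening each chunk by a fixed 22 bytes it lets every worker skip its first partial word and read one word past its end, which achieves the same alignment:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    class ChunkedCounter
    {
        // Count words in one byte range, aligning both ends to whitespace.
        static Dictionary<string, long> CountRange(string path, long start, long end)
        {
            var counts = new Dictionary<string, long>();
            using (var stream = new BufferedStream(File.OpenRead(path)))
            {
                stream.Position = start;
                int b;
                // If we start mid-word, skip forward to the next whitespace;
                // the previous chunk owns that straddling word.
                if (start > 0)
                    while ((b = stream.ReadByte()) != -1 && !char.IsWhiteSpace((char)b)) { }

                var word = new StringBuilder();
                while ((b = stream.ReadByte()) != -1)
                {
                    if (char.IsWhiteSpace((char)b))
                    {
                        Tally(counts, word);
                        if (stream.Position > end) break; // we have passed the end of our range
                    }
                    else
                    {
                        word.Append(char.ToLowerInvariant((char)b));
                    }
                }
                Tally(counts, word); // flush the last word at end of file
            }
            return counts;
        }

        static void Tally(Dictionary<string, long> counts, StringBuilder word)
        {
            if (word.Length == 0) return;
            string w = word.ToString();
            long n;
            counts.TryGetValue(w, out n);
            counts[w] = n + 1;
            word.Clear();
        }

        static void Main()
        {
            string path = "input.txt";                 // placeholder path
            long length = new FileInfo(path).Length;
            int chunks = Environment.ProcessorCount;
            long chunkSize = length / chunks + 1;

            var tasks = Enumerable.Range(0, chunks)
                .Select(i => Task.Run(() => CountRange(path, i * chunkSize,
                                                       Math.Min(length, (i + 1) * chunkSize))))
                .ToArray();
            Task.WaitAll(tasks);

            // merge the per-chunk dictionaries and print the top 10
            var total = new Dictionary<string, long>();
            foreach (var local in tasks.Select(t => t.Result))
                foreach (var pair in local)
                {
                    long n;
                    total.TryGetValue(pair.Key, out n);
                    total[pair.Key] = n + pair.Value;
                }

            foreach (var pair in total.OrderByDescending(p => p.Value).Take(10))
                Console.WriteLine("{0} - {1} times", pair.Key, pair.Value);
        }
    }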
