How to break down a given text into words from the dictionary?

一整个雨季 2021-02-02 15:02

This is an interview question. Suppose you have a string text and a dictionary (a set of strings). How do you break down text into substri

4 answers
  • 2021-02-02 15:20

    Approach 1 - A trie looks like a close fit here. Build a trie from the words in the English dictionary; this is a one-time cost. Once the trie is built, the string can be compared against it letter by letter. Whenever you reach a node that marks the end of a word (not necessarily a leaf, since one word can be a prefix of another), you have found a word: add it to a list and continue the traversal until you reach the end of the string. The list is the output.

    Time Complexity for search - O(word_length).

    Space Complexity - O(charsize * word_length * no_words). Size of your dictionary.
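
    A minimal sketch of the trie lookup described in Approach 1, assuming a nested-dict trie and an `END` marker key (both are illustrative choices, not part of the original answer). It only reports which dictionary words begin at a given position; choosing among them still needs backtracking or memoization.

```python
END = "$"  # assumed end-of-word marker; any key that is not a letter works


def build_trie(words):
    """Build a nested-dict trie; a node containing END marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def words_starting_at(text, start, root):
    """Return every dictionary word that begins at text[start]."""
    found = []
    node = root
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no dictionary word continues with this letter
        if END in node:
            found.append(text[start:i + 1])
    return found


trie = build_trie(["this", "is", "a", "text", "ate"])
print(words_starting_at("thisisatext", 6, trie))  # ['a', 'ate']
```

    Note that "a" is found at an inner node of the trie, which is why the word check cannot rely on leaves alone.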

    Approach 2 - I have heard of Suffix Trees, never used them but it might be useful here.

    Approach 3 - This is more pedantic and a lousy alternative; you have already suggested it.

    You could try the other way around: run through the dict and check each word for a substring match. Here I am assuming the keys in dict are the words of the English dictionary /usr/share/dict/words. So the pseudocode looks something like this -

    def split_into_words(text, d):
        # Collect every dictionary word that occurs somewhere in text.
        words = []
        for word in d:
            if word in text:
                words.append(word)
        return words
    

    Complexity - O(n) iterations over the entire dict, each with one substring match. The match itself is not O(1): a naive search costs up to O(len(str) * len(word)).

    Space - worst case O(n) if len(words) == len(dict)

    As others have pointed out, this does require backtracking.

  • 2021-02-02 15:21

    There is a very thorough writeup for the solution to this problem in this blog post.

    The basic idea is just to memoize the function you've written and you'll have an O(n^2) time, O(n) space algorithm.
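
    The function from the question is not reproduced here, so the following is only a sketch of what a memoized version might look like. The names `word_break` and `solve` are hypothetical, and `functools.lru_cache` stands in for a hand-written memo table.

```python
from functools import lru_cache


def word_break(text, dictionary):
    """Return one list of dictionary words that concatenate to text, or None."""
    words = frozenset(dictionary)

    @lru_cache(maxsize=None)
    def solve(start):
        # Segment the suffix text[start:]; each start index is solved once.
        if start == len(text):
            return ()
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (text[start:end],) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


print(word_break("thisisatext", {"this", "is", "a", "text", "ate"}))
# ['this', 'is', 'a', 'text']
```

    There are O(n) distinct start positions and at most n candidate end positions for each, which gives the O(n^2) bound if a dictionary lookup is counted as constant time. Caching whole tuples is a convenience of this sketch; storing only the next cut position per start index would keep the memo at O(n). The recursion can also hit Python's recursion limit on very long inputs.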

  • 2021-02-02 15:28

    This solution assumes the existence of Trie data structure for the dictionary. Further, for each node in Trie, assumes the following functions:

    1. node.IsWord() : Returns true if the path to that node is a word
    2. node.IsChild(char x): Returns true if there exists a child with label x
    3. node.GetChild(char x): Returns the child node with label x

    def annotate(str, start, end, root, node):
        # Walk the trie from `start`; whenever a word ends at index i,
        # record that position i+1 is reachable from `start`.
        i = start
        while i <= end:
            if node.IsChild(str[i]):
                node = node.GetChild(str[i])
                if node.IsWord():
                    root[i + 1] = start
                i += 1
            else:
                break

    end = len(str) - 1
    root = [-1 for i in range(len(str) + 1)]
    for start in range(end + 1):
        if start == 0 or root[start] >= 0:
            annotate(str, start, end, root, trieRoot)
    
    index  0  1  2  3  4  5  6  7  8  9  10  11
    str:   t  h  i  s  i  s  a  t  e  x  t
    root: -1 -1 -1 -1  0 -1  4  6 -1  6 -1   7
    

    I will leave it to you to list the words that make up the string by traversing root in reverse.

    Time complexity is O(nk) where n is the length of the string and k is the length of the longest word in the dictionary.

    PS: I am assuming following words in the dictionary: this, is, a, text, ate.
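
    One possible way to fill in the part left as an exercise, as a self-contained sketch. The `TrieNode` class, `build_trie`, and `split_words` are assumptions made for illustration; only the `IsWord` / `IsChild` / `GetChild` interface and the `root` array come from the answer.

```python
class TrieNode:
    """Minimal trie node exposing the three operations the answer assumes."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def IsWord(self):
        return self.is_word

    def IsChild(self, ch):
        return ch in self.children

    def GetChild(self, ch):
        return self.children[ch]


def build_trie(words):
    trie_root = TrieNode()
    for word in words:
        node = trie_root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return trie_root


def annotate(text, start, root, node):
    # Record `start` as the predecessor of every position where a word ends.
    for i in range(start, len(text)):
        if not node.IsChild(text[i]):
            break
        node = node.GetChild(text[i])
        if node.IsWord():
            root[i + 1] = start


def split_words(text, trie_root):
    root = [-1] * (len(text) + 1)
    for start in range(len(text)):
        if start == 0 or root[start] >= 0:
            annotate(text, start, root, trie_root)
    if text and root[len(text)] < 0:
        return None  # the end of the string is unreachable
    # Follow predecessor links back from the end of the string.
    words, pos = [], len(text)
    while pos > 0:
        words.append(text[root[pos]:pos])
        pos = root[pos]
    return words[::-1]


trie_root = build_trie(["this", "is", "a", "text", "ate"])
print(split_words("thisisatext", trie_root))  # ['this', 'is', 'a', 'text']
```

    The backward walk always terminates because every stored predecessor is strictly smaller than its position and is itself either 0 or a reachable position.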

  • 2021-02-02 15:29

    You can solve this problem using Dynamic Programming and Hashing.

    Calculate the hash of every word in the dictionary. Use the hash function you like the most. I would use something like (a1 * B ^ (n - 1) + a2 * B ^ (n - 2) + ... + an * B ^ 0) % P, where a1a2...an is a string, n is the length of the string, B is the base of the polynomial and P is a large prime number. If you have the hash value of a string a1a2...an you can calculate the hash value of the string a1a2...ana(n+1) in constant time: (hashValue(a1a2...an) * B + a(n+1)) % P.

    The complexity of this part is O(N * M), where N is the number of words in the dictionary and M is the length of the longest word in the dictionary.
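
    A rough sketch of the hashing step, assuming B = 257 and P = 1,000,000,007 (arbitrary illustrative choices) and character codes from `ord`. The helper names are hypothetical. Because distinct strings can collide, this sketch keeps the words next to each hash and confirms every hit against the actual string.

```python
B = 257            # assumed base, chosen for illustration
P = 1_000_000_007  # assumed large prime modulus


def extend_hash(h, ch):
    """Hash of s + ch given h = hash of s, in constant time."""
    return (h * B + ord(ch)) % P


def hash_word(word):
    h = 0
    for ch in word:
        h = extend_hash(h, ch)
    return h


def build_hash_table(words):
    # Map hash -> set of words so that a hash hit can be verified.
    table = {}
    for word in words:
        table.setdefault(hash_word(word), set()).add(word)
    return table


def words_from(text, start, table):
    """Dictionary words beginning at text[start], found by extending the hash."""
    found, h = [], 0
    for i in range(start, len(text)):
        h = extend_hash(h, text[i])
        candidate = text[start:i + 1]
        if candidate in table.get(h, ()):
            found.append(candidate)
    return found


table = build_hash_table(["this", "is", "a", "text", "ate"])
print(words_from("thisisatext", 6, table))  # ['a', 'ate']
```

    Applying `extend_hash` character by character evaluates the polynomial by Horner's rule, so `hash_word` matches the formula above term for term.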

    Then, use a DP function like this:

       bool vis[LENGTH_OF_STRING];

       // Assumed to be defined elsewhere: the constants B and P from the hash
       // above, and dict_contains(h), which tests whether h is the hash of a
       // dictionary word.
       bool go(char str[], int length, int position)
       {
          int i;
          long long hashValue = 0;
    
          // You found a set of words that can solve your task.
          if (position == length) {
              return true;
          }
    
          // You have already visited this position. You had no luck before, so you won't have luck this time either.
          if (vis[position]) {
             return false;
          }
          // Mark this position as visited.
          vis[position] = true;
    
          // A possible improvement is to stop this loop when the length of substring(position, i) is greater than the length of the longest word in the dictionary.
          for (i = position; i < length; i++) {
             // Extend the hash so that it covers the substring str(position, i).
             hashValue = (hashValue * B + str[i]) % P;
             if (dict_contains(hashValue)) {
                // Check whether the substring str(i + 1, length) can be partitioned into words from the dictionary.
                if (go(str, length, i + 1)) {
                   // Use the corresponding word for hashValue in the given position and return true because you found a partition for the substring str(position, length).
                   return true;
                }
             }
          }
    
          return false;
       }
    

    The complexity of this algorithm is O(N * M) if you implemented the improvement, where N is the length of the string and M is the length of the longest word in the dictionary, or O(N ^ 2) otherwise.

    So the total complexity of the algorithm will be O(N1 * M) + O(N2 * M) (or O(N2 ^ 2)), where N1 is the number of words in the dictionary, M is the length of the longest word in the dictionary, and N2 is the length of the string.

    If you can't think of a nice hash function (one without any collisions), another possible solution is to use tries or a Patricia trie (if the size of the normal trie is very large) (I couldn't post links for these topics because my reputation is not high enough to post more than 2 links). But if you use this, the complexity of your algorithm will be O(N * M) * O(time needed to find a word in the trie), where N is the length of the string and M is the length of the longest word in the dictionary.

    I hope it helps, and I apologize for my poor English.
