How to break down a given text into words from the dictionary?

一整个雨季 2021-02-02 15:02

This is an interview question. Suppose you have a string text and a dictionary (a set of strings). How do you break down text into substri

4 Answers
  •  谎友^ (original poster)
    2021-02-02 15:29

    You can solve this problem using Dynamic Programming and Hashing.

    Calculate the hash of every word in the dictionary. Use the hash function you like the most. I would use something like (a1 * B ^ (n - 1) + a2 * B ^ (n - 2) + ... + an * B ^ 0) % P, where a1a2...an is a string, n is the length of the string, B is the base of the polynomial and P is a large prime number. If you have the hash value of a string a1a2...an you can calculate the hash value of the string a1a2...ana(n+1) in constant time: (hashValue(a1a2...an) * B + a(n+1)) % P.

    The complexity of this part is O(N * M), where N is the number of words in the dictionary and M is the length of the longest word in the dictionary.
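
As a minimal sketch of the hash described above: the names `hash_string`, `hash_extend`, `HASH_BASE` and `HASH_MOD` are made up for this illustration, and the values 131 and 1000000007 are just example choices for B and P, not something the answer prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Example constants, chosen for illustration only. */
#define HASH_BASE 131ULL        /* B: base of the polynomial */
#define HASH_MOD  1000000007ULL /* P: a large prime */

/* Extend the hash of a1..an to the hash of a1..an a(n+1) in O(1). */
uint64_t hash_extend(uint64_t prefix_hash, unsigned char next)
{
    /* prefix_hash < HASH_MOD, so the product fits easily in 64 bits. */
    return (prefix_hash * HASH_BASE + next) % HASH_MOD;
}

/* Polynomial hash of the n characters s[0..n-1]. */
uint64_t hash_string(const char *s, size_t n)
{
    assert(s != NULL);
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++) {
        h = hash_extend(h, (unsigned char)s[i]);
    }
    return h;
}
```

Hashing every dictionary word this way touches each character once, which is where the O(N * M) bound comes from.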

    Then, use a DP function like this:

       #include <stdbool.h>

       #define LENGTH_OF_STRING 1000   // assumed upper bound on the length of str
       #define B 131ULL                // example base of the polynomial
       #define P 1000000007ULL         // example large prime

       // Assumed helper, not shown here: returns true if hash is the hash
       // of some word in the dictionary.
       bool in_dict(unsigned long long hash);

       bool vis[LENGTH_OF_STRING];

       bool go(const char str[], int length, int position)
       {
          int i;
          unsigned long long hashValue = 0;
    
          // You found a set of words that solves your task.
          if (position == length) {
              return true;
          }
    
          // You have already visited this position. It failed before, so it will fail again.
          if (vis[position]) {
             return false;
          }
          // Mark this position as visited.
          vis[position] = true;
    
          // A possible improvement is to stop this loop when the length of substring(position, i) is greater than the length of the longest word in the dictionary.
          for (i = position; i < length; i++) {
             // Extend the hash of str(position, i - 1) to str(position, i) in constant time.
             hashValue = (hashValue * B + (unsigned char)str[i]) % P;
             if (in_dict(hashValue)) {
                // Check whether the substring str(i + 1, length) can be partitioned into words from the dictionary.
                if (go(str, length, i + 1)) {
                   // Use the corresponding word for hashValue in the given position and return true because you found a partition for the substring str(position, length).
                   return true;
                }
             }
          }
    
          return false;
       }
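
To show how the pieces could fit together, here is a self-contained sketch of the whole approach. Everything beyond the `go`/`vis` idea is an assumption made for illustration: the helpers `dict_build`, `in_dict` and `word_break`, the `next_cut` array used to recover the chosen words, the fixed limits, and the example values for B and P. It stores the dictionary hashes in a sorted array and looks them up with `bsearch`, and it includes the "stop at the longest word" improvement. Note that a hash match can in principle be a false positive, so a careful implementation would confirm the match with a string comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_BASE 131ULL        /* example value for B */
#define HASH_MOD  1000000007ULL /* example value for P */
#define MAX_WORDS 64            /* illustrative limits */
#define MAX_TEXT  256

static uint64_t dict_hashes[MAX_WORDS];
static size_t dict_size, max_word_len;
static bool vis[MAX_TEXT];
static int next_cut[MAX_TEXT]; /* end (exclusive) of the word starting at p */

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hash every dictionary word and sort the hashes for binary search. */
void dict_build(const char *const words[], size_t count)
{
    assert(count <= MAX_WORDS);
    dict_size = count;
    max_word_len = 0;
    for (size_t w = 0; w < count; w++) {
        size_t len = strlen(words[w]);
        uint64_t h = 0;
        for (size_t i = 0; i < len; i++)
            h = (h * HASH_BASE + (unsigned char)words[w][i]) % HASH_MOD;
        dict_hashes[w] = h;
        if (len > max_word_len) max_word_len = len;
    }
    qsort(dict_hashes, dict_size, sizeof dict_hashes[0], cmp_u64);
}

static bool in_dict(uint64_t h)
{
    return bsearch(&h, dict_hashes, dict_size,
                   sizeof dict_hashes[0], cmp_u64) != NULL;
}

/* Memoized search: can str[position..length-1] be split into words? */
static bool go(const char *str, int length, int position)
{
    if (position == length) return true;
    if (vis[position]) return false; /* failed here before */
    vis[position] = true;
    uint64_t h = 0;
    /* Never look at substrings longer than the longest dictionary word. */
    for (int i = position;
         i < length && (size_t)(i - position) < max_word_len; i++) {
        h = (h * HASH_BASE + (unsigned char)str[i]) % HASH_MOD;
        if (in_dict(h) && go(str, length, i + 1)) {
            next_cut[position] = i + 1; /* remember where this word ends */
            return true;
        }
    }
    return false;
}

/* Writes the words separated by spaces into out; false if no split exists. */
bool word_break(const char *text, char *out, size_t out_size)
{
    int length = (int)strlen(text);
    assert(length < MAX_TEXT && out_size >= 2 * (size_t)length + 1);
    memset(vis, 0, sizeof vis);
    if (!go(text, length, 0)) return false;
    size_t k = 0;
    for (int p = 0; p < length; p = next_cut[p]) {
        if (p > 0) out[k++] = ' ';
        memcpy(out + k, text + p, (size_t)(next_cut[p] - p));
        k += (size_t)(next_cut[p] - p);
    }
    out[k] = '\0';
    return true;
}
```

With the dictionary {"cat", "cats", "and", "sand", "dog"}, calling `word_break("catsanddog", out, sizeof out)` fills `out` with "cat sand dog", while "catsandog" has no valid split and the call returns false.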
    

    The complexity of this algorithm is O(N * M), where N is the length of the string and M is the length of the longest word in the dictionary, or O(N ^ 2), depending on whether you implemented the improvement or not.

    So the total complexity of the algorithm will be O(N1 * M) + O(N2 * M) (or O(N1 * M) + O(N2 ^ 2) without the improvement), where N1 is the number of words in the dictionary, M is the length of the longest word in the dictionary and N2 is the length of the string.

    If you can't think of a good hash function (one without collisions), another possible solution is to use a trie or a Patricia trie (if a plain trie would be too large) (I couldn't post links for these topics because my reputation is not high enough to post more than 2 links). But if you use this, the complexity of your algorithm will be O(N * M) * O(time needed to find a word in the trie), where N is the length of the string and M is the length of the longest word in the dictionary.
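
A rough sketch of the trie alternative, under assumptions that are mine rather than the answer's: words and text contain only lowercase ASCII letters, and the names `TrieNode`, `trie_new`, `trie_insert`, `trie_free` and `can_break` are invented for this example. It uses a bottom-up version of the same DP and walks the trie one character at a time from each position, so in this particular variant each extension of the substring costs a single child lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 26 /* assumption: lowercase ASCII letters only */

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    bool is_word; /* true if a dictionary word ends at this node */
} TrieNode;

TrieNode *trie_new(void)
{
    /* calloc zero-fills the node; this sketch relies on that giving
       NULL children, which holds on common platforms. */
    TrieNode *node = calloc(1, sizeof *node);
    assert(node != NULL);
    return node;
}

void trie_insert(TrieNode *root, const char *word)
{
    for (; *word; word++) {
        int c = *word - 'a';
        assert(c >= 0 && c < ALPHABET);
        if (!root->child[c]) root->child[c] = trie_new();
        root = root->child[c];
    }
    root->is_word = true;
}

void trie_free(TrieNode *node)
{
    if (!node) return;
    for (int c = 0; c < ALPHABET; c++) trie_free(node->child[c]);
    free(node);
}

/* Returns true if text can be split into words stored in the trie.
   can[p] means "text[p..] can be split"; it is filled from the end. */
bool can_break(const TrieNode *root, const char *text)
{
    size_t n = strlen(text);
    bool *can = calloc(n + 1, sizeof *can);
    assert(can != NULL);
    can[n] = true; /* the empty suffix is trivially splittable */
    for (size_t p = n; p-- > 0;) {
        const TrieNode *node = root;
        for (size_t i = p; i < n && !can[p]; i++) {
            int c = text[i] - 'a';
            if (c < 0 || c >= ALPHABET) break; /* outside the alphabet */
            node = node->child[c];
            if (!node) break; /* no dictionary word has this prefix */
            if (node->is_word && can[i + 1]) can[p] = true;
        }
    }
    bool result = can[0];
    free(can);
    return result;
}
```

The inner loop stops as soon as the trie has no matching child, so it never scans further than the longest dictionary word, and there is no hash collision to worry about.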

    I hope it helps, and I apologize for my poor English.
