How to break down a given text into words from the dictionary?


一整个雨季 2021-02-02 15:02

This is an interview question. Suppose you have a string text and a dictionary (a set of strings). How do you break down text into substri

4 answers

谎友^ (original poster)

2021-02-02 15:29
You can solve this problem using Dynamic Programming and Hashing.

Calculate the hash of every word in the dictionary, using whichever hash function you prefer. I would use a polynomial hash such as (a1 * B^(n-1) + a2 * B^(n-2) + ... + an * B^0) % P, where a1a2...an is the string, n is its length, B is the base of the polynomial, and P is a large prime number. Given the hash value of a string a1a2...an, you can calculate the hash value of the string a1a2...ana(n+1) in constant time: (hashValue(a1a2...an) * B + a(n+1)) % P.

The complexity of this part is O(N * M), where N is the number of words in the dictionary and M is the length of the longest word in the dictionary.
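The hashing step described above could be sketched roughly as follows. The constants `HASH_BASE` and `HASH_MOD` and the function names `hash_extend` and `hash_word` are illustrative assumptions, not part of the original answer; any base larger than the alphabet and any large prime should behave similarly.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed constants: B (base of the polynomial) and P (a large prime). */
#define HASH_BASE 257ULL
#define HASH_MOD  1000000007ULL

/* Extends the hash of a1..an to the hash of a1..an a(n+1) in constant time. */
uint64_t hash_extend(uint64_t hash, unsigned char next)
{
    return (hash * HASH_BASE + next) % HASH_MOD;
}

/* Hash of a whole NUL-terminated word, computed in O(length). */
uint64_t hash_word(const char *word)
{
    uint64_t hash = 0;
    for (size_t i = 0; word[i] != '\0'; i++) {
        hash = hash_extend(hash, (unsigned char)word[i]);
    }
    return hash;
}
```

Calling `hash_word` once per dictionary word gives the O(N * M) preprocessing cost, and `hash_extend` is the constant-time update that the DP can reuse while it scans the text.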

Then, use a DP function like this:
```
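   #include <stdbool.h>

   #define LENGTH_OF_STRING 1000   // Upper bound on the text length (an assumption).
   #define B 257ULL                // Base of the polynomial hash (an assumption).
   #define P 1000000007ULL         // Large prime modulus (an assumption).

   // Hypothetical helper, defined elsewhere: true if some dictionary word has this hash.
   bool dict_contains(unsigned long long hashValue);

   bool vis[LENGTH_OF_STRING];

   bool go(const char str[], int length, int position)
   {
      int i;
      unsigned long long hashValue = 0;

      // You found a set of words that can solve your task.
      if (position == length) {
          return true;
      }

      // You have already visited this position. You had no luck before, so you won't have luck this time either.
      if (vis[position]) {
         return false;
      }
      // Mark this position as visited.
      vis[position] = true;

      // A possible improvement is to stop this loop when the length of substring(position, i) is greater than the length of the longest word in the dictionary.
      for (i = position; i < length; i++) {
         // Extend the hash to cover the substring str(position, i) in constant time.
         hashValue = (hashValue * B + (unsigned char)str[i]) % P;
         if (dict_contains(hashValue)) {
            // Check whether the substring str(i + 1, length) can be partitioned into words in the dictionary.
            if (go(str, length, i + 1)) {
               // Use the corresponding word for hashValue in the given position and return true because you found a partition for the substring str(position, length).
               return true;
            }
         }
      }

      return false;
   }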
```
The complexity of this algorithm is O(N * M) if you coded the improvement, where N is the length of the string and M is the length of the longest word in the dictionary, or O(N ^ 2) if you did not.
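For illustration, here is a minimal self-contained sketch of the bounded version. It is a bottom-up variant of the same DP rather than the memoized recursion, and it keeps the dictionary hashes in a sorted array searched with `bsearch`. The name `can_segment`, the constants, and the sorted-array lookup are all assumptions made for this sketch. Like the approach above, it trusts the hash and does not guard against collisions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_BASE 257ULL        /* Assumed polynomial base. */
#define HASH_MOD  1000000007ULL /* Assumed large prime. */

/* Three-way comparison of two uint64_t values for qsort/bsearch. */
int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Returns true if text can be split into words from the dictionary. */
bool can_segment(const char *text, const char *const *words, size_t word_count)
{
    size_t n = strlen(text), max_len = 0;
    uint64_t *hashes = malloc((word_count ? word_count : 1) * sizeof *hashes);
    bool *ok = calloc(n + 1, sizeof *ok); /* ok[p]: text[p..n) can be split */
    bool result = false;

    if (hashes != NULL && ok != NULL) {
        /* Hash every dictionary word and remember the longest length. */
        for (size_t w = 0; w < word_count; w++) {
            size_t len = strlen(words[w]);
            uint64_t h = 0;
            for (size_t i = 0; i < len; i++)
                h = (h * HASH_BASE + (unsigned char)words[w][i]) % HASH_MOD;
            hashes[w] = h;
            if (len > max_len) max_len = len;
        }
        qsort(hashes, word_count, sizeof *hashes, cmp_u64);

        ok[n] = true; /* The empty suffix is trivially splittable. */
        for (size_t pos = n; pos-- > 0;) {
            uint64_t h = 0;
            /* The improvement: never look further than the longest word. */
            for (size_t i = pos; i < n && i - pos < max_len && !ok[pos]; i++) {
                h = (h * HASH_BASE + (unsigned char)text[i]) % HASH_MOD;
                if (ok[i + 1] &&
                    bsearch(&h, hashes, word_count, sizeof *hashes, cmp_u64))
                    ok[pos] = true;
            }
        }
        result = ok[0];
    }
    free(hashes);
    free(ok);
    return result;
}
```

The binary search adds a logarithmic factor per lookup; a real hash table would remove it, and comparing the matched substring against the actual word would remove the risk of false positives from collisions.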

So the total complexity of the algorithm will be O(N1 * M) + O(N2 * M) (or O(N1 * M) + O(N2 ^ 2) without the improvement), where N1 is the number of words in the dictionary, M is the length of the longest word in the dictionary, and N2 is the length of the string.

If you can't think of a nice hash function (one with no collisions), another possible solution is to use a trie, or a Patricia trie if a normal trie would be too large (I couldn't post links for these topics because my reputation is not high enough to post more than 2 links). But if you use this, the complexity of your algorithm will be O(N * M) * O(time needed to find a word in the trie), where N is the length of the string and M is the length of the longest word in the dictionary.
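A rough sketch of the trie alternative follows, assuming the dictionary and the text contain only lowercase ASCII letters. The names `TrieNode`, `trie_new`, `trie_insert`, and `trie_can_segment` are hypothetical, and freeing the trie is omitted for brevity. Because this sketch walks the trie one character at a time from each start position instead of looking up every substring from scratch, each position costs at most M steps.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 26 /* Assumption: lowercase ASCII letters only. */

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    bool is_word; /* True if a dictionary word ends at this node. */
} TrieNode;

/* Allocates an empty node (all children NULL, is_word false). */
TrieNode *trie_new(void)
{
    return calloc(1, sizeof(TrieNode));
}

/* Inserts a lowercase word, creating nodes along the path as needed. */
void trie_insert(TrieNode *root, const char *word)
{
    for (; *word != '\0'; word++) {
        int c = *word - 'a';
        if (root->child[c] == NULL)
            root->child[c] = trie_new();
        root = root->child[c];
    }
    root->is_word = true;
}

/* Returns true if text can be split into words stored in the trie. */
bool trie_can_segment(const TrieNode *root, const char *text)
{
    size_t n = strlen(text);
    bool *ok = calloc(n + 1, sizeof *ok); /* ok[p]: text[p..n) can be split */
    if (ok == NULL)
        return false;

    ok[n] = true; /* The empty suffix is trivially splittable. */
    for (size_t pos = n; pos-- > 0;) {
        const TrieNode *node = root;
        /* Walk the trie along text[pos..]; stop when no word continues. */
        for (size_t i = pos; i < n && !ok[pos]; i++) {
            int c = text[i] - 'a';
            if (c < 0 || c >= ALPHABET)
                break;
            node = node->child[c];
            if (node == NULL)
                break;
            if (node->is_word && ok[i + 1])
                ok[pos] = true;
        }
    }
    bool result = ok[0];
    free(ok);
    return result;
}
```

The walk ends as soon as no dictionary word shares the current prefix, so the loop is bounded by the longest word without tracking that length explicitly, and there are no collisions to worry about.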

I hope it helps, and I apologize for my poor English.