Find the longest word given a collection

既然无缘 2021-01-30 14:03

It is a Google interview question, and most answers I find online use a HashMap or a similar data structure. I am trying to find a solution using a Trie, if possible. Anybody could gi

9 Answers
  • 2021-01-30 14:34

    Disclaimer: this is not a trie solution, but I still think it's an idea worth exploring.

    Create some sort of hash function that only accounts for letters in a word and not their order (no collisions should be possible except in the case of permutations). For example, ABCD and DCBA both generate the same hash (but ABCDD does not). Generate such a hash table containing every word in the dictionary, using chaining to link collisions (on the other hand, unless you have a strict requirement to find "all" longest words and not just one, you can just drop collisions, which are just permutations, and forgo the whole chaining).

    Now, if your search set is 4 characters long, for example A, B, C, D, then as a naive search you check the following hashes to see if they are already contained in the dictionary:

    hash(A), hash(B), hash(C), hash(D) // 1-combinations
    hash(AB), hash(AC), hash(AD), hash(BC), hash(BD), hash(CD) // 2-combinations
    hash(ABC), hash(ABD), hash(ACD), hash(BCD) // 3-combinations
    hash(ABCD) // 4-combinations
    

    If you search the hashes in that order, the last match you find will be the longest one.

    This ends up having a run time which is dependent on the length of the search set rather than the length of the dictionary. If M is the number of characters in the search set, then the number of hash lookups is the sum M choose 1 + M choose 2 + M choose 3 + ... + M choose M which is also the size of the powerset of the search set, so it's O(2^M). At first glance this sounds really bad since it's exponential, but to put things in perspective, if your search set is size 10 there will only be around 1000 lookups, which is probably a lot smaller than your dictionary size in a practical real world scenario. At M = 15 we get 32000 lookups, and really, how many English words are there that are longer than 15 characters?
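    The lookup scheme above can be sketched in Java as follows. This is only an illustration under some assumptions: the "hash" is simply the word's letters in sorted order (so permutations share a key), collisions are dropped as the answer suggests, and the class and method names (`SubsetHashSearch`, `key`, `index`, `longest`) are made up for this example. Subsets are enumerated with a bitmask, so the letter set must stay small.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubsetHashSearch {

  /** Order-independent key (assumption of this sketch): the letters in sorted order. */
  static String key(String letters) {
    char[] cs = letters.toCharArray();
    Arrays.sort(cs);
    return new String(cs);
  }

  /** Maps each key to one dictionary word; permutations simply overwrite each other. */
  static Map<String, String> index(List<String> dictionary) {
    Map<String, String> byKey = new HashMap<>();
    for (String word : dictionary) {
      byKey.put(key(word), word);
    }
    return byKey;
  }

  /** Looks up every non-empty subset of the letters: about 2^M map lookups. */
  static String longest(Map<String, String> byKey, String letters) {
    char[] sorted = key(letters).toCharArray();
    String best = null;
    for (int mask = 1; mask < (1 << sorted.length); mask++) {
      if (best != null && Integer.bitCount(mask) <= best.length()) {
        continue; // a subset of this size cannot beat the current best
      }
      // Picking characters in index order keeps the subset sorted,
      // so it can be used directly as a key.
      StringBuilder subset = new StringBuilder();
      for (int i = 0; i < sorted.length; i++) {
        if ((mask & (1 << i)) != 0) {
          subset.append(sorted[i]);
        }
      }
      String hit = byKey.get(subset.toString());
      if (hit != null) {
        best = hit;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    Map<String, String> byKey = index(Arrays.asList(
        "abacus", "deltoid", "gaff", "giraffe", "microphone", "reef", "qar"));
    System.out.println(longest(byKey, "aeffgirq")); // prints "giraffe"
  }
}
```

    The dictionary is indexed once; each query then costs a number of lookups that depends only on the size of the letter set, not on the size of the dictionary.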

    There are two (alternate) ways I can think of to optimize it though:

    1) Search for longer matches first e.g. M-combinations then (M-1)-combinations, etc. As soon as you find a match, you can stop! Chances are you will only cover a small fraction of your search space, probably at worst half.

    2) Search for shorter matches first (1-combos, 2-combos, etc). Say you come up with a miss at level 2 (for example, no string in your dictionary is composed only of A and B). Use an auxiliary data structure (a bitmap perhaps) that allows you to check if any word in the dictionary is even partially composed of A and B (in contrast to your primary hash table which checks for complete composition). If you get a miss on the secondary bitmap also, then you know that you can skip all higher level combinations including A and B (i.e. you can skip hash(ABC), hash(ABD), and hash(ABCD) because no words contain both A and B). This leverages the Apriori principle and would drastically reduce the search space as M grows and misses become more frequent.

    EDIT: I realize that the details I abstract away relating to the "auxiliary data structure" are significant. As I think more about this idea, I realize it is leaning toward a complete dictionary scan as a subprocedure, which defeats the point of this entire approach. Still, it seems there should be a way to use the Apriori principle here.

  • 2021-01-30 14:35

    No Java code. You can figure that out for yourself.

    Assuming that we need to do this lots of times, here's what I'd do:

    • I'd start by creating "signatures" for each word in the dictionary consisting of 26 bits, where bit[letter] is set iff the word contains one (or more) instances of letter. These signatures can be encoded as a Java int.

    • Then create a mapping that maps signatures to lists of words with that signature.

    To do a search using the precomputed map:

    • Create the signature for the set of letters you want to find the words for.

    • Then iterate over the keys of the mapping looking for keys where ((key & ~signature) == 0). Note the parentheses: in Java, == binds more tightly than &. That gives you a short list of "possibles" that don't contain any letter that is not in the required letter set.

    • Iterate over the short list looking for words with the right number of each of the required letters, recording the longest hit.
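    A minimal Java sketch of the steps above, for illustration only. It assumes lowercase a-z words; the class and method names (`SignatureSearch`, `signature`, `counts`, `fits`) are invented for this example and are not part of the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SignatureSearch {

  /** 26-bit signature: bit (c - 'a') is set iff the string contains c. Assumes a-z only. */
  static int signature(String s) {
    int sig = 0;
    for (char c : s.toCharArray()) {
      sig |= 1 << (c - 'a');
    }
    return sig;
  }

  /** Number of occurrences of each letter a-z. */
  static int[] counts(String s) {
    int[] n = new int[26];
    for (char c : s.toCharArray()) {
      n[c - 'a']++;
    }
    return n;
  }

  /** Precomputed map from signature to the words that have that signature. */
  static Map<Integer, List<String>> index(List<String> dictionary) {
    Map<Integer, List<String>> bySig = new HashMap<>();
    for (String w : dictionary) {
      bySig.computeIfAbsent(signature(w), k -> new ArrayList<>()).add(w);
    }
    return bySig;
  }

  /** True if the word needs no more copies of any letter than are available. */
  static boolean fits(String word, int[] available) {
    int[] need = counts(word);
    for (int i = 0; i < 26; i++) {
      if (need[i] > available[i]) {
        return false;
      }
    }
    return true;
  }

  static String longest(Map<Integer, List<String>> bySig, String letters) {
    int sig = signature(letters);
    int[] available = counts(letters);
    String best = null;
    for (Map.Entry<Integer, List<String>> e : bySig.entrySet()) {
      if ((e.getKey() & ~sig) != 0) {
        continue; // some word letter is not in the letter set
      }
      for (String w : e.getValue()) {
        if ((best == null || w.length() > best.length()) && fits(w, available)) {
          best = w;
        }
      }
    }
    return best;
  }

  public static void main(String[] args) {
    Map<Integer, List<String>> bySig = index(Arrays.asList(
        "abacus", "deltoid", "gaff", "giraffe", "microphone", "reef", "qar"));
    System.out.println(longest(bySig, "aeffgirq")); // prints "giraffe"
  }
}
```

    In this example "reef" passes the cheap signature test but is rejected by the count check, because the letter set contains only one "e".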


    Notes:

    1. While the primary search is roughly O(N) on the number of words in the dictionary, the test is extremely cheap.

    2. This approach has the advantage of requiring a relatively small in-memory data structure, that (most likely) has good locality. That is likely to be conducive to faster searches.


    Here's an idea for speeding up the O(N) search step above.

    Starting with the signature map above, create (precompute) derivative maps for all words that do contain specific pairs of letters; i.e. one for words containing AB, for AC, BC, ... and for YZ. Then if you are looking for words containing (say) P and Q, you can just scan the PQ derivative map. That will reduce the O(N) step by roughly 26^2 ... at the cost of more memory for the extra maps.

    That can be extended to 3 or more letters, but the downside is the explosion in memory usage.

    Another potential tweak is to (somehow) bias the selection of the initial letter pair towards letters/pairs that occur infrequently. But that adds an up-front overhead which could be greater than the (average) saving you get from searching a shorter list.

  • 2021-01-30 14:36

    Assuming a large dictionary and a letter set with fewer than 10 or 11 members (such as the example given), the fastest method is to build a tree containing the possible words the letters can make, then match the word list against the tree. In other words, your letter tree's root has seven subnodes: { a, e, f, g, i, r, q }. The branch of "a" has six subnodes { e, f, g, i, r, q }, etc. The tree thus contains every possible word which can be made with these letters.

    Go through each word in the list and match it against the tree. If the match is of maximum length (uses all the letters), you are done. If the word is shorter than the maximum but longer than any previously matched word, remember it; this is the "longest word so far" (LWSF). Ignore any words whose length is less than or equal to that of the LWSF. Also, ignore any words which are longer than the letter list.

    This is a linear time algorithm once the letter tree is constructed, so as long as the word list is significantly larger than the letter tree, it is the fastest method.
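    A rough Java sketch of this idea, assuming lowercase a-z input. The names (`LetterTreeSearch`, `Node`, `build`, `matches`) are hypothetical. Note that the tree grows factorially with the number of letters, which is why the approach only makes sense for small letter sets.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LetterTreeSearch {

  static class Node {
    final Map<Character, Node> children = new HashMap<>();
  }

  /** Builds the tree of every sequence spellable with the remaining letter counts. */
  static Node build(int[] remaining) {
    Node node = new Node();
    for (int i = 0; i < 26; i++) {
      if (remaining[i] > 0) {
        remaining[i]--; // use one copy of this letter on the way down
        node.children.put((char) ('a' + i), build(remaining));
        remaining[i]++; // give it back for the sibling branches
      }
    }
    return node;
  }

  /** A word matches if its letters trace a path from the root. */
  static boolean matches(Node root, String word) {
    Node n = root;
    for (char c : word.toCharArray()) {
      n = n.children.get(c);
      if (n == null) {
        return false;
      }
    }
    return true;
  }

  static String longest(List<String> dictionary, String letters) {
    int[] counts = new int[26];
    for (char c : letters.toCharArray()) {
      counts[c - 'a']++;
    }
    Node root = build(counts);
    String best = null; // the "longest word so far" (LWSF)
    for (String w : dictionary) {
      if (w.length() > letters.length()) {
        continue; // longer than the letter list
      }
      if (best != null && w.length() <= best.length()) {
        continue; // cannot improve on the LWSF
      }
      if (matches(root, w)) {
        best = w;
        if (w.length() == letters.length()) {
          break; // uses all the letters, so nothing can be longer
        }
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> dictionary = Arrays.asList(
        "abacus", "deltoid", "gaff", "giraffe", "microphone", "reef", "qar");
    System.out.println(longest(dictionary, "aeffgirq")); // prints "giraffe"
  }
}
```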

  • 2021-01-30 14:37

    First off, nice question. The interviewer wants to see how you tackle the problem. In those kinds of problems you are required to analyse the problem and carefully choose a data structure.

    In this case, two data structures come to mind: HashMaps and Tries. HashMaps are not a good fit, because you don't have a complete key you want to look up (you can use an inverted index based on maps, but you said you already found those solutions). You only have the parts, and that is where the Trie is the best fit.

    So the idea with tries is that, while traversing the tree, you can ignore branches whose characters are not in your set of available letters.

    In your case, the tree looks like this (I left out the branching for non-branching paths):

    *
       a
         abacus
       d 
         deltoid
       g
         a
           gaff
         i
           giraffe
       m 
         microphone
       r 
         reef
       q 
         qar
    

    So at each level of this trie, we look at the children of the current node and check if the child's character is in our set of available letters.

    If yes: we go deeper into that subtree and remove the child's character from the available letters.

    This goes on until you hit a leaf (no more children); at that point you know that this word can be spelled with the available letters. This is a possible candidate. Now we go back up the tree until we find another match that we can compare. If the newly found match is shorter, discard it; if it is longer, it becomes our best candidate.

    Eventually the recursion will finish and you'll end up with the desired output.

    Note that this works if there is a single longest word, otherwise you would have to return a list of candidates (this is the unknown part of the interview where you are required to ask what the interviewer wants to see as a solution).

    Since you asked for Java code, here it is with a simplistic Trie and the single longest word version:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    
    public class LongestWord {
    
      class TrieNode {
        char value;
        List<TrieNode> children = new ArrayList<>();
        String word;
    
        public TrieNode() {
        }
    
        public TrieNode(char val) {
          this.value = val;
        }
    
        public void add(char[] array) {
          add(array, 0);
        }
    
        public void add(char[] array, int offset) {
          if (offset == array.length) {
            // the whole word is consumed: mark this node as the end of a word
            // (this also avoids reading past the array when a word is a prefix
            // of a word that was added earlier)
            word = new String(array);
            return;
          }
          for (TrieNode child : children) {
            if (child.value == array[offset]) {
              child.add(array, offset + 1);
              return;
            }
          }
          TrieNode trieNode = new TrieNode(array[offset]);
          children.add(trieNode);
          trieNode.add(array, offset + 1);
        }    
      }
    
      private TrieNode root = new TrieNode();
    
      public LongestWord() {
        List<String> asList = Arrays.asList("abacus", "deltoid", "gaff", "giraffe",
            "microphone", "reef", "qar");
        for (String word : asList) {
          root.add(word.toCharArray());
        }
      }
    
      public String search(char[] cs) {
        return visit(root, cs);
      }
    
      public String visit(TrieNode n, char[] allowedCharacters) {
        // any node that ends a word is a candidate, not only the leaves
        // (n.word is null for nodes that do not end a word)
        String bestMatch = n.word;
    
        for (TrieNode child : n.children) {
          if (contains(allowedCharacters, child.value)) {
            // remove this child's value and descend into the trie
            String result = visit(child, remove(allowedCharacters, child.value));
            // if the result wasn't null, check length and set
            if (bestMatch == null || result != null
                && bestMatch.length() < result.length()) {
              bestMatch = result;
            }
          }
        }
        // always return the best known match thus far
        return bestMatch;
      }
    
      private char[] remove(char[] allowedCharacters, char value) {
        char[] newDict = new char[allowedCharacters.length - 1];
        int index = 0;
        for (char x : allowedCharacters) {
          if (x != value) {
            newDict[index++] = x;
          } else {
            // we removed the first hit, now copy the rest
            break;
          }
        }
        System.arraycopy(allowedCharacters, index + 1, newDict, index,
            allowedCharacters.length - (index + 1));
    
        return newDict;
      }
    
      private boolean contains(char[] allowedCharacters, char value) {
        for (char x : allowedCharacters) {
          if (value == x) {
            return true;
          }
        }
        return false;
      }
    
      public static void main(String[] args) {
        LongestWord lw = new LongestWord();
        String longestWord = lw.search(new char[] { 'a', 'e', 'f', 'f', 'g', 'i',
            'r', 'q' });
        // yields giraffe
        System.out.println(longestWord);
      }
    
    }
    

    I can also recommend the book Cracking the Coding Interview: 150 Programming Questions and Solutions. It guides you through the decision-making and the construction of algorithms for these kinds of interview questions.

  • 2021-01-30 14:39

    I think the above answers missed the key point. We have a space with 27 dimensions: the first one is the length and the others are the coordinates for each letter. In that space we have points, which are words. The first coordinate of a word is its length. The other coordinates are, for each letter dimension, the number of occurrences of that letter in that word. For example the words abacus, deltoid, gaff, giraffe, microphone, reef, qar, abcdefghijklmnopqrstuvwxyz have coordinates

    [3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    [6, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
    [7, 0, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    [4, 1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    [7, 1, 0, 0, 0, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    [10, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    [4, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    [3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    [26, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
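    As an illustration, the coordinates can be computed with a few lines of Java. This sketch assumes lowercase a-z words; `WordCoordinates`, `coordinates` and `within` are names invented here. The `within` check is the per-letter condition a query has to express (no letter count above the query's); it is a plain loop, not an R-tree query.

```java
import java.util.Arrays;

class WordCoordinates {

  /** 27 coordinates: index 0 is the length, indices 1..26 are the counts of a..z. */
  static int[] coordinates(String word) {
    int[] x = new int[27];
    x[0] = word.length();
    for (char c : word.toCharArray()) {
      x[1 + (c - 'a')]++;
    }
    return x;
  }

  /** True if no letter count of the word (coordinates 1..26) exceeds the query's. */
  static boolean within(int[] word, int[] query) {
    for (int i = 1; i < 27; i++) {
      if (word[i] > query[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    int[] query = coordinates("aeffgirq");
    System.out.println(Arrays.toString(coordinates("abacus")));
    System.out.println(within(coordinates("giraffe"), query)); // true
    System.out.println(within(coordinates("reef"), query));    // false: needs two e
  }
}
```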
    

    A good structure for a set of points with coordinates is an R-tree or an R*-tree. Given your collection [x0, x1, ..., x26], you have to query for all the words that contain at most xi occurrences of each letter i. The search is in O(log N), where N is the number of words in your dictionary. However, you don't want to scan all the words that match your query to find the longest one. This is why the first dimension is important.

    You know that the length of the longest word is between 0 and X, where X=sum(x_i, i=1..26). You can search iteratively from X down to 1, but you can also binary search on the length, using the first dimension of your array as the query range. You start with the range from a=X to b=X/2. If there is at least one match, you search from a to (a+b)/2; otherwise you search from b to b-(a-b)/2=(3b-a)/2. You repeat this until a-b=1. You then have the greatest length and all the matches with this length.

    This algorithm is asymptotically much more efficient than the algorithms above. The time complexity is O(ln(N)×ln(X)). The implementation depends on the R-tree library you use.

  • 2021-01-30 14:52

    Groovy (almost Java):

    def letters = ['a', 'e', 'f', 'f', 'g', 'i', 'r', 'q']
    def dictionary = ['abacus', 'deltoid', 'gaff', 'giraffe', 'microphone', 'reef', 'qar']
    println dictionary
        .findAll{ it.toList().intersect(letters).size() == it.size() }
        .sort{ -it.size() }.head()
    

    The choice of collection type to hold the dictionary is irrelevant to the algorithm. If you're supposed to implement a trie, that's one thing. Otherwise, just create one from an appropriate library to hold the data. Neither Java nor Groovy has one in its standard library that I'm aware of.
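    For comparison, a rough Java take on the same filter-then-pick-longest idea using streams. It is not a line-by-line translation of the Groovy snippet: this sketch counts letters explicitly in a helper named `canSpell`, which is an assumption made here, and it returns an `Optional` instead of failing on an empty result.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class FilterAndSort {

  /** True if the word can be spelled using each available letter at most once. */
  static boolean canSpell(String word, String letters) {
    int[] available = new int[26];
    for (char c : letters.toCharArray()) {
      available[c - 'a']++;
    }
    for (char c : word.toCharArray()) {
      // reject characters outside a-z and letters that have run out
      if (c < 'a' || c > 'z' || --available[c - 'a'] < 0) {
        return false;
      }
    }
    return true;
  }

  /** Filters the dictionary down to spellable words and keeps a longest one. */
  static Optional<String> longest(List<String> dictionary, String letters) {
    return dictionary.stream()
        .filter(w -> canSpell(w, letters))
        .max(Comparator.comparingInt(String::length));
  }

  public static void main(String[] args) {
    List<String> dictionary = Arrays.asList(
        "abacus", "deltoid", "gaff", "giraffe", "microphone", "reef", "qar");
    System.out.println(longest(dictionary, "aeffgirq").orElse("(none)")); // prints "giraffe"
  }
}
```

    As the answer says, the dictionary here is just a list; the scan is linear in the number of words regardless of the collection type.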
