How to find smallest substring which contains all characters from a given string?


I recently came across an interesting question on strings. Suppose you are given the following:

Input string1: "this is a test string"
Input strin
15 answers

情书的邮戳, 2020-12-02 08:03

Java code for the approach discussed above:

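import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// The members below are assumed to live inside a single class.
private static Map<Character, Integer> frequency;
private static Set<Character> charsCovered;
private static Map<Character, Integer> encountered;
/**
 * To set the first match index as an initial start point
 */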
private static boolean hasStarted = false;
private static int currentStartIndex = 0;
private static int finalStartIndex = 0;
private static int finalEndIndex = 0;
private static int minLen = Integer.MAX_VALUE;
private static int currentLen = 0;
/**
 * Whether we have already found the match and now looking for other
 * alternatives.
 */
private static boolean isFound = false;
private static char currentChar;

public static String findSmallestSubStringWithAllChars(String big, String small) {

    if (null == big || null == small || big.isEmpty() || small.isEmpty()) {
        return null;
    }

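    // Reset the shared static state so the method can be called more than once.
    hasStarted = false;
    isFound = false;
    currentStartIndex = 0;
    finalStartIndex = 0;
    finalEndIndex = 0;
    minLen = Integer.MAX_VALUE;
    currentLen = 0;

    frequency = new HashMap<Character, Integer>();
    instantiateFrequencyMap(small);
    charsCovered = new HashSet<Character>();
    int charsToBeCovered = frequency.size();
    encountered = new HashMap<Character, Integer>();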

    for (int i = 0; i < big.length(); i++) {
        currentChar = big.charAt(i);
        if (frequency.containsKey(currentChar) && !isFound) {
            if (!hasStarted && !isFound) {
                hasStarted = true;
                currentStartIndex = i;
            }
            updateEncounteredMapAndCharsCoveredSet(currentChar);
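            if (charsCovered.size() == charsToBeCovered) {
                // Drop leading characters that are irrelevant or in surplus, so the
                // first window is already as short as it can be (e.g. "aab" vs "ab").
                while (currentStartIndex < i) {
                    char startChar = big.charAt(currentStartIndex);
                    Integer seen = encountered.get(startChar);
                    if (seen != null) {
                        if (seen <= frequency.get(startChar)) {
                            break;
                        }
                        encountered.put(startChar, seen - 1);
                    }
                    currentStartIndex++;
                }
                currentLen = i - currentStartIndex;
                isFound = true;
                updateMinLength(i);
            }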
        } else if (frequency.containsKey(currentChar) && isFound) {
            updateEncounteredMapAndCharsCoveredSet(currentChar);
            if (currentChar == big.charAt(currentStartIndex)) {
                encountered.put(currentChar, encountered.get(currentChar) - 1);
                currentStartIndex++;
                while (currentStartIndex < i) {
                    if (encountered.containsKey(big.charAt(currentStartIndex))
                            && encountered.get(big.charAt(currentStartIndex)) > frequency.get(big
                                    .charAt(currentStartIndex))) {
                        encountered.put(big.charAt(currentStartIndex),
                                encountered.get(big.charAt(currentStartIndex)) - 1);
                    } else if (encountered.containsKey(big.charAt(currentStartIndex))) {
                        break;
                    }
                    currentStartIndex++;
                }
            }
            currentLen = i - currentStartIndex;
            updateMinLength(i);
        }
    }
    if (!isFound) {
        // No window of big contains every character of small.
        return null;
    }
    System.out.println("start: " + finalStartIndex + " finalEnd : " + finalEndIndex);
    return big.substring(finalStartIndex, finalEndIndex + 1);
}

private static void updateMinLength(int index) {
    if (minLen > currentLen) {
        minLen = currentLen;
        finalStartIndex = currentStartIndex;
        finalEndIndex = index;
    }

}

private static void updateEncounteredMapAndCharsCoveredSet(Character currentChar) {
    if (encountered.containsKey(currentChar)) {
        encountered.put(currentChar, encountered.get(currentChar) + 1);
    } else {
        encountered.put(currentChar, 1);
    }

    if (encountered.get(currentChar) >= frequency.get(currentChar)) {
        charsCovered.add(currentChar);
    }
}

private static void instantiateFrequencyMap(String str) {

    for (char c : str.toCharArray()) {
        if (frequency.containsKey(c)) {
            frequency.put(c, frequency.get(c) + 1);
        } else {
            frequency.put(c, 1);
        }
    }

}

public static void main(String[] args) {

    String big = "this is a test string";
    String small = "tist";
    System.out.println("len: " + big.length());
    System.out.println(findSmallestSubStringWithAllChars(big, small));
}


灰色年华, 2020-12-02 08:04

You can do a histogram sweep in O(N+M) time and O(1) extra space (for a fixed-size alphabet), where N is the number of characters in the first string and M is the number of characters in the second.

It works like this:


1. Make a histogram of the second string's characters (key operation is hist2[ s2[i] ]++).
2. Make a cumulative histogram of the first string's characters until that histogram contains every character that the second string's histogram contains (which I will call "the histogram condition").
3. Then move the start of the window forwards on the first string, subtracting from the histogram, until it fails to meet the histogram condition.  Mark that bit of the first string (before the final move) as your tentative substring.
4. Move the leading edge of the window forwards again until you meet the histogram condition again.  Move the trailing edge forwards until it fails again.  If this is a shorter substring than the first, mark that as your tentative substring.
5. Repeat until you've passed through the entire first string.
6. The marked substring is your answer.


Note that by varying the check you use on the histogram condition, you can choose either to have the same set of characters as the second string, or at least as many characters of each type.  (It's just the difference between a[i]>0 && b[i]>0 and a[i]>=b[i].)

You can speed up the histogram checks if you keep track of which condition is not satisfied when you're trying to satisfy it, and check only the character you decrement when you're trying to break it.  (On the initial buildup, you count how many items you've satisfied, and increment that count every time you add a new character that takes the condition from false to true.)
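
The sweep described above could be sketched roughly as follows in Python. This is only an illustration under a few assumptions: the function name smallest_window, the at_least flag and the satisfied counter are made up here, and the "same set of characters" variant is read as "each character of the second string appears at least once".

```python
from collections import Counter


def smallest_window(s1, s2, at_least=True):
    """Return the shortest substring of s1 meeting the histogram condition, or None.

    at_least=True  -> window needs at least as many of each character as s2.
    at_least=False -> window only needs each distinct character of s2 once.
    """
    if not s1 or not s2:
        return None
    hist2 = Counter(s2) if at_least else Counter(set(s2))
    hist1 = Counter()      # histogram of the current window (required chars only)
    satisfied = 0          # distinct characters whose requirement is currently met
    best = None            # (start, end) of the best window, end exclusive
    start = 0
    for end, ch in enumerate(s1):
        if ch in hist2:
            hist1[ch] += 1
            if hist1[ch] == hist2[ch]:
                satisfied += 1          # condition for ch went from false to true
        # Shrink from the left while the histogram condition still holds.
        while satisfied == len(hist2):
            if best is None or end + 1 - start < best[1] - best[0]:
                best = (start, end + 1)
            left = s1[start]
            if left in hist2:
                if hist1[left] == hist2[left]:
                    satisfied -= 1      # removing left breaks the condition
                hist1[left] -= 1
            start += 1
    return None if best is None else s1[best[0]:best[1]]
```

Each character of the first string is added once and removed at most once, which is where the linear running time comes from; the satisfied counter is the speed-up mentioned in the last paragraph.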

慢半拍i, 2020-12-02 08:04

Here's an O(n) solution. The basic idea is simple: for each starting index, find the least ending index such that the substring contains all of the necessary letters. The trick is that the least ending index increases over the course of the function, so with a little data structure support, we consider each character at most twice.

In Python:

from collections import defaultdict

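def smallest(s1, s2):
    """Return the length of the shortest window of s1 that contains every
    character of s2 (with multiplicity), or len(s1) + 1 if there is none."""
    assert s2 != ''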
    d = defaultdict(int)
    nneg = [0]  # number of negative entries in d
    def incr(c):
        d[c] += 1
        if d[c] == 0:
            nneg[0] -= 1
    def decr(c):
        if d[c] == 0:
            nneg[0] += 1
        d[c] -= 1
    for c in s2:
        decr(c)
    minlen = len(s1) + 1
    j = 0
    for i in range(len(s1)):  # range replaces Python 2's xrange
        while nneg[0] > 0:
            if j >= len(s1):
                return minlen
            incr(s1[j])
            j += 1
        minlen = min(minlen, j - i)
        decr(s1[i])
    return minlen


别跟我提以往, 2020-12-02 08:04

This is an approach using prime numbers to avoid one loop, and replace it with multiplications. Several other minor optimizations can be made.


1. Assign a unique prime number to each character that you want to find, and 1 to the uninteresting characters.
2. Compute the product of a matching string by raising each character's prime to the number of occurrences it should have and multiplying the results. Because prime factorization is unique, a window has this product only if it uses exactly the same prime factors.
3. Search the string from the beginning, multiplying the respective prime number into a running product as you move.
4. If the running product is greater than the target product, remove the first character and divide its prime number out of the running product.
5. If the running product is less than the target product, include the next character and multiply its prime into the running product.
6. If the running product equals the target product, you have found a match; slide the beginning and end to the next character and continue searching for other matches.
7. Decide which of the matches is the shortest.


Gist

charcount = { 'a': 3, 'b' : 1 };
str = "kjhdfsbabasdadaaaaasdkaaajbajerhhayeom"

def find (c, s):
  Ns = len (s)

  C = list (c.keys ())
  D = list (c.values ())

  # prime numbers assigned to the 26 lowercase letters 'a'..'z'
  prmsi = [ 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

  # primes used in the key, all other set to 1
  prms = []
  Cord = [ord(c) - ord('a') for c in C]

  for e,p in enumerate(prmsi):
    if e in Cord:
      prms.append (p)
    else:
      prms.append (1)

  # Product of match
  T = 1
  for c,d in zip(C,D):
    p = prms[ord (c) - ord('a')]
    T *= p**d

  print ("T=", T)

  t = 1 # product of current string
  f = 0
  i = 0

  matches = []
  mi = 0
  mn = Ns
  mm = 0

  while i < Ns:
    k = prms[ord(s[i]) - ord ('a')]
    t *= k

    print ("testing:", s[f:i+1])

    if (t > T):
      # included too many chars: move start
      # (integer division keeps the running product exact in Python 3)
      t //= prms[ord(s[f]) - ord('a')] # remove first char, usually division by 1
      f += 1 # increment start position
      t //= k # will be retested, could be replaced with bool

    elif t == T:
      # found match
      print ("FOUND match:", s[f:i+1])
      matches.append (s[f:i+1])

      if (i - f) < mn:
        mm = mi
        mn = i - f

      mi += 1

      t //= prms[ord(s[f]) - ord('a')] # remove first matching char

      # look for next match
      i += 1
      f += 1

    else:
      # no match yet, keep searching
      i += 1

  return (mm, matches)


print (find (charcount, str))



  (note: this answer was originally posted to a duplicate question, the original answer is now deleted.)


感情败类, 2020-12-02 08:04

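def minimum_window(s, t, min_length=100000):
    # Count how many of each character the window must contain.
    need = {}
    for x in t:
        need[x] = need.get(x, 0) + 1

    missing = len(t)  # required characters (with multiplicity) not yet in the window
    start = 0
    for i, x in enumerate(s):
        if x in need:
            if need[x] > 0:
                missing -= 1
            need[x] -= 1  # a negative value means the window holds a surplus of x

        # Shrink from the left for as long as the window still covers t.
        while missing == 0 and start <= i:
            min_length = min(min_length, i - start + 1)
            left = s[start]
            if left in need:
                need[left] += 1
                if need[left] > 0:
                    missing += 1
            start += 1

    return min_length

l_s = "ADOBECODEBANC"
t_s = "ABC"

min_length = minimum_window(l_s, t_s)

if min_length == 100000:
    print("Not found")
else:
    print(min_length)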


感情败类, 2020-12-02 08:07

Edit: apparently there's an O(n) algorithm (cf. algorithmist's answer).  Obviously this will beat the [naive] baseline described below!


Too bad I gotta go... I'm a bit suspicious that we can get O(n).  I'll check in tomorrow to see the winner ;-)   Have fun!

Tentative algorithm:

The general idea is to sequentially try and use a character from str2 found in str1 as the start of a search (in either/both directions) of all the other letters of str2. By keeping a "length of best match so far" value, we can abort searches when they exceed this.  Other heuristics can probably be used to further abort suboptimal (so far) solutions.  The choice of the order of the starting letters in str1 matters much; it is suggested to start with the letter(s) of str1 which have the lowest count and to try with the other letters, of an increasing count, in subsequent attempts.

  [loose pseudo-code]
  - get count for each letter/character in str1  (number of As, Bs etc.)
  - get count for each letter in str2
  - minLen = length(str1) + 1  (the +1 indicates you're not sure all chars of 
                                str2 are in str1)
  - Starting with the letter from str2 which is found the least in str1,
    look for other letters of str2, in either direction of str1, until you've 
    found them all (or not, in which case response = impossible => done!). 
    set x = length(corresponding substring of str1).
  - if (x < minLen), 
          set minLen = x, 
          also memorize the start/len of the str1 substring.
  - continue trying with other letters of str1 (going up the frequency
    list in str1), but abort search as soon as length(substring of str1) 
    reaches or exceeds minLen.  
    We can find a few other heuristics that would allow aborting a 
    particular search, based on [pre-calculated ?] distance between a given
    letter in str1 and some (all?) of the letters in str2.
  - the overall search terminates when minLen = length(str2) or when 
    we've used all letters of str1 (which match one letter of str2)
    as a starting point for the search
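
One possible reading of the loose pseudo-code, sketched in Python. It is a simplification rather than the exact procedure above: it anchors only on occurrences of the required letter that is rarest in str1 (any valid window must contain one of them), tries every left edge at or before that occurrence, and aborts a search once the window can no longer beat the best one found. The names smallest_window_anchored and covers are invented for this sketch.

```python
from collections import Counter


def covers(window_counts, need):
    """True if the window holds at least the required count of every letter."""
    return all(window_counts[c] >= k for c, k in need.items())


def smallest_window_anchored(str1, str2):
    """Naive baseline: shortest substring of str1 covering str2, or None."""
    need = Counter(str2)
    have = Counter(str1)
    # Impossible if str2 is empty or str1 lacks enough of some letter.
    if not str2 or any(have[c] < k for c, k in need.items()):
        return None
    # Required letter that occurs least often in str1.
    rare = min(need, key=lambda c: have[c])
    best = None
    for p, ch in enumerate(str1):
        if ch != rare:
            continue
        # Try left edges at or before the anchor, nearest first.
        for left in range(p, -1, -1):
            if best is not None and p - left + 1 >= len(best):
                break  # any window from here on is at least as long as best
            counts = Counter(str1[left:p + 1])
            right = p
            while not covers(counts, need):
                right += 1
                if right == len(str1):
                    break  # ran off the end without covering str2
                if best is not None and right - left + 1 >= len(best):
                    break  # cannot improve on best
                counts[str1[right]] += 1
            else:
                # Loop ended because the window covers str2 and is shorter.
                best = str1[left:right + 1]
    return best
```

This stays quadratic or worse in the bad cases, so it is only useful as a baseline to compare the linear-time answers against.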
