Algorithm for finding all of the shared substrings of any length between 2 strings, and then counting occurrences in string 2?

前端 未结 4 509
慢半拍i
慢半拍i 2020-12-30 04:03

I\'ve run into an unusual challenge and so far I\'m unable to determine the most efficient algorithm to attack this.


Given the following 2 strings as a

相关标签:
4条回答
  • 2020-12-30 04:48

    Here is some code illustrating the idea I presented in the comments above. Although it is runnable C++ code, it is more pseudo-code in the sense that the utilized data structures are surely not optimal but they allow a clear view on the algorithm.

    struct Occurrence
    {
        //The vectors contain indices to the first character of the occurrence in ...
        std::vector<size_t> s1;  // ... string 1 and ...
        std::vector<size_t> s2;  // ... string 2.
    };
    
    int main()
    {
        //If you cannot load the entire strings in memory, a memory-mapped file might be
        //worth considering
        std::string s1 = "ABCDE512ABC361EG51D";
        std::string s2 = "ADE5AHDW4131EG1DG5C";
    
        //These vectors store the occurrences of substrings for the current and next length
        std::vector<Occurrence> occurrences, nextOccurrences;
        int length = 1;
    
        std::map<char, Occurrence> occurrenceMap;
        //Initialize occurrences
        for (int i = 0; i < s1.length(); ++i)
            occurrenceMap[s1[i]].s1.push_back(i);
        for (int i = 0; i < s2.length(); ++i)
            occurrenceMap[s2[i]].s2.push_back(i);
    
        for (auto& pair : occurrenceMap)
        {
            if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
                occurrences.push_back(std::move(pair.second));
        }
    
        do
        {
            nextOccurrences.clear();
    
            std::cout << "Length " << length << std::endl;
            for(auto& o : occurrences)
            {
                std::cout << std::string(s1.c_str() + o.s1[0], length) << " occurred "
                          << o.s1.size() << " / " << o.s2.size() << " times." << std::endl;
    
                //Expand the occurrence
                occurrenceMap.clear();
                for (auto p : o.s1)
                {
                    if (p + length < s1.length())
                        occurrenceMap[s1[p + length]].s1.push_back(p);
                }                   
                for (auto p : o.s2)
                {
                    if (p + length < s2.length())
                    occurrenceMap[s2[p + length]].s2.push_back(p);
                }
                for (auto& pair : occurrenceMap)
                {
                    if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
                        nextOccurrences.push_back(std::move(pair.second));
                }
            }
    
            ++length;
            std::swap(occurrences, nextOccurrences);
    
        } while (!occurrences.empty());
    
    
        return 0;
    }
    

    Output:

    Length 1
    1 occurred 3 / 3 times.
    3 occurred 1 / 1 times.
    5 occurred 2 / 2 times.
    A occurred 2 / 2 times.
    C occurred 2 / 1 times.
    D occurred 2 / 3 times.
    E occurred 2 / 2 times.
    G occurred 1 / 2 times.
    Length 2
    1D occurred 1 / 1 times.
    1E occurred 1 / 1 times.
    DE occurred 1 / 1 times.
    E5 occurred 1 / 1 times.
    EG occurred 1 / 1 times.
    G5 occurred 1 / 1 times.
    Length 3
    1EG occurred 1 / 1 times.
    DE5 occurred 1 / 1 times.
    

    The most amount of memory will be used during initialization because there will be an entry for every character of both input strings. If you know the approximate length of the strings, you can choose a more appropriate index data type than size_t. The amount of memory needed is in the order of the input size. So two 100 MB files should be no problem for common computers. After the initialization (more specifically, after the first iteration of the loop), most of these data will be deleted because it is not needed any more.

    0 讨论(0)
  • 2020-12-30 04:54

    Here's a C implementation based on traversing the suffix array of the concatenation of the inputs, with the help of the longest common prefix array. You can replace the programming-contest-grade (O(n log^2 n)) suffix array implementation with a real one (O(n) or O(n log n)) for a large performance improvement. (EDIT: did this, with some other changes reflecting the asker's new requirements: https://github.com/eisenstatdavid/commonsub .)

    #include <inttypes.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    typedef int_fast32_t I32;
    
    #define Constrain(expression) _Static_assert(expression, #expression)
    Constrain(CHAR_BIT == 8);
    #define InputMaxBytes 80000000
    Constrain(InputMaxBytes <= (INT_LEAST32_MAX - 2) / 2);
    #define MaxLen (2 * InputMaxBytes + 2)
    Constrain(MaxLen <= INT_FAST32_MAX / 2);
    
    static I32 Len;
    static I32 Begin2;
    static signed char Buf[MaxLen];
    static int_least32_t SufArr[MaxLen];
    static int_least32_t SufRank[MaxLen];
    static int_least32_t NewRank[MaxLen];
    static int_least32_t *const LongCommPre = NewRank;  // aliased to save space
    static uint_least64_t Bitmap2[(MaxLen >> 6) + 1];
    static int_least32_t SparseCount2[(MaxLen >> 6) + 1];
    static int_least32_t *const Stack = SufRank;  // aliased to save space
    
    static void Slurp(const char *filename) {
      FILE *stream = fopen(filename, "r");
      if (stream == NULL) goto fail;
      I32 n = fread(Buf + Len, sizeof *Buf, InputMaxBytes + 1, stream);
      if (ferror(stream)) goto fail;
      if (n > InputMaxBytes) {
        fprintf(stderr, "%s: file is too large; increase InputMaxBytes\n",
                filename);
        exit(EXIT_FAILURE);
      }
      for (I32 i = 0; i < n; i++) {
        if (Buf[Len + i] < 0) {
          fprintf(stderr,
                  "%s: file contains non-ASCII byte at offset %" PRIdFAST32 "\n",
                  filename, i);
          exit(EXIT_FAILURE);
        }
      }
      Len += n;
      if (fclose(stream) == EOF) goto fail;
      return;
    fail:
      perror(filename);
      exit(EXIT_FAILURE);
    }
    
    static I32 Radix;
    
    static int CompareRankPairs(const void *iPtr, const void *jPtr) {
      I32 i = *(const int_least32_t *)iPtr;
      I32 j = *(const int_least32_t *)jPtr;
      if (SufRank[i] < SufRank[j]) return -1;
      if (SufRank[i] > SufRank[j]) return 1;
      I32 iRank = i + Radix < Len ? SufRank[i + Radix] : -2;
      I32 jRank = j + Radix < Len ? SufRank[j + Radix] : -2;
      if (iRank < jRank) return -1;
      if (iRank > jRank) return 1;
      return 0;
    }
    
    static void BuildSuffixArray(void) {
      for (I32 i = 0; i < Len; i++) {
        SufArr[i] = i;
        SufRank[i] = Buf[i];
      }
      for (Radix = 1; true; Radix *= 2) {
        qsort(SufArr, Len, sizeof *SufArr, CompareRankPairs);
        NewRank[0] = 0;
        for (I32 i = 1; i < Len; i++) {
          NewRank[i] = CompareRankPairs(&SufArr[i - 1], &SufArr[i]) == 0
                           ? NewRank[i - 1]
                           : NewRank[i - 1] + 1;
        }
        for (I32 i = 0; i < Len; i++) {
          SufRank[SufArr[i]] = NewRank[i];
        }
        if (NewRank[Len - 1] == Len - 1) break;
      }
    
      I32 lenCommPre = 0;
      for (I32 i = 0; i < Len; i++) {
        if (SufRank[i] == Len - 1) {
          LongCommPre[SufRank[i]] = -1;
          continue;
        }
        while (Buf[i + lenCommPre] == Buf[SufArr[SufRank[i] + 1] + lenCommPre]) {
          lenCommPre++;
        }
        LongCommPre[SufRank[i]] = lenCommPre;
        if (lenCommPre > 0) lenCommPre--;
      }
    }
    
    static I32 PopCount(uint_fast64_t x) {
      I32 v = 0;
      while (x != 0) {
        x &= x - 1;
        v++;
      }
      return v;
    }
    
    static void BuildCumCount2(void) {
      for (I32 i = 0; i < Len; i++) {
        if (SufArr[i] >= Begin2) {
          Bitmap2[i >> 6] |= UINT64_C(1) << (i & 63);
          SparseCount2[i >> 6]++;
        }
      }
      for (I32 i = 0; i < (Len >> 6); i++) {
        SparseCount2[i + 1] += SparseCount2[i];
      }
    }
    
    static I32 CumCount2(I32 i) {
      return SparseCount2[i >> 6] - PopCount(Bitmap2[i >> 6] >> (i & 63));
    }
    
    static void FindCommonStrings(void) {
      I32 lenCommPre = -1;
      for (I32 i = 0; i < Len; i++) {
        while (lenCommPre > LongCommPre[i]) {
          I32 begin = Stack[lenCommPre];
          I32 end = i + 1;
          I32 count2 = CumCount2(end) - CumCount2(begin);
          if (count2 > 0 && count2 < end - begin && lenCommPre > 0) {
            printf("%" PRIdFAST32 "\t%.*s\n", count2, (int)lenCommPre,
                   Buf + SufArr[begin]);
          }
          lenCommPre--;
        }
        while (lenCommPre < LongCommPre[i]) {
          lenCommPre++;
          Stack[lenCommPre] = i;
        }
      }
    }
    
    int main(int argc, char *argv[]) {
      if (argc != 3) {
        fputs("usage: commonsub needle haystack\n", stderr);
        exit(EXIT_FAILURE);
      }
      Len = 0;
      Slurp(argv[1]);
      Buf[Len] = -1;
      Len++;
      Begin2 = Len;
      Slurp(argv[2]);
      Buf[Len] = -2;  // sentinel
      BuildSuffixArray();
      if (false) {
        for (I32 i = 0; i < Len; i++) {
          printf("%" PRIdFAST32 "\t%" PRIdLEAST32 "\t%" PRIdLEAST32 "\t%.*s\n", i,
                 SufArr[i], LongCommPre[i], (int)(Len - SufArr[i]),
                 Buf + SufArr[i]);
        }
      }
      BuildCumCount2();
      FindCommonStrings();
    }
    
    0 讨论(0)
  • 2020-12-30 04:59

    After looking at the two strings and thinking about this for a bit I've done this procedure in my head and now I'm going to translate it into steps.

    String 1: ABCDE512ABC361EG51D  // S1
    String 2: ADE5AHDW4131EG1DG5C  // S2
    

    When I was thinking about this we can compare characters and or substrings from S1 to S2 while keeping track of occurrences.

    S1[0] = 'A'  compare S2[0]  = 'A' = true : found A in S2 at location 0
    S1[0] = 'A'  compare S2[1]  = 'D' = false
    S1[0] = 'A'  compare S2[2]  = 'E' = false
    S1[0] = 'A'  compare S2[3]  = '5' = false
    S1[0] = 'A'  compare S2[4]  = 'A' = true : found A in S2 at location 4
    S1[0] = 'A'  compare S2[5]  = 'H' = false
    S1[0] = 'A'  compare S2[6]  = 'D' = false
    S1[0] = 'A'  compare S2[7]  = 'W' = false
    S1[0] = 'A'  compare S2[8]  = '4' = false
    S1[0] = 'A'  compare S2[9]  = '1' = false
    S1[0] = 'A'  compare S2[10] = '3' = false
    S1[0] = 'A'  compare S2[11] = '1' = false; 
    S1[0] = 'A'  compare S2[12] = 'E' = false; 
    S1[0] = 'A'  compare S2[13] = 'G' = false;
    S1[0] = 'A'  compare S2[14] = '1' = false;
    S1[0] = 'A'  compare S2[15] = 'D' = false;
    S1[0] = 'A'  compare S2[16] = 'G' = false;
    S1[0] = 'A'  compare S2[17] = '5' = false;
    S1[0] = 'A'  compare S2[18] = 'C' = false;
    
    // End of First Search - Occurrences of 'A' in S2 is 2 at locations {0,4}
    
    // Next Iteration
    String 1: ABCDE512ABC361EG51D  // S1
    String 2: ADE5AHDW4131EG1DG5C  // S2
    
    // Repeat this for all single characters Of S1 against S2
    'A' in S2 = 2  at {0,4}
    'B' in S2 = 0 
    'C' in S2 = 1  at {18}
    'D' in S2 = 3  at {1,6,15}
    'E' in S2 = 2  at {2,12}
    '5' in S2 = 2  at {3,17}
    '1' in S2 = 3  at {9,11,14}
    '2' in S2 = 0
    'A' Already Found Above Skip
    'B' Already Found Above Skip
    'C' Already Found Above Skip
    '3' in S2 = 1  at {10}
    '6' in S2 = 0
    '1' Already Found Above Skip
    'E' Already Found Above Skip
    'G' in S2 = 2  at {13, 16}
    '5' Already Found Above Skip
    '1' Already Found Above Skip
    'D' Already Found Above Skip
    

    This would conclude the first set of iterations for doing all single characters and as you can see we also built a list and a map or sets of not only occurrences but also their locations and we can store them for future references. So if we begin to search for S1[0 & 1] in S2 we know that S1[1] does not exist in S2 so we can break and don't need to go down that chain and since we can break out of that branch we can also skip over doing S1[1 & ...N] and move directly to S1[2] and we know that there is only 1 occurrence of S1[2] which is 'C' in S2 located at {18} which is the end of the string so there is no need to look for S1[2 & ... N] so we can skip over this and move to S1[3] which is 'D' and we know that it does exist in S2 at {1,6,15} so now we can begin our search of S1[3 & ... N] beginning with S2[1 & ... N] then again do the same search of S1[3 & ... N] starting at S2[6 & ... N] and finally again starting S2[15 & ...N] then we have now found all sub strings that start with D in S2 and we can save their occurrences; however this is were we do want to find the longest substring between the two. The longest sub string is "DE5" and there is only one occurrence of it, but from this we have also already found the sub strings "DE" & "E5" so we can search for them at this point as well and we then find that there are 1 occurrence of each. And we just repeat this process. It will take sort of a long time at first, but the more you traverse through the strings, the faster it will work because of eliminating already found occurrences as well as skipping over non found sub strings of S1 in S2.

    This is the logical approach that I took without using any code or programming semantics for it is just the basic algorithm of doing this logically. It now becomes a matter of determination to put this into functions and containers to write a source code implementation of it.

    EDIT - As asked in the comments about the difference of this versus another's answer and with the time & space complexity here is a version of my algorithm doing the first pass searching for single characters and creating the tables of positions and if they exist in the 2nd string. The stored vector in the class contains each unique character in S1 within S2. This can then be used to help find longer substrings.

    // C++ - The user asked for this in C but I haven't used C in nearly 10 years so this is my version of it in C++ :( 
    #include <string>
    #include <vector>
    
    class SubStringSearch {
    private:
        std::string S1;
        std::string S2; 
    
        struct SubstringResult {
            std::string substring;
            bool found;
            std::vector<unsigned> positions;
    
            SubstringResult(){}
            SubstringResult( const std::string& substringIn, bool foundIn, std::vector<unsigned> positionsIn ) :
                substring( substringIn ), found( foundIn ), positions( positionsIn ) {}
        };
    
        std::vector<SubstringResult> results;
    
    public:
        SubStringSearch( const std::string& s1, const std::string& s2 ) : S1( s1 ), S2( s2 ) {}
    
        void compareStringsFirstPass();
        std::vector<unsigned> findLocations( const std::string& str, char findIt );
        void printResults() const;
    
    };
    
    std::vector<unsigned> SubStringSearch::findLocations( const std::string& str, char findIt ) {
        std::vector<unsigned> locations;
        for ( unsigned i = 0; i < str.size(); ++i ) {
            if ( str[i] == findIt ) {
                locations.push_back( i );
            }
        }
        return locations;
    }
    
    void SubStringSearch::compareStringsFirstPass() {
        std::vector<unsigned> positions;
        std::string sub;
        bool alreadyFound = false;
    
        for ( unsigned idx = 0; idx < S1.size(); ++idx ) {
            sub = S1[idx];
    
            if ( idx > 0 ) {
                for ( unsigned u = 0; u < results.size(); ++u ) {
                    if ( sub == results[u].substring ) {
                        alreadyFound = true;
                        break;
                    }
                }
            }
    
            // Added An If Else Here To Reduce Unneeded Calls To findLocations()
            if ( alreadyFound ) {
                alreadyFound = false;
                continue;
            } else {
                positions = findLocations( S2, S1[idx] );
            }
    
            if ( positions.size() > 0 && !alreadyFound ) {
                results.push_back( SubstringResult( sub, true, positions ) );
            } else if ( !alreadyFound ) {
                positions.clear();
                results.push_back( SubstringResult( sub, false, positions ) );
            }
    
            positions.clear();
            alreadyFound = false;
        }
    }
    
    void SubStringSearch::printResults() const {
        for ( unsigned u = 0; u < results.size(); ++u ) {
            if ( results[u].found ) {
                std::cout << results[u].substring << " found in S2 at " << std::setw(2);
                for ( unsigned i = 0; i < results[u].positions.size(); ++i ) {
                    std::cout << std::setw(2) << results[u].positions[i] << " ";
                }
                std::cout << std::endl;
            }
        }
    }
    
    int main() {
        std::string S1( "ABCDE512ABC361EG51D" );
        std::string S2( "ADE5AHDW4131EG1DG5C" );
    
        SubStringSearch searchStrings( S1, S2 );
        searchStrings.compareStringsFirstPass();
    
        std::cout << "break point";
    
        return 0;
    } // main
    

    Place a break point on that last print line and go into your debugger for either your locals or your autos in MSVC or something equivalent for your version of your compiler / debugger and check out the contents of the class's member variable that is a std::vector and you will see the character from S1 and attached to it will be a bool flag if it is found or not as well as a std::vector for each of the positions. So if the flag is false then the vector size should be 0 and vise versa if the vector size is > 0 then the flag should be true; also the size of the vector of positions is also the count or the occurrences of that character in the 2nd string which makes this nice because we don't have to calculate anything else we can just get that from the vector itself.

    Now this is not the complete or full algorithm as this is only the first pass of doing each single character of string 1 and looking into string 2 while building the needed table and skipping over contents that have already been found. It will be up to the OP to build upon this if they so choose to complete the rest of the algorithm. If I happen to find some free time in the near future I may go ahead and complete the full algorithm.

    0 讨论(0)
  • 2020-12-30 05:00

    From what I can understand, breaking up the string to all possible sub-strings is in itself an O(n*n) operation.

    abcd
    ====
    a,b,c,d
    ab,bc,cd
    abc,bcd
    abcd
    ************************
    abcdefgh
    ========
    a,b,c,d,e,f,g,h
    ab,bc,cd,de,ef,fg,gh
    abc,bcd,cde,def,efg,fgh
    abcd,bcde,cdef,defg,efgh
    abcde,bcdef,cdefg,defgh
    abcdef,bcdefg,cdefgh
    abcdefg,bcdefgh
    abcdefgh
    

    As such, it doesn't look like a solution in linear time is possible.

    Further more to actually solve it, from a Java language perspective, you'd have to first break it up and store it in a set or a map (map can have substring as key and the number of occurrences as count).

    Then repeat the step for the second string as well.

    Then you can iterate over the first, checking if the entry exists in the second string's map and also increment the number of occurrences for that sub-string in parallel.

    If you are using 'C', then you can try sorting the array of sub-strings and then use binary search to find matches (while having a two-dimensional array to keep track of the string and the count of occurrences).

    You said you had a tree approach that ran faster. Do you mind posting a sample so as to how you used a tree ? Was it for representing the sub-strings or to help generate it?

    0 讨论(0)
提交回复
热议问题