How to find repeating sequence of characters in a given array?

故里飘歌 2020-12-02 12:42

My problem is to find the repeating sequence of characters in a given array. Simply put, I want to identify the pattern in which the characters appear.

          


        
14 answers
  • 2020-12-02 13:16

    Not sure how you define "efficiently". For easy/fast implementation you could do this in Java:

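        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        // Returns the shortest unit whose repetition makes up the whole text, or null if there is none.
        private static String findSequence(String text) {
            Pattern pattern = Pattern.compile("(.+?)\\1+");
            Matcher matcher = pattern.matcher(text);
            return matcher.matches() ? matcher.group(1) : null;
        }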
    

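    It tries to find the shortest string (.+?) that must be repeated at least once (\1+) so that the repetitions together match the entire input text. Because matches() requires the whole input to match, a text that is not made up purely of whole repetitions yields null.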
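    To make the behaviour of the regex concrete, here is a minimal, self-contained sketch that wraps the method above in a class and runs it on a few inputs. The class name RepeatingUnitDemo and the sample strings are made up for illustration; the pattern itself is the one from this answer, only compiled once into a constant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatingUnitDemo {
    // Same pattern as in the answer: a reluctant group followed by one or more copies of it.
    private static final Pattern REPEATED = Pattern.compile("(.+?)\\1+");

    // Returns the shortest unit whose repetition makes up the whole text, or null.
    static String findSequence(String text) {
        Matcher matcher = REPEATED.matcher(text);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findSequence("abcabcabc")); // abc
        System.out.println(findSequence("aaaa"));      // a
        System.out.println(findSequence("abcab"));     // null: the trailing "ab" is not a whole repetition
    }
}
```

    Note that a trailing partial repetition makes the whole match fail, and that `.` does not match line terminators unless the pattern is compiled with Pattern.DOTALL.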

  • 2020-12-02 13:17

    Here is a more general solution that finds repeating subsequences within a sequence (of anything), where the subsequences do not have to start at the beginning or immediately follow each other.

    Given a sequence b of n elements (b[0..n-1]) containing the data in question, and a threshold t such that only repeated runs longer than t are reported:

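    // Java fragment; assumes b is an array (e.g. char[]), n = b.length, t = length threshold.
    int l_max = 0, i_max = 0, j_max = 0;
    for (int i = 0; i < n - (t * 2); i++) {
      for (int j = i + t; j < n - t; j++) {
        int l = 0;
        // Extend the match while both runs agree, stay in bounds, and do not overlap.
        while (i + l < j && j + l < n && b[i + l] == b[j + l])
          l++;
        if (l > t) {
          System.out.println("Sequence of length " + l + " found at " + i + " and " + j);
          if (l > l_max) {
            l_max = l;
            i_max = i;
            j_max = j;
          }
        }
      }
    }
    if (l_max > t) {
      System.out.println("Longest repeated sequence found at " + i_max + " and " + j_max + " (" + l_max + " long)");
    }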
    

    Basically:

    1. Start at the beginning of the data and iterate until within 2*t of the end (two non-overlapping subsequences of length t cannot fit in less than 2*t of space).
    2. For the second subsequence, start at least t elements beyond where the first subsequence begins.
    3. Reset the length of the discovered subsequence to 0, then check whether the elements at i+l and j+l are equal. As long as they are, increment l. When they differ, or the first run would reach j, or the end of the data is hit, you have reached the end of the common subsequence. If it is longer than the threshold, print the result.
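    The steps above can be tried out with a small self-contained sketch. It assumes a char[] input and returns only the longest hit instead of printing every one; the class name RepeatedRunSearch, the method name longestRepeatedRun, and the int[] result layout are hypothetical choices for this example, not part of the original answer. As in the answer, only runs strictly longer than t count.

```java
import java.util.Arrays;

final class RepeatedRunSearch {
    // Returns {length, firstStart, secondStart} for the longest repeated,
    // non-overlapping run that is strictly longer than t, or null if none exists.
    static int[] longestRepeatedRun(char[] b, int t) {
        int n = b.length;
        int lMax = 0, iMax = 0, jMax = 0;
        for (int i = 0; i < n - 2 * t; i++) {          // step 1: first run start
            for (int j = i + t; j < n - t; j++) {      // step 2: second run start
                int l = 0;                             // step 3: extend while equal
                while (i + l < j && j + l < n && b[i + l] == b[j + l]) {
                    l++;
                }
                if (l > t && l > lMax) {
                    lMax = l;
                    iMax = i;
                    jMax = j;
                }
            }
        }
        return lMax > t ? new int[] {lMax, iMax, jMax} : null;
    }

    public static void main(String[] args) {
        // "abcd" occurs at index 2 and again at index 8.
        int[] hit = longestRepeatedRun("xxabcdyyabcdzz".toCharArray(), 2);
        System.out.println(Arrays.toString(hit)); // [4, 2, 8]
    }
}
```

    The nested loops with the inner extension make this roughly cubic in the worst case, so it is best suited to short inputs.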