Swap letters in a string

我寻月下人不归 2021-01-12 12:59

I need to swap letters in a string with the following rules:

  • A is replaced by T
  • T is replaced by A
  • C is replaced by G
  • G is replaced by C
4 answers
  • 2021-01-12 13:37

    DNA has a small alphabet, so you can use a lookup table and replace the conditional statements with simple array indexing.

    This approach:

    • Traverses the sequence only once.
    • Eliminates the conditional statements.
    • Can preserve letter case, which is sometimes used to convey information in DNA sequences.
    • Can handle IUPAC ambiguity codes.
    • Can handle gaps.
    • Can easily provide a reverse complement.

    First, you need a lookup table.

    // Requires: import java.nio.charset.StandardCharsets;
    private static final String COMPLEMENT_TABLE 
      // 0123456789ABCDEF0123456789ABCDEF
      = "                                " // 0-31
      + "             -                  " // 32-63
      + " TVGH  CD  M KN   YSAABWXR      " // 64-95
      + " tvgh  cd  m kn   ysaabwxr      "; // 96-127
      //  ABCDEFGHIJKLMNOPQRSTUVWXYZ
    
    private static final byte[] COMPLEMENT_TABLE_BYTES 
      = COMPLEMENT_TABLE.getBytes( StandardCharsets.US_ASCII );
    

    Then you can find each base's complement with a simple table lookup.

    public static byte[] complement( byte[] sequence ) {
        int length = sequence.length;
        byte[] result = new byte[ length ];
    
        for ( int i = 0; i < length; ++i ) {
            result[i] = COMPLEMENT_TABLE_BYTES[ sequence[i] ];
        }
    
        return result;
    }
    

    If desired for convenience with small sequences, you can provide a method that accepts and returns a String.

    public static String complement( String sequence ) {
        byte[] complementBytes = complement( 
          sequence.getBytes( StandardCharsets.US_ASCII ));
        return new String( complementBytes, StandardCharsets.US_ASCII );
    }
    

    The reverse complement can be computed with the same kind of loop, writing each complemented base to the mirrored position in the result.

    public static byte[] reverseComplement( byte[] sequence ) {
        int length = sequence.length;
        byte[] result = new byte[ length ];
    
        for ( int i = 0; i < length; ++i ) {
            result[ (length - i) - 1] = COMPLEMENT_TABLE_BYTES[ sequence[i] ];
        }
    
        return result;
    }
    
    public static String reverseComplement( String sequence ) {
        byte[] complementBytes = reverseComplement( 
          sequence.getBytes( StandardCharsets.US_ASCII ));
        return new String( complementBytes, StandardCharsets.US_ASCII );
    }
    

    Using your example sequence:

    public static void main(String[] args) {
        String sequence = "ACGTA";
    
        String complementSequence = complement( sequence );
        System.out.println( String.format( 
           "complement(%s) = %s", sequence, complementSequence ));
    
        String reverseComplementSequence = reverseComplement( sequence );
        System.out.println( String.format( 
          "reverseComplement(%s) = %s", sequence, reverseComplementSequence ));
    }
    

    We get this output:

    complement(ACGTA) = TGCAT
    reverseComplement(ACGTA) = TACGT
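
    The claims about letter case, IUPAC ambiguity codes, and gaps can be checked with a small self-contained sketch. It builds an equivalent 128-entry table programmatically instead of from the string literal. The class name ComplementTableSketch, the FROM/TO pair strings, and the defensive "& 0x7F" mask are assumptions of this sketch, not part of the code above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ComplementTableSketch {
    // Hypothetical pair strings: FROM.charAt(i) is complemented to TO.charAt(i).
    // The pairs mirror the uppercase row of the literal table, plus the gap '-'.
    private static final String FROM = "ABCDGHKMNRSTUVWXY-";
    private static final String TO   = "TVGHCDMKNYSAABWXR-";

    private static final byte[] TABLE = buildTable();

    private static byte[] buildTable() {
        byte[] table = new byte[128];
        // Unmapped characters become a space, as in the literal table.
        Arrays.fill(table, (byte) ' ');
        for (int i = 0; i < FROM.length(); i++) {
            char from = FROM.charAt(i);
            char to = TO.charAt(i);
            table[from] = (byte) to;
            // Fill the lowercase entry too, so letter case is preserved.
            table[Character.toLowerCase(from)] = (byte) Character.toLowerCase(to);
        }
        return table;
    }

    static String reverseComplement(String sequence) {
        byte[] in = sequence.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            // The mask keeps the index inside the 128-entry table.
            out[in.length - 1 - i] = TABLE[in[i] & 0x7F];
        }
        return new String(out, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Mixed case, ambiguity codes (N, R, Y) and a gap in one sequence.
        System.out.println(reverseComplement("acgtNN-RY")); // prints RY-NNacgt
    }
}
```

    Building the table from pairs is only one option; the string literal above is more compact, while the pair form makes it harder to misalign an entry.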
    
  • 2021-01-12 13:41

    As I explained yesterday, Strings are immutable: you can't change a String, so you have to create a new one and replace the old one.

    You can solve your problem like this:

    String s = "ACGTA";
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      switch(c) {
        case 'A': sb.append('T');break;
        case 'T': sb.append('A');break;
        case 'C': sb.append('G');break;
        case 'G': sb.append('C');break;
        default: //handle error here -> invalid char in String
      }
    }
    s = sb.toString();
    

    The advantage of this solution is that it avoids creating many intermediate String objects: every replace call creates a new String, which can hurt performance if you have to convert a lot of DNA sequences.


    Here is a more performant version based on the very helpful comments from polygenelubricants and rsp:

    String s = "ACGTA";
    char[] reverse = new char[s.length()];
    for (int i = 0; i < reverse.length; i++) {
      switch(s.charAt(i)) {
        case 'A': reverse[i] = 'T';break;
        case 'T': reverse[i] = 'A';break;
        case 'C': reverse[i] = 'G';break;
        case 'G': reverse[i] = 'C';break;
        default: //handle error here -> invalid char in String
      }
    }
    s = new String(reverse);
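
    The default branch is left open in both versions. In the char[] version, an unhandled character would leave a '\0' in the result. One possible way to fill it in, as a sketch only: throw an exception that names the offending character and its position. The class name StrictComplement and the choice of IllegalArgumentException are assumptions, not something the original code prescribes.

```java
final class StrictComplement {
    // Complements A<->T and C<->G; rejects any other character.
    static String complement(String s) {
        char[] result = new char[s.length()];
        for (int i = 0; i < result.length; i++) {
            char c = s.charAt(i);
            switch (c) {
                case 'A': result[i] = 'T'; break;
                case 'T': result[i] = 'A'; break;
                case 'C': result[i] = 'G'; break;
                case 'G': result[i] = 'C'; break;
                default:
                    // Assumed policy: fail fast instead of silently writing '\0'.
                    throw new IllegalArgumentException(
                        "Invalid base '" + c + "' at index " + i);
            }
        }
        return new String(result);
    }

    public static void main(String[] args) {
        System.out.println(complement("ACGTA")); // prints TGCAT
    }
}
```

    If unknown characters should pass through instead, the default branch could copy c into the result rather than throw.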
    
  • I would go for a more general solution like this:

    public String tr(String original, String trFrom, String trTo) {
      StringBuilder sb = new StringBuilder();
    
      for (int i = 0; i < original.length(); ++i) {
        int charIndex = trFrom.indexOf(original.charAt(i));
        if (charIndex >= 0) {
          sb.append(trTo.charAt(charIndex));
        } else {
          sb.append(original.charAt(i));
        }
      }
    
      return sb.toString(); 
    }
    

    Calling the function like this would give the result you need:

    tr("ACGTA", "ATCG", "TAGC")
    

    So the function behaves much like the Unix tr utility:

    echo ACGTA | tr ATCG TAGC
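
    To try this out, here is a runnable sketch of the same method as a static helper. It also shows that characters not listed in trFrom, such as N, pass through unchanged. The class name TrDemo and the length check are additions of this sketch; Unix tr treats sets of different lengths by its own rules, which this check does not imitate.

```java
final class TrDemo {
    // Replaces each character found in trFrom with the character at the
    // same position in trTo; all other characters are copied unchanged.
    static String tr(String original, String trFrom, String trTo) {
        // Assumed guard: without it, a shorter trTo could cause an
        // index-out-of-bounds error inside the loop.
        if (trFrom.length() != trTo.length()) {
            throw new IllegalArgumentException(
                "trFrom and trTo must have the same length");
        }
        StringBuilder sb = new StringBuilder(original.length());
        for (int i = 0; i < original.length(); ++i) {
            char c = original.charAt(i);
            int charIndex = trFrom.indexOf(c);
            sb.append(charIndex >= 0 ? trTo.charAt(charIndex) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(tr("ACGTA", "ATCG", "TAGC"));  // prints TGCAT
        System.out.println(tr("ACGTNA", "ATCG", "TAGC")); // prints TGCANT
    }
}
```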
    
  • 2021-01-12 13:50

    A search for java "A to T, T to A" turned up this suggestion:

    String sequence = "AATTTCTCGGTTTCAAT";
    sequence = sequence.replace("A", "t")
                       .replace("T", "a")
                       .replace("C", "g")
                       .replace("G", "c")
                       .toUpperCase();
    System.out.println(sequence);
    

    This is a simple and concise solution that works for your specific situation and will have acceptable performance if your DNA strings are relatively short. The lowercase intermediates keep a later replace from undoing an earlier one, and toUpperCase() restores the case at the end. For a more general solution that handles large amounts of data, you should iterate over the characters one by one and build a new string. Or, as polygenelubricants pointed out, consider a storage format that uses only 2 bits per base instead of 16.
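
    For the 2-bit idea, here is a rough sketch of what such a storage format could look like. The encoding A=0, C=1, G=2, T=3 is my own assumption, chosen so that the complement of a code is 3 minus the code, which means complementing a whole packed byte is a single bitwise NOT. The class and method names are hypothetical, and the sketch only handles the four plain bases.

```java
final class TwoBitDna {
    // Assumed encoding: A=0, C=1, G=2, T=3, so complement(code) == code ^ 3.
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    static int encode(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:
                throw new IllegalArgumentException("Invalid base: " + base);
        }
    }

    // Packs four bases per byte, first base in the lowest two bits.
    static byte[] pack(String sequence) {
        byte[] packed = new byte[(sequence.length() + 3) / 4];
        for (int i = 0; i < sequence.length(); i++) {
            packed[i / 4] |= encode(sequence.charAt(i)) << ((i % 4) * 2);
        }
        return packed;
    }

    // The caller supplies the length, because the last byte may be padded.
    static String unpack(byte[] packed, int length) {
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            out[i] = BASES[(packed[i / 4] >> ((i % 4) * 2)) & 3];
        }
        return new String(out);
    }

    // Flipping all bits XORs every 2-bit code with 3, complementing 4 bases at once.
    static byte[] complement(byte[] packed) {
        byte[] out = new byte[packed.length];
        for (int i = 0; i < packed.length; i++) {
            out[i] = (byte) ~packed[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = pack("ACGTA"); // 5 bases fit in 2 bytes
        System.out.println(unpack(complement(packed), 5)); // prints TGCAT
    }
}
```

    The padding bits in the last byte are flipped as well, but unpack never reads them because it stops at the given length.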
