Swap letters in a string

我寻月下人不归

I need to swap letters in a string with the following rules:

A is replaced by T
T is replaced by A
C is replaced by G
G is replaced by C

4 answers

迷失自我

2021-01-12 13:37

DNA has a small alphabet, so you can use a lookup table and replace the conditional statements with simple array indexing.

This approach:

Traverses the sequence only once.
Eliminates the conditional statements.
Can be stable in terms of letter case, which is sometimes used to communicate information in DNA sequences.
Can handle IUPAC ambiguity codes.
Can handle gaps.
Can easily provide a reverse complement.

First, you need a lookup table.

import java.nio.charset.StandardCharsets;

private static final String COMPLEMENT_TABLE 
  // 0123456789ABCDEF0123456789ABCDEF
  = "                                " // 0-31
  + "             -                  " // 32-63
  + " TVGH  CD  M KN   YSAABWXR      " // 64-95
  + " tvgh  cd  m kn   ysaabwxr      "; // 96-127
  //  ABCDEFGHIJKLMNOPQRSTUVWXYZ

private static final byte[] COMPLEMENT_TABLE_BYTES 
  = COMPLEMENT_TABLE.getBytes( StandardCharsets.US_ASCII );

Then you can find each complementary base with a simple table lookup.

public static byte[] complement( byte[] sequence ) {
    int length = sequence.length;
    byte[] result = new byte[ length ];

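    for ( int i = 0; i < length; ++i ) {
        // Assumes 7-bit ASCII input: a byte outside 0-127 is negative in Java
        // and would throw ArrayIndexOutOfBoundsException here.
        result[i] = COMPLEMENT_TABLE_BYTES[ sequence[i] ];
    }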

    return result;
}

If desired for convenience with small sequences, you can provide a method that accepts and returns a String.

public static String complement( String sequence ) {
    byte[] complementBytes = complement( 
      sequence.getBytes( StandardCharsets.US_ASCII ));
    return new String( complementBytes, StandardCharsets.US_ASCII );
}

The reverse complement can be computed in the same loop.

public static byte[] reverseComplement( byte[] sequence ) {
    int length = sequence.length;
    byte[] result = new byte[ length ];

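    for ( int i = 0; i < length; ++i ) {
        // Write each complemented base at the mirrored position.
        // Assumes 7-bit ASCII input, as in complement().
        result[ (length - i) - 1] = COMPLEMENT_TABLE_BYTES[ sequence[i] ];
    }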

    return result;
}

public static String reverseComplement( String sequence ) {
    byte[] complementBytes = reverseComplement( 
      sequence.getBytes( StandardCharsets.US_ASCII ));
    return new String( complementBytes, StandardCharsets.US_ASCII );
}

Using your example sequence:

public static void main(String[] args) {
    String sequence = "ACGTA";

    String complementSequence = complement( sequence );
    System.out.println( String.format( 
       "complement(%s) = %s", sequence, complementSequence ));

    String reverseComplementSequence = reverseComplement( sequence );
    System.out.println( String.format( 
      "reverseComplement(%s) = %s", sequence, reverseComplementSequence ));
}

We get this output:

complement(ACGTA) = TGCAT
reverseComplement(ACGTA) = TACGT
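
The hand-aligned string literal above is compact, but a single misplaced space shifts every entry after it. As a rough alternative sketch, the same 128-entry table could be built from explicit pairs. The class name ComplementTableSketch and the FROM/TO strings are hypothetical, not part of the answer above, and the pairs are assumed to mirror the entries of that table.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: builds the complement table from explicit pairs
// instead of a hand-aligned string literal.
final class ComplementTableSketch {
    // Each character in FROM is complemented by the character at the
    // same position in TO (bases, IUPAC ambiguity codes, and the gap).
    private static final String FROM = "ACGTURYSWKMBDHVNX-";
    private static final String TO   = "TGCAAYRSWMKVHDBNX-";

    private static final byte[] TABLE = buildTable();

    private static byte[] buildTable() {
        byte[] table = new byte[128];
        // Unmapped characters become a space, as in the literal table.
        Arrays.fill(table, (byte) ' ');
        for (int i = 0; i < FROM.length(); i++) {
            char from = FROM.charAt(i);
            char to = TO.charAt(i);
            table[from] = (byte) to;
            // Register the lower-case pair too, so letter case is preserved.
            table[Character.toLowerCase(from)] = (byte) Character.toLowerCase(to);
        }
        return table;
    }

    static String complement(String sequence) {
        // US_ASCII encoding replaces unmappable characters with '?',
        // so every byte is in 0-127; the mask is only a safeguard.
        byte[] bytes = sequence.getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = TABLE[bytes[i] & 0x7F];
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(complement("ACGTA"));     // TGCAT
        System.out.println(complement("acgtRYn-"));  // tgcaYRn-
    }
}
```

The second call shows the case-stability, ambiguity-code and gap handling listed earlier: lower-case bases stay lower-case, R and Y swap, and n and - map to themselves.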


名媛妹妹

2021-01-12 13:41

As I explained yesterday, Strings are immutable: you can't change a String, so you have to create a new one and replace the old one.

You can solve your problem like this:

String s = "ACGTA";
StringBuilder sb = new StringBuilder();
for (char c : s.toCharArray()) {
  switch(c) {
    case 'A': sb.append('T');break;
    case 'T': sb.append('A');break;
    case 'C': sb.append('G');break;
    case 'G': sb.append('C');break;
    default: //handle error here -> invalid char in String
  }
}
s = sb.toString();

The advantage of this solution is that it does not create many intermediate String objects: every 'replace' operation creates a new String, which can kill performance if you have to complement a lot of DNA sequences.

Here is a more performant version based on the very helpful comments from polygenelubricants and rsp:

String s = "ACGTA";
char[] reverse = new char[s.length()];
for (int i = 0; i < reverse.length; i++) {
  switch(s.charAt(i)) {
    case 'A': reverse[i] = 'T';break;
    case 'T': reverse[i] = 'A';break;
    case 'C': reverse[i] = 'G';break;
    case 'G': reverse[i] = 'C';break;
    default: //handle error here -> invalid char in String
  }
}
s = new String(reverse);


不要未来只要你来

2021-01-12 13:43

I would go for a more general solution like this:

public String tr(String original, String trFrom, String trTo) {
  StringBuilder sb = new StringBuilder();

  for (int i = 0; i < original.length(); ++i) {
    int charIndex = trFrom.indexOf(original.charAt(i));
    if (charIndex >= 0) {
      sb.append(trTo.charAt(charIndex));
    } else {
      sb.append(original.charAt(i));
    }
  }

  return sb.toString(); 
}

Calling the function like this would give the result you need:

tr("ACGTA", "ATCG", "TAGC")

So the function behaves much like the Unix tr utility:

echo ACGTA | tr ATCG TAGC


陌清茗

2021-01-12 13:50
Searching for java "A to T, T to A" turned up this suggestion:
```
String sequence = "AATTTCTCGGTTTCAAT";
sequence = sequence.replace("A", "t")
                   .replace("T", "a")
                   .replace("C", "g")
                   .replace("G", "c")
                   .toUpperCase();
System.out.println(sequence);
```
This is a simple and concise solution that works for your specific situation, and its performance will be acceptable if your DNA strings are relatively short. For a more general solution that handles large amounts of data, you should iterate over the characters one by one and build a new string. Or, as polygenelubricants pointed out, consider a storage format that uses only 2 bits per base instead of the 16 bits of a Java char.
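
To make the 2-bit idea concrete, here is a minimal sketch, assuming the encoding A=0, C=1, G=2, T=3 and sequences of at most 32 bases packed into one long. The class TwoBitDnaSketch and its methods are hypothetical illustrations, not something taken from the comments mentioned above.

```java
// Hypothetical sketch of a 2-bit-per-base encoding (A=0, C=1, G=2, T=3).
final class TwoBitDnaSketch {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Packs up to 32 bases into one long, first base in the lowest two bits.
    static long pack(String sequence) {
        if (sequence.length() > 32) {
            throw new IllegalArgumentException("at most 32 bases fit in a long");
        }
        long packed = 0L;
        for (int i = 0; i < sequence.length(); i++) {
            long code;
            switch (sequence.charAt(i)) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default:
                    throw new IllegalArgumentException("invalid base at index " + i);
            }
            packed |= code << (2 * i);
        }
        return packed;
    }

    // Decodes the lowest 2 * length bits back into a String.
    static String unpack(long packed, int length) {
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            out[i] = BASES[(int) ((packed >>> (2 * i)) & 3L)];
        }
        return new String(out);
    }

    // With this encoding a base and its complement differ in both bits
    // (A=00/T=11, C=01/G=10), so flipping every bit complements all bases.
    // Bits above the sequence length are flipped too; unpack ignores them.
    static long complement(long packed) {
        return ~packed;
    }

    public static void main(String[] args) {
        long packed = pack("ACGTA");
        System.out.println(unpack(complement(packed), 5)); // TGCAT
    }
}
```

With this particular code assignment the whole swap reduces to one bitwise NOT per 32 bases. The trade-off is that ambiguity codes, gaps and letter case cannot be represented.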