I need to swap letters in a string with the following rules:
DNA has a small alphabet. You can use a lookup table, replacing the conditional statements with simple array indexing.
This approach:
First, you need a lookup table.
// Requires: import java.nio.charset.StandardCharsets;
// Each row covers 32 ASCII codes; the character at index c is the complement of c.
private static final String COMPLEMENT_TABLE
//     0123456789ABCDEF0123456789ABCDEF
    = "                                " // 0-31
    + "             -                  " // 32-63
    + " TVGH  CD  M KN   YSAABWXR      " // 64-95
    + " tvgh  cd  m kn   ysaabwxr      "; // 96-127
//      ABCDEFGHIJKLMNOPQRSTUVWXYZ
private static final byte[] COMPLEMENT_TABLE_BYTES
    = COMPLEMENT_TABLE.getBytes( StandardCharsets.US_ASCII );
Then, you can find the complement's bases by a simple table lookup.
public static byte[] complement( byte[] sequence ) {
    int length = sequence.length;
    byte[] result = new byte[ length ];
    for ( int i = 0; i < length; ++i ) {
        result[i] = COMPLEMENT_TABLE_BYTES[ sequence[i] ];
    }
    return result;
}
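If you would rather not maintain the table as hand-aligned string literals, the same mapping can be built in code. This is only a sketch of one way to do it: the names ComplementTableSketch, FROM, TO and complementStrict are made up for illustration, and unlike the lookup above it rejects characters that have no complement instead of mapping them to a space. It also guards against negative byte values, since Java bytes are signed and a negative value cannot be used as an array index.

```java
import java.nio.charset.StandardCharsets;

final class ComplementTableSketch {
    // Hypothetical pairing strings: FROM.charAt(i) is complemented to TO.charAt(i).
    private static final String FROM = "ACGTUMRWSYKVHDBNX";
    private static final String TO   = "TGCAAKYWSRMBDHVNX";

    private static final byte[] TABLE = buildTable();

    private static byte[] buildTable() {
        byte[] table = new byte[128]; // 0 means "no complement defined"
        for (int i = 0; i < FROM.length(); i++) {
            char from = FROM.charAt(i);
            char to = TO.charAt(i);
            table[from] = (byte) to;
            table[Character.toLowerCase(from)] = (byte) Character.toLowerCase(to);
        }
        table['-'] = '-'; // the gap character maps to itself
        return table;
    }

    static byte[] complementStrict(byte[] sequence) {
        byte[] result = new byte[sequence.length];
        for (int i = 0; i < sequence.length; i++) {
            byte b = sequence[i];
            // Negative bytes (non-ASCII input) and unmapped characters are rejected.
            if (b < 0 || TABLE[b] == 0) {
                throw new IllegalArgumentException(
                    "Invalid base at index " + i + ": " + b);
            }
            result[i] = TABLE[b];
        }
        return result;
    }

    static String complementStrict(String sequence) {
        byte[] bytes = sequence.getBytes(StandardCharsets.US_ASCII);
        return new String(complementStrict(bytes), StandardCharsets.US_ASCII);
    }
}
```

With this sketch, complementStrict("ACGTA") returns "TGCAT", while an input such as "ACZ" throws an IllegalArgumentException.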
If desired for convenience with small sequences, you can provide a method that accepts and returns a String.
public static String complement( String sequence ) {
    byte[] complementBytes = complement(
        sequence.getBytes( StandardCharsets.US_ASCII ));
    return new String( complementBytes, StandardCharsets.US_ASCII );
}
The reverse complement can be computed in the same loop.
public static byte[] reverseComplement( byte[] sequence ) {
    int length = sequence.length;
    byte[] result = new byte[ length ];
    for ( int i = 0; i < length; ++i ) {
        result[ (length - i) - 1] = COMPLEMENT_TABLE_BYTES[ sequence[i] ];
    }
    return result;
}

public static String reverseComplement( String sequence ) {
    byte[] complementBytes = reverseComplement(
        sequence.getBytes( StandardCharsets.US_ASCII ));
    return new String( complementBytes, StandardCharsets.US_ASCII );
}
Using your example sequence:
public static void main(String[] args) {
    String sequence = "ACGTA";
    String complementSequence = complement( sequence );
    System.out.println( String.format(
        "complement(%s) = %s", sequence, complementSequence ));
    String reverseComplementSequence = reverseComplement( sequence );
    System.out.println( String.format(
        "reverseComplement(%s) = %s", sequence, reverseComplementSequence ));
}
We get this output:
complement(ACGTA) = TGCAT
reverseComplement(ACGTA) = TACGT
As I explained yesterday, Strings are immutable: you can't change a String, so you have to create a new one and replace the old one.
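To make the immutability point concrete, here is a minimal sketch (the class and method names are purely illustrative). String.replace returns a new String and leaves the original untouched, so the result has to be assigned to something:

```java
final class ImmutabilityDemo {
    /** Returns a new String; the argument itself is never modified. */
    static String replaceAWithT(String s) {
        return s.replace('A', 'T');
    }

    public static void main(String[] args) {
        String original = "ACGTA";
        String replaced = replaceAWithT(original);
        System.out.println(original); // still ACGTA
        System.out.println(replaced); // TCGTT
    }
}
```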
You can solve your problem like this:
String s = "ACGTA";
StringBuilder sb = new StringBuilder();
for (char c : s.toCharArray()) {
    switch (c) {
        case 'A': sb.append('T'); break;
        case 'T': sb.append('A'); break;
        case 'C': sb.append('G'); break;
        case 'G': sb.append('C'); break;
        default: // handle error here -> invalid char in String
    }
}
s = sb.toString();
The advantage of this solution is that it doesn't create many intermediate String objects. Every replace operation creates a new String, and this can kill performance if you have to complement a lot of DNA sequences.
Here is a more performant version based on the very helpful comments from polygenelubricants and rsp:
String s = "ACGTA";
// The array holds the complement; the order of the bases is not reversed.
char[] complement = new char[s.length()];
for (int i = 0; i < complement.length; i++) {
    switch (s.charAt(i)) {
        case 'A': complement[i] = 'T'; break;
        case 'T': complement[i] = 'A'; break;
        case 'C': complement[i] = 'G'; break;
        case 'G': complement[i] = 'C'; break;
        default: // handle error here -> invalid char in String
    }
}
s = new String(complement);
I would go for a more general solution like this:
public String tr(String original, String trFrom, String trTo) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < original.length(); ++i) {
        int charIndex = trFrom.indexOf(original.charAt(i));
        if (charIndex >= 0) {
            sb.append(trTo.charAt(charIndex));
        } else {
            sb.append(original.charAt(i));
        }
    }
    return sb.toString();
}
Calling the function like this would give the result you need:
tr("ACGTA", "ATCG", "TAGC")
So the function works much like the Unix tr utility:
echo ACGTA | tr ATCG TAGC
A search for java "A to T, T to A" turned up this suggestion:
String sequence = "AATTTCTCGGTTTCAAT";
// Lower-case intermediates keep an already swapped base from being swapped back.
sequence = sequence.replace("A", "t")
                   .replace("T", "a")
                   .replace("C", "g")
                   .replace("G", "c")
                   .toUpperCase();
System.out.println(sequence);
This is a simple and concise solution that works for your specific situation and will have acceptable performance if your DNA strings are relatively short. For a more general solution that handles large amounts of data, you should iterate over the characters one by one and build a new string. Or, as polygenelubricants pointed out, consider a storage format that uses only 2 bits per base instead of the 16 bits of a Java char.
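For what a 2-bit storage format could look like, here is a rough sketch. Everything in it is an assumption made for illustration: the class name TwoBitDna, the encoding A=00, C=01, G=10, T=11, and the choice to pack four bases per byte with the first base in the low bits. With that particular encoding, the complement of a base is just its bitwise inverse, so a whole packed byte can be complemented in one operation. It only handles the four plain bases, not ambiguity codes.

```java
final class TwoBitDna {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Assumed encoding: A=00, C=01, G=10, T=11, so A<->T and C<->G are bit inverses.
    private static int encode(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("Invalid base: " + base);
        }
    }

    /** Packs four bases per byte, first base in the two lowest bits. */
    static byte[] pack(String sequence) {
        byte[] packed = new byte[(sequence.length() + 3) / 4];
        for (int i = 0; i < sequence.length(); i++) {
            packed[i / 4] |= encode(sequence.charAt(i)) << ((i % 4) * 2);
        }
        return packed;
    }

    /** Unpacks the first 'length' bases; the length must be stored separately. */
    static String unpack(byte[] packed, int length) {
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 3;
            out[i] = BASES[code];
        }
        return new String(out);
    }

    /** Complements four bases at a time by inverting every bit. */
    static byte[] complement(byte[] packed) {
        byte[] result = new byte[packed.length];
        for (int i = 0; i < packed.length; i++) {
            result[i] = (byte) ~packed[i];
        }
        return result;
    }
}
```

For example, unpack(complement(pack("ACGTA")), 5) gives "TGCAT". The padding bits in the last byte get inverted too, which is harmless because unpack ignores everything past the given length.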