Question
Suppose I have a DNA sequence and want to get its complement. I tried the following code, but it does not work. What am I doing wrong?
s=readline()
ATCTCGGCGCGCATCGCGTACGCTACTAGC
p=unlist(strsplit(s,""))
h=rep("N",nchar(s))
unlist(lapply(p,function(d){
for b in (1:nchar(s)) {
if (p[b]=="A") h[b]="T"
if (p[b]=="T") h[b]="A"
if (p[b]=="G") h[b]="C"
if (p[b]=="C") h[b]="G"
}
Answer 1:
Use chartr, which is built for this purpose:
> s
[1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> chartr("ATGC","TACG",s)
[1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"
Just give it two equal-length character strings and your string. It is also vectorised over the string being translated:
> chartr("ATGC","TACG",c("AAAACG","TTTTT"))
[1] "TTTTGC" "AAAAA"
Note I'm doing the replacement on the string representation of the DNA rather than the vector. To convert the vector I'd create a lookup-map as a named vector and index that:
> p
[1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T"
[20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C"
> map=c("A"="T", "T"="A","G"="C","C"="G")
> unname(map[p])
[1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
[20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
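To go from the string to its complement through the lookup map in one step, the pieces above can be combined. This is only a sketch: the function name complement_lookup and the choice to turn unrecognised characters into "N" are assumptions, not part of the original answer.

```r
# Hypothetical helper: complement one DNA string via a named lookup vector.
complement_lookup <- function(s) {
  map <- c(A = "T", T = "A", G = "C", C = "G")
  bases <- strsplit(s, "")[[1]]   # split into single characters
  comp <- unname(map[bases])      # look each base up by name
  comp[is.na(comp)] <- "N"        # assumption: unknown bases become "N"
  paste(comp, collapse = "")      # glue back into one string
}

complement_lookup("ATCTCGGCGCGCATCGCGTACGCTACTAGC")
# [1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"
```

For plain A/C/G/T input, chartr remains the shorter route; the lookup form is mainly useful when you already hold the sequence as a vector.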
Answer 2:
The Bioconductor package Biostrings has many useful functions for this sort of operation. Install once:
# biocLite() has been retired; current Bioconductor releases install through BiocManager
install.packages("BiocManager")
BiocManager::install("Biostrings")
then use
library(Biostrings)
dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA"))
complement(dna)
Answer 3:
sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G")
A T C T C G G C G C G C A T C G C G T
"T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
A C G C T A C T A G C
"T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
If you do not want the names (the original bases), you can always strip them with unname:
unname(sapply(p, switch, "A"="T", "T"="A","G"="C","C"="G") )
[1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C"
[19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
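One caveat with the sapply/switch approach: switch returns NULL for a character it does not recognise, so sapply may then hand back a list rather than a character vector. A possible variant, sketched here with vapply and an assumed fallback of "N" for anything unmatched:

```r
p <- unlist(strsplit("ATCNG", ""))   # example input containing a non-ACGT base

comp <- vapply(
  p,
  # the final unnamed argument to switch() acts as the default
  function(b) switch(b, A = "T", T = "A", G = "C", C = "G", "N"),
  character(1),        # every result must be a single string
  USE.NAMES = FALSE    # drop the names, so unname() is not needed
)
comp
# [1] "T" "A" "G" "N" "C"
```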
Answer 4:
There is also the seqinr package:
library(seqinr)
comp(seq) # gives complement
rev(comp(seq)) # gives the reverse complement
Biostrings has a much smaller memory profile, but seqinr is also nice because you can choose the case of the bases (including mixed case) and change them to anything you want, for example a mix of T and U in the same sequence. Biostrings forces you to have either T or U.
Answer 5:
To complement in both upper and lower case, you can use chartr():
n <- "ACCTGccatGCATC"
chartr("acgtACGT", "tgcaTGCA", n)
# [1] "TGGACggtaCGTAG"
To take it a step further and reverse complement the nucleotide sequence, you can use the following function:
library(stringi)
rc <- function(nucSeq)
return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq)))
rc("AcACGTgtT")
# [1] "AacACGTgT"
Answer 6:
Here is an answer using base R. It is deliberately spread over many lines to make each step clear, while still being a single expression. It supports upper and lower case.
revc = function(s){
paste0(
rev(
unlist(
strsplit(
chartr("ATGCatgc","TACGtacg",s)
, "") # from strsplit
) # from unlist
) # from rev
, collapse='') # from paste0
}
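The function above expects a single string. If several sequences need processing at once, one possible variant applies the same steps to each element of a character vector. This is a sketch only, and the name revc_vec is made up for illustration:

```r
# Hypothetical vectorised variant of revc(), base R only.
revc_vec <- function(s) {
  comp <- chartr("ATGCatgc", "TACGtacg", s)   # complement; chartr is vectorised over s
  vapply(
    strsplit(comp, ""),                        # one character vector per sequence
    function(chars) paste(rev(chars), collapse = ""),  # reverse and rejoin
    character(1)
  )
}

revc_vec(c("TATAAT", "TTTCGC", "atgcat"))
# [1] "ATTATA" "GCGAAA" "atgcat"
```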
Answer 7:
I've generalised the rev(comp(seq)) solution from the seqinr package:
install.packages("devtools")
devtools::install_github("TomKellyGenetics/tktools")
tktools::revcomp(seq)
This version is compatible with string inputs and is vectorised to handle list or vector input of multiple strings. The output class should match the input, including case and type. It also supports inputs containing "U" for RNA and can output RNA sequences.
> seq <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
> revcomp(seq)
[1] "GCTAGTAGCGTACGCGATGCGCGCCGAGAT"
> seq <- c("TATAAT", "TTTCGC", "atgcat")
> revcomp(seq)
TATAAT TTTCGC atgcat
"ATTATA" "GCGAAA" "atgcat"
See the manual or the TomKellyGenetics/tktools GitHub repository.
Source: https://stackoverflow.com/questions/20371854/complement-a-dna-sequence