Complement a DNA sequence

情书的邮戳 2020-12-31 12:07

Suppose I have a DNA sequence and want to get its complement. I used the following code, but it does not give me the expected result. What am I doing wrong?

s=readline()
AT         


        
7 Answers
  • 2020-12-31 12:17
    sapply(p, switch,  "A"="T", "T"="A","G"="C","C"="G")
      A   T   C   T   C   G   G   C   G   C   G   C   A   T   C   G   C   G   T 
    "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A" 
      A   C   G   C   T   A   C   T   A   G   C 
    "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G" 
    

    The result is named after the original bases. If you do not want those names, you can strip them with unname.

    unname(sapply(p, switch,  "A"="T", "T"="A","G"="C","C"="G") )
     [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C"
    [19] "A" "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
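
    The calls above assume that p is a character vector with one base per element; the code that built it is not shown in the question. Here is a minimal sketch of one way to get such a vector from a string and to join the result back into a string. The names s, p and comp are only illustrative, and vapply is swapped in for sapply so that each result is checked to be a single string.

    ```r
    # Assumed input: the example sequence used elsewhere in this thread
    s <- "ATCTCGGCGCGCATCGCGTACGCTACTAGC"

    # Split the string into a vector of single characters
    p <- strsplit(s, "")[[1]]

    # Complement each base; USE.NAMES = FALSE avoids the names on the result
    comp <- vapply(p, switch, character(1),
                   "A" = "T", "T" = "A", "G" = "C", "C" = "G",
                   USE.NAMES = FALSE)

    # Join the bases back into one string
    paste(comp, collapse = "")
    # [1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"
    ```

    Note that a character outside A, T, G, C makes switch return NULL, so vapply stops with an error instead of silently returning a list.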
  • 2020-12-31 12:23

    The Bioconductor package Biostrings has many useful functions for this sort of operation. Install once:

    # biocLite() is deprecated; current Bioconductor releases install via BiocManager
    install.packages("BiocManager")
    BiocManager::install("Biostrings")
    

    then use

    library(Biostrings)
    dna = DNAStringSet(c("ATCTCGGCGCGCATCGCGTACGCTACTAGC", "ACCGCTA"))
    complement(dna)
    
  • 2020-12-31 12:24

    Use chartr which is built for this purpose:

    > s
    [1] "ATCTCGGCGCGCATCGCGTACGCTACTAGC"
    > chartr("ATGC","TACG",s)
    [1] "TAGAGCCGCGCGTAGCGCATGCGATGATCG"
    

    Just give it two equal-length character strings (the characters to replace and their replacements) and your string. It is also vectorised over the string being translated:

    > chartr("ATGC","TACG",c("AAAACG","TTTTT"))
    [1] "TTTTGC" "AAAAA" 
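
    One reason chartr suits this task is that it translates every character in a single pass. A chain of one-for-one substitutions does not, because a later step rewrites the output of an earlier one. Here is a small sketch of the difference. The variable names are illustrative, and the gsub chain is only a counterexample, not necessarily what the question's code did.

    ```r
    s <- "ATCG"

    # Sequential substitution: A -> T first, then every T (old and new) -> A
    chained <- gsub("T", "A", gsub("A", "T", s))

    # Simultaneous translation: each character is looked up exactly once
    translated <- chartr("ATGC", "TACG", s)

    chained     # "AACG": both A and T ended up as A
    translated  # "TAGC"
    ```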
    

    Note that I'm doing the replacement on the string representation of the DNA rather than on the vector of single bases. To convert the vector, I'd create a lookup map as a named vector and index it:

    > p
     [1] "A" "T" "C" "T" "C" "G" "G" "C" "G" "C" "G" "C" "A" "T" "C" "G" "C" "G" "T"
    [20] "A" "C" "G" "C" "T" "A" "C" "T" "A" "G" "C"
    > map=c("A"="T", "T"="A","G"="C","C"="G")
    > unname(map[p])
     [1] "T" "A" "G" "A" "G" "C" "C" "G" "C" "G" "C" "G" "T" "A" "G" "C" "G" "C" "A"
    [20] "T" "G" "C" "G" "A" "T" "G" "A" "T" "C" "G"
    
  • 2020-12-31 12:29

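    Here is an answer using base R. The formatting is deliberately odd so that each closing parenthesis can be matched to its call while the body stays a single expression. Note that it returns the reverse complement, and it supports both upper and lower case.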

    revc = function(s){
           paste0(
               rev(
                unlist(
                 strsplit(
                    chartr("ATGCatgc","TACGtacg",s)
                          , "")                        # from strsplit
                       )                               # from unlist
                   )                                   # from rev
                 , collapse='')                        # from paste0
           }
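
    As written, revc is meant for a single string: given a vector of sequences, unlist merges all their characters, so the result is one combined string. Here is a hedged sketch of a variant that reverses each element separately. The name revc_each is hypothetical and not part of the answer above.

    ```r
    # Hypothetical vectorised variant of revc(); revc_each is an illustrative name
    revc_each <- function(s) {
      # Complement every string first; chartr() is vectorised over s
      comp <- chartr("ATGCatgc", "TACGtacg", s)
      # Reverse the characters of each string separately
      vapply(strsplit(comp, ""),
             function(chars) paste(rev(chars), collapse = ""),
             character(1))
    }

    revc_each(c("AAAACG", "ttgc"))
    # [1] "CGTTTT" "gcaa"
    ```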
    
  • 2020-12-31 12:30

    To complement, in both upper and lower case, you can use chartr():

    n <- "ACCTGccatGCATC"
    chartr("acgtACGT", "tgcaTGCA", n)
    # [1] "TGGACggtaCGTAG"
    

    To take it a step further and reverse complement the nucleotide sequence, you can use the following function:

    library(stringi)
    
    rc <- function(nucSeq)
      return(stri_reverse(chartr("acgtACGT", "tgcaTGCA", nucSeq)))
    
    rc("AcACGTgtT")
    # [1] "AacACGTgT"
    
  • 2020-12-31 12:32

    There is also the seqinr package:

    library(seqinr)
    comp(seq)      # complement; seq is a vector of single characters
    rev(comp(seq)) # reverse complement
    

    Biostrings has a much smaller memory footprint, but seqinr is also convenient because you can choose the case of the bases (including mixed case) and change them to anything you want, for example a mix of T and U in the same sequence. Biostrings forces a sequence to use either T or U.
