How to find specific frequency of a codon?

问题

I am trying to make a function in R which could calculate the frequency of each codon. We know that methionine is an amino acid which could be formed by only one set of codon ATG so its percentage in every set of sequence is 1. Where as Glycine could be formed by GGT, GGC, GGA, GGG hence the percentage of occurring of each codon will be 0.25. The input would be in a DNA sequence like-ATGGGTGGCGGAGGG and with the help of codon table it could calculate the percentage of each occurrence in an input.

please help me by suggesting ways to make this function.

for example, if my argument is ATGTGTTGCTGG then, my result would be

ATG=1
TGT=0.5
TGC=0.5
TGG=1

Data for R:

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
    ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
    AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
    CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
    CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
    CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
    GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
    GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
    GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
    TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
    TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")

回答1:

First, I get my lookup list and sequence.

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
              ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
              AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
              CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
              CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
              CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
              GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
              GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
              GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
              TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
              TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")

MySeq <- "ATGTGTTGCTGG"

Next, I load the stringi library and break the sequence into chunks of three characters.

# Load library
library(stringi)

# Break into 3 bases
seq_split <- stri_sub(MySeq, seq(1, stri_length(MySeq), by=3), length=3)

Then, I count the letters that these three base chunks correspond to using table.

# Get associated letters
letter_count <- table(unlist(codon[seq_split]))

Finally, I bind the sequences together with the reciprocal of the count and rename my data frame columns.

# Bind into a data frame
res <- data.frame(seq_split,
                  1/letter_count[match(unlist(codon[seq_split]), names(letter_count))])

# Rename columns
colnames(res) <- c("Sequence", "Letter", "Percentage")

#  Sequence Letter Percentage
#1      ATG      M        1.0
#2      TGT      C        0.5
#3      TGC      C        0.5
#4      TGG      W        1.0

回答2:

A slightly different path leads to this solution:

f0 <- function(dna, weight) {
    codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))
    tibble(id = seq_along(codons), codons = codons) %>%
        unnest() %>%
        mutate(weight = as.vector(wt[codons]))
}

First, codon is just a named vector, not list; here are the weights

codon <- unlist(codon)
weight <- setNames(1 / table(codon)[codon], names(codon))

Second, probably there is a vector of DNA sequences, rather than one.

dna <- c("ATGTGTTGCTGG", "GGTCGTTGTGTA")

To develop the solution, codons can be found by searching for any nucleotide [ACGT] repeated {3} times

codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))

It seems like it is then convenient to do operations in the tidyverse, creating a tibble (data.frame) where id indicates which sequence the codon is from

library(tidyverse)
tbl <- tibble(id = seq_along(codons), codon = codons) %>% unnest()

and then add the weights

tbl <- mutate(tbl, weight = as.vector(weight[codon]))

so we have

> tbl
# A tibble: 8 x 3
     id codon weight
  <int> <chr>  <dbl>
1     1 ATG    1    
2     1 TGT    0.5  
3     1 TGC    0.5  
4     1 TGG    1    
5     2 GGT    0.25 
6     2 CGT    0.167
7     2 TGT    0.5  
8     2 GTA    0.25

Standard tidyverse operations could be used for further summary, in particular when the same codon appears multiple times

tbl %>% group_by(id, codon) %>%
    summarize(wt = sum(weight))

回答3:

Two things to solve here:

convert codon to the fractions for each letter

( fracs <- 1/table(unlist(codon)) )
#         A         C         D         E         F         G         H         I 
# 0.2500000 0.5000000 0.5000000 0.5000000 0.5000000 0.2500000 0.5000000 0.3333333 
#         K         L         M         N         P         Q         R         S 
# 0.5000000 0.1666667 1.0000000 0.5000000 0.2500000 0.5000000 0.1666667 0.1666667 
#      stop         T         V         W         Y 
# 0.3333333 0.2500000 0.2500000 1.0000000 0.5000000 
codonfracs <- setNames(lapply(codon, function(x) unname(fracs[x])), names(codon))
str(head(codonfracs))
# List of 6
#  $ ATA: num 0.333
#  $ ATC: num 0.333
#  $ ATT: num 0.333
#  $ ATG: num 1
#  $ ACA: num 0.25
#  $ ACC: num 0.25

convert the sequence string to a vector of length-3 substrings

s <- 'ATGTGTTGCTGG'

strsplit3 <- function(s, k=3) {
  starts <- seq.int(1, nchar(s), by=k)
  stops <- c(starts[-1] - 1, nchar(s))
  mapply(substr, s, starts, stops, USE.NAMES=FALSE)
}
strsplit3(s)
# [1] "ATG" "TGT" "TGC" "TGG"

From here, it's just a lookup:

codonfracs[ strsplit3(s) ]
# $ATG
# [1] 1
# $TGT
# [1] 0.5
# $TGC
# [1] 0.5
# $TGG
# [1] 1

EDIT

Since you want the status of the other codons, try this:

x <- codonfracs
x[ ! names(x) %in% strsplit3(s) ] <- 0
str(x)
# List of 64
#  $ ATA: num 0
#  $ ATC: num 0
#  $ ATT: num 0
#  $ ATG: num 1
#  $ ACA: num 0
#  $ ACC: num 0
#  $ ACG: num 0
# ...snip...
#  $ TAT: num 0
#  $ TAA: num 0
#  $ TAG: num 0
#  $ TGC: num 0.5
#  $ TGT: num 0.5
#  $ TGA: num 0
#  $ TGG: num 1

来源：https://stackoverflow.com/questions/50655313/how-to-find-specific-frequency-of-a-codon

标签

bioinformatics

dna-sequence