问题
I am trying to make a function in R which could calculate the frequency of each codon. We know that methionine is an amino acid which could be formed by only one set of codon ATG so its percentage in every set of sequence is 1. Where as Glycine could be formed by GGT, GGC, GGA, GGG hence the percentage of occurring of each codon will be 0.25. The input would be in a DNA sequence like-ATGGGTGGCGGAGGG and with the help of codon table it could calculate the percentage of each occurrence in an input.
please help me by suggesting ways to make this function.
for example, if my argument is ATGTGTTGCTGG then, my result would be
ATG=1
TGT=0.5
TGC=0.5
TGG=1
Data for R:
codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T",
ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K",
AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L",
CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P",
CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R",
CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V",
GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D",
GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G",
GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F",
TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop",
TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")
回答1:
First, I get my lookup list and sequence.
codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T",
ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K",
AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L",
CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P",
CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R",
CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V",
GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D",
GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G",
GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F",
TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop",
TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")
MySeq <- "ATGTGTTGCTGG"
Next, I load the stringi
library and break the sequence into chunks of three characters.
# Load library
library(stringi)
# Break into 3 bases
seq_split <- stri_sub(MySeq, seq(1, stri_length(MySeq), by=3), length=3)
Then, I count the letters that these three base chunks correspond to using table
.
# Get associated letters
letter_count <- table(unlist(codon[seq_split]))
Finally, I bind the sequences together with the reciprocal of the count and rename my data frame columns.
# Bind into a data frame
res <- data.frame(seq_split,
1/letter_count[match(unlist(codon[seq_split]), names(letter_count))])
# Rename columns
colnames(res) <- c("Sequence", "Letter", "Percentage")
# Sequence Letter Percentage
#1 ATG M 1.0
#2 TGT C 0.5
#3 TGC C 0.5
#4 TGG W 1.0
回答2:
A slightly different path leads to this solution:
f0 <- function(dna, weight) {
codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))
tibble(id = seq_along(codons), codons = codons) %>%
unnest() %>%
mutate(weight = as.vector(wt[codons]))
}
First, codon
is just a named vector, not list; here are the weights
codon <- unlist(codon)
weight <- setNames(1 / table(codon)[codon], names(codon))
Second, probably there is a vector of DNA sequences, rather than one.
dna <- c("ATGTGTTGCTGG", "GGTCGTTGTGTA")
To develop the solution, codons can be found by searching for any nucleotide [ACGT]
repeated {3}
times
codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))
It seems like it is then convenient to do operations in the tidyverse, creating a tibble (data.frame) where id
indicates which sequence the codon is from
library(tidyverse)
tbl <- tibble(id = seq_along(codons), codon = codons) %>% unnest()
and then add the weights
tbl <- mutate(tbl, weight = as.vector(weight[codon]))
so we have
> tbl
# A tibble: 8 x 3
id codon weight
<int> <chr> <dbl>
1 1 ATG 1
2 1 TGT 0.5
3 1 TGC 0.5
4 1 TGG 1
5 2 GGT 0.25
6 2 CGT 0.167
7 2 TGT 0.5
8 2 GTA 0.25
Standard tidyverse operations could be used for further summary, in particular when the same codon appears multiple times
tbl %>% group_by(id, codon) %>%
summarize(wt = sum(weight))
回答3:
Two things to solve here:
convert
codon
to the fractions for each letter( fracs <- 1/table(unlist(codon)) ) # A C D E F G H I # 0.2500000 0.5000000 0.5000000 0.5000000 0.5000000 0.2500000 0.5000000 0.3333333 # K L M N P Q R S # 0.5000000 0.1666667 1.0000000 0.5000000 0.2500000 0.5000000 0.1666667 0.1666667 # stop T V W Y # 0.3333333 0.2500000 0.2500000 1.0000000 0.5000000 codonfracs <- setNames(lapply(codon, function(x) unname(fracs[x])), names(codon)) str(head(codonfracs)) # List of 6 # $ ATA: num 0.333 # $ ATC: num 0.333 # $ ATT: num 0.333 # $ ATG: num 1 # $ ACA: num 0.25 # $ ACC: num 0.25
convert the sequence string to a vector of length-3 substrings
s <- 'ATGTGTTGCTGG' strsplit3 <- function(s, k=3) { starts <- seq.int(1, nchar(s), by=k) stops <- c(starts[-1] - 1, nchar(s)) mapply(substr, s, starts, stops, USE.NAMES=FALSE) } strsplit3(s) # [1] "ATG" "TGT" "TGC" "TGG"
From here, it's just a lookup:
codonfracs[ strsplit3(s) ]
# $ATG
# [1] 1
# $TGT
# [1] 0.5
# $TGC
# [1] 0.5
# $TGG
# [1] 1
EDIT
Since you want the status of the other codons, try this:
x <- codonfracs
x[ ! names(x) %in% strsplit3(s) ] <- 0
str(x)
# List of 64
# $ ATA: num 0
# $ ATC: num 0
# $ ATT: num 0
# $ ATG: num 1
# $ ACA: num 0
# $ ACC: num 0
# $ ACG: num 0
# ...snip...
# $ TAT: num 0
# $ TAA: num 0
# $ TAG: num 0
# $ TGC: num 0.5
# $ TGT: num 0.5
# $ TGA: num 0
# $ TGG: num 1
来源:https://stackoverflow.com/questions/50655313/how-to-find-specific-frequency-of-a-codon