Question
How can one read FASTA files directly into a data frame in R using only base code? These files store bio-sequence information (e.g. DNA or protein) and have 2*n lines for n individual bio-molecules (id1 through idn), so they have the following form:
>id1 #(always starts with a `>`)
seq1
>id2
seq2
...
>idn
seqn
If one wants to stay in base R (instead of using dedicated packages like Biostrings and seqinr, which rely on their own classes for various manipulations of bio-sequences), how can one use e.g. read.table to get a simple data frame with an id and a seq column?
Answer 1:
It certainly is possible in base R. Consider the following example and function:
# Demo data (CHNOSZ is loaded only to supply an example FASTA file)
library(CHNOSZ)
file <- system.file("extdata/fasta/EF-Tu.aln", package="CHNOSZ")
# Function
ReadFasta <- function(file) {
  # Read the file line by line
  fasta <- readLines(file)
  # Identify header lines (those that start with ">")
  ind <- grep("^>", fasta)
  # Identify the sequence lines that belong to each header
  s <- data.frame(ind = ind, from = ind + 1, to = c((ind - 1)[-1], length(fasta)))
  # Process sequence lines: join multi-line sequences into one string
  seqs <- rep(NA_character_, length(ind))
  for (i in seq_along(ind)) {
    seqs[i] <- paste(fasta[s$from[i]:s$to[i]], collapse = "")
  }
  # Create a data frame, stripping only the leading ">" from each header
  DF <- data.frame(name = sub("^>", "", fasta[ind]), sequence = seqs, stringsAsFactors = FALSE)
  # Return the data frame as a result object from the function
  return(DF)
}
# Usage example
seqs<-ReadFasta(file)
However, be warned: the function does not currently handle, e.g., special characters, which are rather commonplace in sequence files (in contexts such as 5' or #5 rRNA).
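One way to see how a reader copes with such headers is to run it on a small hand-made file. The following is a minimal sketch of an alternative, not the function above: `read_fasta_df` is a hypothetical name, and the sketch assumes the file fits in memory, that every header starts with `>` in the first column, and that any lines before the first header can be ignored. It groups lines by a running header count instead of looping over index ranges.

```r
# Hypothetical helper for illustration: read a FASTA file into a data frame
# with one row per record, using base R only.
read_fasta_df <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]              # drop blank lines
  is_header <- startsWith(lines, ">")
  ids <- sub("^>", "", lines[is_header])     # strip only the leading ">"
  # Record number of each line: increases by one at every header
  record <- cumsum(is_header)
  # Keep sequence lines that follow at least one header
  keep <- !is_header & record > 0
  groups <- split(lines[keep], record[keep])
  seqs <- character(length(ids))             # "" for headers without sequence
  seqs[as.integer(names(groups))] <-
    unname(vapply(groups, paste, character(1), collapse = ""))
  data.frame(id = ids, seq = seqs, stringsAsFactors = FALSE)
}

# Small made-up example file with quotes and "#" in a header
tmp <- tempfile(fileext = ".fasta")
writeLines(c(
  ">id1 5' end, #5 rRNA",
  "ACGT",
  "TTGA",
  ">id2",
  "MKV"
), tmp)

df <- read_fasta_df(tmp)
print(df)
```

Because `readLines` treats every line as plain text, the quote and `#` in the first header pass through unchanged in this sketch, and the two sequence lines of `id1` are joined into a single string.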
Source: https://stackoverflow.com/questions/26843995/r-read-fasta-files-into-data-frame-using-base-r-not-biostrings-and-the-like