R: read fasta files into data.frame using base R - NOT Biostrings (and the like)

Submitted by 瘦欲@ on 2019-12-12 01:55:52

Question


How can one read FASTA files directly into a data frame in R using only base code? These files store biological sequence information (e.g. DNA or protein) and have 2*n lines for n individual biomolecules (id1 through idn), so they look like this:

>id1 #(always starts with a `>`) 
seq1
>id2
seq2
...
>idn
seqn

If one wants to stay in base R (instead of using dedicated packages such as Biostrings and seqinr, which introduce their own classes for manipulating bio-sequences), how can one use e.g. read.table to get a simple data frame with an id column and a seq column?


Answer 1:


It certainly is possible in base R. Consider the following example and function:

# Demo data
library(CHNOSZ)
file <- system.file("extdata/fasta/EF-Tu.aln", package="CHNOSZ")

# Function
ReadFasta<-function(file) {
   # Read the file line by line
   fasta<-readLines(file)
   # Identify header lines (lines that start with ">")
   ind<-grep("^>", fasta)
   # Identify the sequence lines
   s<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], length(fasta)))
   # Process sequence lines
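   seqs<-rep(NA_character_, length(ind))
   for(i in seq_along(ind)) {
      # A header followed directly by another header has no sequence lines
      seqs[i]<-if(s$from[i] <= s$to[i]) paste(fasta[s$from[i]:s$to[i]], collapse="") else ""
   }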
   # Create a data frame (strip only the leading ">", keep columns as character)
   DF<-data.frame(name=sub("^>", "", fasta[ind]), sequence=seqs, stringsAsFactors=FALSE)
   # Return the data frame as a result object from the function
   return(DF)
}

# Usage example
seqs<-ReadFasta(file)

However, be warned: the function does not currently handle special characters, which are fairly common in sequence files (in contexts such as 5' or #5 rRNA).
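
On the point about special characters: read.table by default treats ' as a quote character and # as a comment character, which is one reason a readLines-based approach tends to be easier for FASTA headers. The following is a minimal sketch of a vectorized variant of the loop above, not the original answer's code. The helper name read_fasta_lines and the demo records are hypothetical, and it assumes every record starts with a header line beginning with ">" and that blank lines can be ignored.

```r
# Hypothetical helper (not from the original answer): turn FASTA text lines
# into a data frame by grouping sequence lines under the preceding header.
read_fasta_lines <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]      # drop blank lines
  is_header <- startsWith(lines, ">")        # header lines start with ">"
  if (!any(is_header)) {
    return(data.frame(name = character(0), sequence = character(0),
                      stringsAsFactors = FALSE))
  }
  record <- cumsum(is_header)                # record number of every line
  keep <- !is_header & record > 0            # sequence lines that follow a header
  # Explicit factor levels keep records that have a header but no sequence
  groups <- factor(record[keep], levels = seq_len(sum(is_header)))
  seqs <- vapply(split(lines[keep], groups), paste, character(1), collapse = "")
  data.frame(name = sub("^>", "", lines[is_header]),
             sequence = unname(seqs),
             stringsAsFactors = FALSE)
}

# Made-up demo records with ' and # in the headers and a multi-line sequence
demo_lines <- c(">id1 5' end fragment",
                "ACGT",
                "TTGA",
                ">id2 #5 rRNA",
                "GGCC")
tmp <- tempfile(fileext = ".fasta")
writeLines(demo_lines, tmp)
fasta_df <- read_fasta_lines(readLines(tmp))
unlink(tmp)
print(fasta_df)
```

Because readLines returns each line verbatim, the quote and hash characters in the headers end up unchanged in the name column, and sequences that span several lines are joined into one string per record.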



Source: https://stackoverflow.com/questions/26843995/r-read-fasta-files-into-data-frame-using-base-r-not-biostrings-and-the-like
