fasta | 易学教程

Extract rows and substrings from one file conditional on information of another file

How to extract FASTA sequence using sequence ID (shell script)

Question: I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I compare two files (Kcompare.pep and clade1i.txt) and extract the sequences with the same sequence header? Can anyone help me?
Kcompare.pep
>ztr:MYCGRDRAFT_45998
MAAPLHAEGPIRTPYTGVELLNTPYLNKGTAFPADERRVLGLTALLPTSVHTLDQQLQRA
WHQYQSRDNDLARNTFLTSLKEQNEVLYYRLVLDHLSEVFSIIYTPTEGEAIQRYSSLFR
>kal:KALB_5042
MTAEVAVVSDGSAIPGASPPATLPLLQDYAELVREHAGLSAVPLAVDSARLAAELCALPK
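One possible approach to the question above, sketched in Python rather than shell. It assumes that clade1i.txt holds one sequence ID per line (its format is not shown in the excerpt) and that an ID is the header text up to the first space. The function names `read_fasta` and `extract_by_id` are hypothetical helpers introduced for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_by_id(fasta_lines, id_lines):
    """Return the records whose ID appears in id_lines.

    Assumption: one ID per line, with or without a leading '>'.
    """
    wanted = {line.strip().lstrip(">") for line in id_lines if line.strip()}
    return [
        (header, seq)
        for header, seq in read_fasta(fasta_lines)
        if header.partition(" ")[0] in wanted
    ]
```

With real files, both arguments can be open file handles, for example `extract_by_id(open("Kcompare.pep"), open("clade1i.txt"))`.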

using awk to find pattern if line starts with “>” and add at the end of it the number of occurrences of the pattern

Question: I have been struggling with awk to figure out a way to find identical patterns and add a tag at the end of them showing how many times they are present in the file. For example, if Spiroplasma_culicicola occurs 7 times, then the first occurrence should be written as Spiroplasma_culicicola_1, the second as Spiroplasma_culicicola_2, the third as Spiroplasma_culicicola_3, and so on. However, I have a FASTA file that looks like this:
>Spiroplasma_taiwanense
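The running-count idea in this question can be sketched in Python (not awk, which the question asks about). This is a minimal illustration that assumes every header line gets a suffix, including names that occur only once; `number_headers` is a hypothetical name.

```python
from collections import defaultdict


def number_headers(lines):
    """Append _1, _2, ... to each header line, counting repeats of the same header."""
    counts = defaultdict(int)
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            counts[line] += 1                   # occurrences of this header so far
            line = f"{line}_{counts[line]}"
        out.append(line)                        # sequence lines pass through unchanged
    return out
```

Only lines starting with `>` are counted, so a species name that happens to appear inside a sequence line is left alone.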

R: read fasta files into data.frame using base R - NOT Biostrings (and the like)

Question: How can one read FASTA files directly into a data frame in R using base code? These files store bio-sequence information (e.g. DNA or protein) and have 2*n lines for n individual bio-molecules (id1 through idn), and thus are of the type:
>id1 #(always starts with a `>`)
seq1
>id2
seq2
...
>idn
seqn
If one wants to stay in base R (instead of dedicated packages like Biostrings and seqinr, which make use of novel classes for various manipulations of bio-sequences), how can you use e.g. read.table

Reading at three different frames

Question: So I'm trying to create a class that reads a DNA string in three different frames: one that starts at position 0 (the first base), another that starts at position 1 (the second base), and a third that starts reading at position 2 (the third base). So far, this is what I've been playing around with:
def codons(self, frame_one, frame_two, frame_three):
    start = frame_one
    while start + 3 <=len(self.seq):
        yield (self.seq[start:start+3], start)
        start += 3
    start+1 = frame_two
    while start + 3 <
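A minimal sketch of reading in three frames, written as plain functions rather than the class method in the excerpt. Passing the frame offset as a single argument avoids the need for three separate loops. The names `codons` (reused from the question) and `three_frames` are illustrative, and incomplete trailing codons are assumed to be dropped.

```python
def codons(seq, frame):
    """Yield (codon, start) for every complete codon in the given frame (0, 1 or 2)."""
    for start in range(frame, len(seq) - 2, 3):
        yield seq[start:start + 3], start


def three_frames(seq):
    """Return a dict mapping each frame offset to its list of (codon, start) pairs."""
    return {frame: list(codons(seq, frame)) for frame in range(3)}
```

For example, `three_frames("ATGCGTA")[1]` gives the codons that start at the second base of the string.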

sed convert multiline block to single line (ex: fasta to phylip format)

Question: In short: how to convert from FASTA to a "phylip"-like format (without the sequence and residue counts at the top of the file) using sed? A FASTA file looks like this:
>sequence1
AATCG
GG-AT
>sequence2
AGTCG
GGGAT
The number of lines of a sequence may vary. I want to convert it to this:
sequence1 AATCG GG-AT
sequence2 AGTCG GGGAT
My question seems simple, but I am lacking a real understanding of the advanced commands in sed, the multiline commands and the commands using the hold buffer. Here
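The conversion described above can be sketched in Python instead of sed. This version assumes the desired output is one record per line, with the name followed by the sequence chunks separated by single spaces; `fasta_to_phylip_like` is a hypothetical name.

```python
def fasta_to_phylip_like(lines):
    """Collapse each multi-line FASTA record into 'name chunk1 chunk2 ...'."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:]])      # start a new record with the bare name
        elif records:
            records[-1].append(line)        # attach sequence chunk to current record
    return [" ".join(record) for record in records]
```

Because each record is accumulated in a list, the number of sequence lines per record can vary freely.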

Remove multiple sequences from fasta file

Question: I have a text file of character sequences, each consisting of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In another file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, so all these headers + the following line. I did
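A minimal Python sketch of the removal described above. It compares whole header lines exactly, so `>header1` does not accidentally match `>header12`. The name `remove_records` is hypothetical, and the list file is assumed to contain one header per line, including the leading `>`.

```python
def remove_records(fasta_lines, headers_to_remove):
    """Return the FASTA lines with every listed header and its sequence dropped."""
    drop = {h.strip() for h in headers_to_remove if h.strip()}
    kept = []
    skipping = False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Decide once per record whether to skip it.
            skipping = line.strip() in drop
        if not skipping:
            kept.append(line)
    return kept
```

The `skipping` flag stays set until the next header, so the sketch also works when a sequence spans more than one line.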

Parse fasta sequence to the dictionary

Question: I need the most trivial solution to convert fasta.txt containing multiple nucleotide sequences like
>seq1
TAGATTCTGAGTTATCTCTTGCATTAGCAGGTCATCCTGGTCAAACCGCTACTGTTCCGG
CTTTCTGATAATTGATAGCATACGCTGCGAACCCACGGAAGGGGGTCGAGGACAGTGGTG
>seq2
TCCCTCTAGAGGCTCTTTACCGTGATGCTACATCTTACAGGTATTTCTGAGGCTCTTTCA
AACAGGTGCGCGTGAACAACAACCCACGGCAAACGAGTACAGTGTGTACGCCTGAGAGTA
>seq3
GGTTCCGCTCTAAGCCTCTAACTCCCGCACAGGGAAGAGATGTCGATTAACTTGCGCCCA
TAGAGCTCTGCGCGTGCGTCGAAGGCTCTTTTCGCGATATCTGTGTGGTCTCACTTTGGT
to the dictionary
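One simple way to build such a dictionary using only the standard library, offered as a sketch. It assumes the desired result maps each header (without `>`) to its sequence joined into a single string, and that a repeated header overwrites the earlier entry. `fasta_to_dict` is a hypothetical name.

```python
def fasta_to_dict(lines):
    """Map each FASTA header (without '>') to its concatenated sequence."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []            # a repeated header replaces the earlier record
        elif name is not None:
            parts[name].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}
```

With a file on disk this could be called as `fasta_to_dict(open("fasta.txt"))`, since a file handle iterates over its lines.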

Biopython SeqIO to Pandas Dataframe

Question: I have a FASTA file that can easily be parsed by SeqIO.parse. I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's waaaay too heavy (two iterations, conversions, etc.)
from Bio import SeqIO
import pandas as pd
# parse sequence fasta file
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
#converting lists to

Using Interval tree to find overlapping regions

Question: I have two files.
File 1
chr1:4847593-4847993
TGCCGGAGGGGTTTCGATGGAACTCGTAGCA
File 2
Pbsn|X|75083240|75098962|
TTTACTACTTAGTAACACAGTAAGCTAAACAACCAGTGCCATGGTAGGCTTGAGTCAGCT
CTTTCAGGTTCATGTCCATCAAAGATCTACATCTCTCCCCTGGTAGCTTAAGAGAAGCCA
TGGTGGTTGGTATTTCCTACTGCCAGACAGCTGGTTGTTAAGTGAATATTTTGAAGTCC
File 1 has approximately 8000 more lines with different headers and sequences below them. I would first like to match the start and end coordinates from file 1 to file 2, or see if they are close to each other let
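As a starting point, the two header formats shown above can be parsed and compared with a plain pairwise overlap check. This sketch is not an interval tree. It assumes closed intervals, that a `chr` prefix should be stripped so that chromosome names are comparable, and that "close" means within a caller-chosen `slack` in bases. All function names here are hypothetical.

```python
def parse_region(header):
    """Parse a File 1 header such as 'chr1:4847593-4847993' into (chrom, start, end)."""
    chrom, _, span = header.strip().partition(":")
    start, _, end = span.partition("-")
    if chrom.startswith("chr"):
        chrom = chrom[3:]               # assumption: 'chr1' and '1' name the same chromosome
    return chrom, int(start), int(end)


def parse_gene(header):
    """Parse a File 2 header such as 'Pbsn|X|75083240|75098962|' into (name, chrom, start, end)."""
    name, chrom, start, end = header.strip().split("|")[:4]
    return name, chrom, int(start), int(end)


def overlaps(start_a, end_a, start_b, end_b, slack=0):
    """True if the closed intervals overlap or lie within `slack` bases of each other."""
    return start_a <= end_b + slack and start_b <= end_a + slack
```

Checking every File 1 region against every File 2 gene is quadratic in the number of records; an interval tree, as the title suggests, is one way to speed up the lookup once this basic comparison works.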