fasta

Extract rows and substrings from one file conditional on information of another file

為{幸葍}努か submitted on 2019-12-12 15:17:13
Question: I have a file 1.blast with coordinate information like this: 1 gnl|BL_ORD_ID|0 100.00 33 0 0 1 3 27620 gnl|BL_ORD_ID|0 95.65 46 2 0 1 46 35296 gnl|BL_ORD_ID|0 90.91 44 4 0 3 46 35973 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45 41219 gnl|BL_ORD_ID|0 100.00 27 0 0 1 27 46914 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45 and a file 1.fasta with sequence information like this: >1 TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG >2 GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC ... >100000 TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTG I

How to extract FASTA sequence using sequence ID (shell script)

南楼画角 submitted on 2019-12-12 09:22:54
Question: I have the following sequences in FASTA format, each with a sequence header followed by its sequence. How can I compare two files (Kcompare.pep and clade1i.txt) and extract the sequences with the same sequence header? Can anyone help me? Kcompare.pep: >ztr:MYCGRDRAFT_45998 MAAPLHAEGPIRTPYTGVELLNTPYLNKGTAFPADERRVLGLTALLPTSVHTLDQQLQRA WHQYQSRDNDLARNTFLTSLKEQNEVLYYRLVLDHLSEVFSIIYTPTEGEAIQRYSSLFR >kal:KALB_5042 MTAEVAVVSDGSAIPGASPPATLPLLQDYAELVREHAGLSAVPLAVDSARLAAELCALPK
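One possible approach, sketched in Python rather than shell: read the wanted IDs into a set, then keep only the FASTA records whose ID is in that set. The excerpt does not show the layout of clade1i.txt, so this assumes one ID per line (with or without a leading `>`) and that the ID is the first whitespace-delimited token of the header. The function names `parse_fasta`, `seq_id`, and `extract_by_id` are made up for this illustration.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def seq_id(header):
    """Assumption: the ID is the first whitespace-delimited token of the header."""
    parts = header.split()
    return parts[0] if parts else ""


def extract_by_id(fasta_lines, id_lines):
    """Return the (header, sequence) records whose ID appears in id_lines."""
    wanted = {seq_id(line.strip().lstrip(">")) for line in id_lines if line.strip()}
    return [(h, s) for h, s in parse_fasta(fasta_lines) if seq_id(h) in wanted]


# Small demo using the two records quoted in the question.
kcompare = [
    ">ztr:MYCGRDRAFT_45998",
    "MAAPLHAEGPIRTPYTGVELLNTPYLNKGTAFPADERRVLGLTALLPTSVHTLDQQLQRA",
    "WHQYQSRDNDLARNTFLTSLKEQNEVLYYRLVLDHLSEVFSIIYTPTEGEAIQRYSSLFR",
    ">kal:KALB_5042",
    "MTAEVAVVSDGSAIPGASPPATLPLLQDYAELVREHAGLSAVPLAVDSARLAAELCALPK",
]
for header, sequence in extract_by_id(kcompare, ["kal:KALB_5042"]):
    print(">" + header)
    print(sequence)
```

Both arguments can be any iterable of lines, so open file handles for Kcompare.pep and clade1i.txt can be passed directly.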

Using awk to find a pattern if a line starts with ">" and append the number of occurrences of the pattern

我的梦境 submitted on 2019-12-12 04:31:15
Question: I have been struggling with awk to figure out a way to find identical patterns and append a tag to each one showing which occurrence it is in the file. For example, if Spiroplasma_culicicola occurs 7 times, then the first occurrence should become Spiroplasma_culicicola_1, the second Spiroplasma_culicicola_2, the third Spiroplasma_culicicola_3, and so on. However, I have a FASTA file that looks like this: >Spiroplasma_taiwanense
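The numbering described here can be sketched with a running counter per header name. This is written in Python, not awk, and it assumes that every header gets a suffix, including names that occur only once; the excerpt does not say how unique names should be treated. `number_headers` is a hypothetical name.

```python
from collections import Counter


def number_headers(lines):
    """Append _1, _2, ... to each '>' header according to how often it has been seen."""
    seen = Counter()
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            name = line[1:]
            seen[name] += 1  # running count for this exact header text
            out.append(f">{name}_{seen[name]}")
        else:
            out.append(line)  # sequence lines pass through unchanged
    return out


example = [
    ">Spiroplasma_culicicola",
    "ACGT",
    ">Spiroplasma_taiwanense",
    "GGCC",
    ">Spiroplasma_culicicola",
    "TTAA",
]
print("\n".join(number_headers(example)))
```

The sequences in `example` are placeholders, not data from the question.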

R: read fasta files into data.frame using base R - NOT Biostrings (and the like)

瘦欲@ submitted on 2019-12-12 01:55:52
Question: How can one read FASTA files directly into a data frame in R using base code? These files store bio-sequence information (e.g. DNA or protein) and have 2*n lines for n individual bio-molecules (id1 through idn), and thus are of the type: >id1 #(always starts with a `>`) seq1 >id2 seq2 ... >idn seqn If one wants to stay in base R (instead of using dedicated packages like Biostrings and seqinr, which make use of novel classes for various manipulations of bio-sequences), how can one use e.g. read.table

Reading at three different frames

若如初见. submitted on 2019-12-11 19:27:27
Question: So I'm trying to create a class that reads a DNA string in three different frames: one that starts at position 0 (the first base), another that starts at position 1 (the second base), and a third that starts reading at position 2 (the third base). So far, this is what I've been playing around with: def codons(self, frame_one, frame_two, frame_three): start = frame_one while start + 3 <=len(self.seq): yield (self.seq[start:start+3], start) start += 3 start+1 = frame_two while start + 3 <
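A minimal sketch of one way to structure this: instead of passing three frame arguments and repeating the loop, take a single frame offset (0, 1, or 2) and call the method once per frame. The class name `DNA` and the helper `all_frames` are hypothetical; only `self.seq` and the `codons` method name come from the excerpt.

```python
class DNA:
    """Hypothetical wrapper around a DNA string, mirroring self.seq in the question."""

    def __init__(self, seq):
        self.seq = seq

    def codons(self, frame=0):
        """Yield (codon, start) for complete codons, starting at offset 0, 1, or 2."""
        if frame not in (0, 1, 2):
            raise ValueError("frame must be 0, 1, or 2")
        # Stop at len - 2 so that every slice is a full three-base codon.
        for start in range(frame, len(self.seq) - 2, 3):
            yield self.seq[start:start + 3], start

    def all_frames(self):
        """Return a dict mapping each frame offset to its list of (codon, start)."""
        return {frame: list(self.codons(frame)) for frame in (0, 1, 2)}


dna = DNA("ATGCCGTA")
for frame, found in dna.all_frames().items():
    print(frame, found)
```

Note that `start+1 = frame_two` in the original attempt is not valid Python, since an expression cannot be assigned to; parameterizing the offset avoids needing it at all.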

sed convert multiline block to single line (ex: fasta to phylip format)

大兔子大兔子 submitted on 2019-12-11 08:19:53
Question: In short: how can I convert from FASTA to a "phylip"-like format (without the sequence and residue counts at the top of the file) using sed? A FASTA file looks like this: >sequence1 AATCG GG-AT >sequence2 AGTCG GGGAT The number of lines per sequence may vary. I want to convert it to this: sequence1 AATCG GG-AT sequence2 AGTCG GGGAT My question seems simple, but I lack a real understanding of the advanced commands in sed, the multiline commands and the commands using the hold buffer. Here
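For comparison, the same transformation can be sketched in Python instead of sed. The flattened excerpt leaves it unclear whether the pieces of a sequence should be joined with a space or with nothing, so the `joiner` parameter below is an assumption and defaults to an empty string. `fasta_to_phylip_like` is a hypothetical name.

```python
def fasta_to_phylip_like(lines, joiner=""):
    """Return one 'name sequence' string per FASTA record.

    joiner is placed between the original sequence lines of a record;
    pass " " to keep them visually separated.
    """
    out = []
    name, parts = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                out.append(f"{name} {joiner.join(parts)}")
            name, parts = line[1:], []
        elif name is not None:
            parts.append(line)
    if name is not None:
        out.append(f"{name} {joiner.join(parts)}")  # flush the last record
    return out


fasta = [">sequence1", "AATCG", "GG-AT", ">sequence2", "AGTCG", "GGGAT"]
print("\n".join(fasta_to_phylip_like(fasta)))
```

Because each record is collected until the next `>` line, the number of sequence lines per record can vary, as the question requires.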

Remove multiple sequences from fasta file

风流意气都作罢 submitted on 2019-12-10 22:38:13
Question: I have a text file of character sequences, each consisting of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows: >header1 aaaaaaaaa >header2 bbbbbbbbbbb >header3 aaabbbaaaa [...] >headerN aaabbaabaa In another file I have a list of headers of sequences that I would like to remove, like this: >header1 >header5 >header12 [...] >header145 The idea is to remove these sequences from the first file, i.e. all these headers plus the line following each. I did
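A sketch of one way to do the removal in Python: load the unwanted headers into a set and skip each matching header together with the lines that follow it, up to the next header. It assumes the headers in the list file match the FASTA headers exactly, including the leading `>`, as in the example. `remove_records` is a hypothetical name.

```python
def remove_records(fasta_lines, header_lines):
    """Return the FASTA lines with every record listed in header_lines removed."""
    drop = {h.strip() for h in header_lines if h.strip()}
    out = []
    skipping = False
    for raw in fasta_lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Exact comparison, so '>header1' does not also match '>header12'.
            skipping = line.strip() in drop
        if not skipping:
            out.append(line)
    return out


fasta = [
    ">header1", "aaaaaaaaa",
    ">header2", "bbbbbbbbbbb",
    ">header12", "aaabbbaaaa",
]
print("\n".join(remove_records(fasta, [">header1"])))
```

Comparing whole headers through a set is the main design choice: a substring or prefix search for `>header1` would also hit `>header12` and `>header145`.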

Parse fasta sequence to the dictionary

北慕城南 submitted on 2019-12-10 11:45:07
Question: I need the most trivial solution to convert fasta.txt, containing multiple nucleotide sequences like >seq1 TAGATTCTGAGTTATCTCTTGCATTAGCAGGTCATCCTGGTCAAACCGCTACTGTTCCGG CTTTCTGATAATTGATAGCATACGCTGCGAACCCACGGAAGGGGGTCGAGGACAGTGGTG >seq2 TCCCTCTAGAGGCTCTTTACCGTGATGCTACATCTTACAGGTATTTCTGAGGCTCTTTCA AACAGGTGCGCGTGAACAACAACCCACGGCAAACGAGTACAGTGTGTACGCCTGAGAGTA >seq3 GGTTCCGCTCTAAGCCTCTAACTCCCGCACAGGGAAGAGATGTCGATTAACTTGCGCCCA TAGAGCTCTGCGCGTGCGTCGAAGGCTCTTTTCGCGATATCTGTGTGGTCTCACTTTGGT to the dictionary
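A minimal pure-Python sketch of such a conversion, assuming the desired dictionary maps each header (without the `>`) to its sequence with the wrapped lines joined; the excerpt is cut off before it shows the target structure. `fasta_to_dict` is a hypothetical name.

```python
def fasta_to_dict(lines):
    """Map each header (without '>') to its full sequence as a single string."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []  # a repeated header would replace the earlier record
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Shortened placeholder sequences, not the ones from the question.
demo = [">seq1", "TAGATT", "CTTTCT", ">seq2", "TCCCTC", "AACAGG"]
print(fasta_to_dict(demo))
```

With a real file, `with open("fasta.txt") as handle: records = fasta_to_dict(handle)` would work the same way, since the function only needs an iterable of lines.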

Biopython SeqIO to Pandas Dataframe

不问归期 submitted on 2019-12-09 04:55:14
Question: I have a FASTA file that can easily be parsed by SeqIO.parse. I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.): from Bio import SeqIO import pandas as pd # parse sequence fasta file identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")] lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")] #converting lists to

Using Interval tree to find overlapping regions

北城以北 submitted on 2019-12-08 14:28:18
Question: I have two files. File 1: chr1:4847593-4847993 TGCCGGAGGGGTTTCGATGGAACTCGTAGCA File 2: Pbsn|X|75083240|75098962| TTTACTACTTAGTAACACAGTAAGCTAAACAACCAGTGCCATGGTAGGCTTGAGTCAGCT CTTTCAGGTTCATGTCCATCAAAGATCTACATCTCTCCCCTGGTAGCTTAAGAGAAGCCA TGGTGGTTGGTATTTCCTACTGCCAGACAGCTGGTTGTTAAGTGAATATTTTGAAGTCC File 1 has approximately 8000 more lines, each with a different header and the sequence below it. I would first like to match the start and end coordinates from file 1 to file 2, or see if they are close to each other, let