fasta | 易学教程

splitting a multiple FASTA file into separate files keeping their original names

Question: I am trying to work with an AWK script that was posted earlier on this forum. I am trying to split a large FASTA file containing multiple DNA sequences into separate FASTA files. I need to separate each sequence into its own FASTA file, and the name of each new FASTA file needs to be the name of the DNA sequence from the original, large multi-FASTA file (all the characters after the >). I tried this script that I found here at Stack Overflow: awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT
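The task described above (one output file per record, named after the text following ">") can be sketched in Python rather than AWK. This is a minimal sketch, not the script from the original thread: the function name `split_fasta`, the `out_dir` parameter, and the replacement of characters that are unsafe in file names with "_" are all assumptions made for illustration.

```python
import os
import re


def split_fasta(path, out_dir="."):
    """Write each record of a multi-FASTA file to its own .fa file.

    Returns the paths of the files written, in input order.
    """
    written = []
    out = None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Everything after '>' becomes the file name. Replacing
                # unusual characters with '_' is an assumption of this sketch.
                name = re.sub(r"[^\w.\-]+", "_", line[1:].strip())
                target = os.path.join(out_dir, name + ".fa")
                out = open(target, "w")
                written.append(target)
            # Lines before the first header are ignored.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Note that two records with the same header would map to the same file name, and the later one would overwrite the earlier one.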

Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

Question: I don't know if this is just a quirk with Strawberry Perl, but I can't seem to get it to run. I just need to take a FASTA file and reverse every sequence in it. The problem: I have a multi-FASTA file: >seq1 ABCDEFG >seq2 HIJKLMN and the expected output is: >REVseq1 GFEDCBA >REVseq2 NMLKJIH The script is here: $NUM_COL = 80; ## set the column width of output file $infile = shift; ## grab input sequence file name from command line $outfile = "test1.txt"; ## name output file, prepend with “REV” open
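For comparison, the same transformation can be sketched in Python. The question asks for Perl, so this is only an illustration of the logic, not a fix for the truncated script: the names `read_fasta` and `reverse_fasta` are hypothetical, and the 80-column wrap mirrors the `$NUM_COL = 80` setting quoted above. It reverses each sequence (it does not reverse-complement it).

```python
def read_fasta(lines):
    """Yield (header_without_gt, sequence) pairs from an iterable of lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def reverse_fasta(lines, width=80):
    """Return output lines with 'REV' prefixed to each ID and each sequence reversed."""
    out = []
    for header, seq in read_fasta(lines):
        out.append(">REV" + header)
        rev = seq[::-1]
        # Wrap the reversed sequence at `width` columns.
        out.extend(rev[i:i + width] for i in range(0, len(rev), width))
    return out
```

Calling `reverse_fasta(open("input.fasta"))` and writing the returned lines joined with newlines would produce the output file; the file name here is a placeholder.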

How to extract FASTA sequences from a file using sequence IDs in a different file?

Question: I have two files: sequence.fasta - a big file with multiple FASTA sequences; ids.txt - consisting of sequence IDs in a tab-delimited format. I want to extract the sequences in sequence.fasta whose IDs match those in ids.txt into another file. A sample of sequence.fasta >AUP4056.1 MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
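One way to approach this is to load the IDs into a set and stream through the FASTA file once, copying a record only when its ID is in the set. The sketch below makes two assumptions that the excerpt does not confirm: every tab-separated field in ids.txt is treated as an ID, and a record's ID is taken to be the first whitespace-delimited token after ">". The function names are hypothetical.

```python
def load_ids(id_lines):
    """Collect every non-empty tab-separated field as a sequence ID."""
    ids = set()
    for line in id_lines:
        ids.update(field.strip() for field in line.split("\t") if field.strip())
    return ids


def filter_fasta(fasta_lines, wanted):
    """Yield the lines of every record whose ID is in `wanted`."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # The ID is assumed to be the first token of the header line.
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line
```

Because the FASTA file is read line by line, memory use depends on the size of ids.txt rather than on the size of sequence.fasta.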

I need to search for a pattern in a header line of my file and concatenate the next line with Perl

Question: My multi-FASTA file is in this format: >miRNA65 dvex2345 CGATGCTAGATGCTATGACAACGATGCCTCG-G >miRNA60 dvex1234 T-TAA-ACTCATCATCATCATACTCATCATCATCATCAGCATATTAACAAG >miRNA65 dvex2345 T-TAA-ACTTATCATCATCATACTCATCATCATCATCAGCATATTAACAAG I am new to Perl and I need to find the identical '>' header lines and concatenate the lines that follow them to join the sequences. I'm expecting the following output for the above file: >miRNA60 dvex1234 T-TAA-ACTCATCATCATCATACTCATCATCATCATCAGCATATTAACAAG >miRNA65 dvex2345 T-TAA
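The grouping step can be sketched with a dictionary keyed on the full header line. This is a Python illustration rather than the Perl the question asks for, and because the expected output above is cut off, the ordering is an assumption: records are returned in order of first appearance and sequences are joined in input order. The function name is hypothetical.

```python
def merge_duplicate_headers(lines):
    """Return (header, joined_sequence) pairs, merging records with identical headers."""
    merged = {}  # dicts keep insertion order in Python 3.7+
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line
            merged.setdefault(header, [])
        elif header is not None:
            merged[header].append(line)
    return [(h, "".join(parts)) for h, parts in merged.items()]
```

If a different record order is needed, the returned list can be sorted before writing, for example with `sorted(...)` on the header.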

Unable to parse just sequences from FASTA file

Question: How can I remove IDs like '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n' from sequences? I have this code: with open('sequence.fasta', 'r') as f : while True: line1=f.readline() line2=f.readline() line3=f.readline() if not line3: break fct([line1[i:i+100] for i in range(0, len(line1), 100)]) fct([line2[i:i+100] for i in range(0, len(line2), 100)]) fct([line3[i:i+100] for i in range(0, len(line3), 100)]) Output: ['>gi|2765658|emb|Z78533.1|CIZ78533 C
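Reading three lines at a time assumes every record spans a fixed number of lines, which FASTA does not guarantee. A sketch that instead keys on the ">" marker, drops the header lines, and then cuts each sequence into 100-character pieces (as the quoted code does) might look like this. The function name `sequence_chunks` is hypothetical, and what `fct` does with each list is not shown in the excerpt, so the sketch simply returns the lists.

```python
def sequence_chunks(lines, size=100):
    """Return one list of `size`-character pieces per record; headers are dropped."""
    sequences = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; its text is discarded.
            if current is not None:
                sequences.append("".join(current))
            current = []
        elif current is not None and line:
            current.append(line)
    if current is not None:
        sequences.append("".join(current))
    return [[seq[i:i + size] for i in range(0, len(seq), size)] for seq in sequences]
```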

split a fasta file and rename on the basis of first line

Question: I have a huge file with the following content: filename: input.txt >chr1 jdlfnhl dh,ndh dnh. dhjl >chr2 dhfl dhl dh;l >chr3 shgl sgl >chr2_random dgld I need to split this file in such a way that I get four separate files as below: file 1: chr1.fa >chr1 jdlfnhl dh,ndh dnh. dhjl file 2: chr2.fa >chr2 dhfl dhl dh;l file 3: chr3.fa >chr3 shgl sgl file 4: chr2_random.fa >chr2_random dgld I tried csplit on Linux, but could not rename the files by the text immediately after ">". csplit -z input.txt '/>/' '{*

chaos game for DNA sequences

Question: I have tried the Mathematica code for making the chaos game for DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html which is like this: genome = Import["c:\data\sequence.fasta", "Sequence"]; genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}]; chars = StringCases[genome, "G" | "C" | "T" | "A"]; f[x_, "A"] := x/2; f[x_, "T"] := x/2 + {1/2, 0}; f[x_, "G"] := x/2 + {1/2, 1/2}; f[x_, "C"] := x/2 + {0, 1/2}; pts = FoldList[f, {0.5, 0.5}, chars];
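The iteration in the quoted code halves the current point and adds a per-base offset, starting from (0.5, 0.5). A Python sketch of that same fold, using the offsets from the `f` definitions above, could look like the following. The names `CORNER_OFFSETS` and `chaos_game_points` are hypothetical; like the `StringCases` pattern, the sketch skips any character other than uppercase A, C, G, and T.

```python
# Offsets taken from the f[x_, ...] definitions in the quoted Mathematica code.
CORNER_OFFSETS = {
    "A": (0.0, 0.0),
    "T": (0.5, 0.0),
    "G": (0.5, 0.5),
    "C": (0.0, 0.5),
}


def chaos_game_points(sequence, start=(0.5, 0.5)):
    """Return the list of points visited, including the starting point."""
    x, y = start
    points = [(x, y)]
    for base in sequence:
        if base not in CORNER_OFFSETS:
            continue  # skip N, gaps, lowercase letters, and so on
        dx, dy = CORNER_OFFSETS[base]
        x, y = x / 2 + dx, y / 2 + dy
        points.append((x, y))
    return points
```

The returned points can be passed to any scatter-plot routine to draw the picture.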

Regex to remove new lines up to a specific character

Question: I have a series of strings in a file of the format: >HEADER_Text1 Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada >HEADER_Text2 Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada >HEADER_Text3 Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada I am trying to find a regex
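The question is cut off before the desired output, so the following is only one plausible reading: join all lines of a block into one line by replacing every newline that is not followed by ">" (and is not the final newline) with a space. In Python's `re` syntax this can be written with a negative lookahead; the function name and the choice of a space as the joining character are assumptions.

```python
import re


def join_block_lines(text, marker=">"):
    """Replace newlines with spaces unless the next character starts a new block."""
    # (?!...) is a negative lookahead: the newline is kept when it is followed
    # by the marker or by the end of the text.
    pattern = r"\n(?!" + re.escape(marker) + r"|\Z)"
    return re.sub(pattern, " ", text)
```

If the header line should stay on its own line, a different pattern is needed, since this one also joins the header with the text that follows it.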

Reading in file block by block using specified delimiter in python

Question: I have an input_file.fa file like this (FASTA format): > header1 description data data data >header2 description more data data data I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1: > header1 description data data data Of course I could just read in the file like this and split: with open("1.fa") as f: for block in f.read().split(">"): pass But I want to avoid reading the whole file into memory, because the
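A generator that accumulates lines until the next line beginning with the delimiter keeps only one block in memory at a time. The sketch below is one way to do that; the name `iter_blocks` is hypothetical, and it assumes the delimiter appears at the start of a line, as in the sample above.

```python
def iter_blocks(handle, delimiter=">"):
    """Yield one block (header line plus its data lines) at a time."""
    block = []
    for line in handle:
        # A line starting with the delimiter closes the previous block.
        if line.startswith(delimiter) and block:
            yield "".join(block)
            block = []
        block.append(line)
    if block:
        yield "".join(block)
```

Used as `for block in iter_blocks(open("input_file.fa")): ...`, it reads the file lazily because a file object yields one line per iteration. Any lines that appear before the first header end up in a block of their own.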

How to use matchPattern() to find certain codons in a file with many sequences (.fasta) in R

I have a file (mydata.txt) that contains many exon sequences in FASTA format. I want to find the start ('atg') and stop ('taa', 'tga', 'tag') codons for each DNA sequence (considering the frame). I tried using matchPattern (a function from the Biostrings R package) to find these codons. As an example, mydata.txt could be: >a atgaatgctaaccccaccgagtaa >b atgctaaccactgtcatcaatgcctaa >c atggcatgatgccgagaggccagaataggctaa >d atggtgatagctaacgtatgctag >e atgccatgcgaggagccggctgccattgactag file=read.fasta(file="mydata.txt") matchPattern( "atg" , file) Note: read.fasta is a function in the seqinr package
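To make "considering the frame" concrete, here is a small Python sketch of the scan itself. It does not use Biostrings or seqinr and is not a replacement for `matchPattern`; it only illustrates stepping through a sequence three bases at a time so that out-of-frame matches are ignored. The function name `codon_positions`, the `frame` parameter, and the 1-based positions are choices made for this illustration.

```python
STOP_CODONS = {"taa", "tga", "tag"}


def codon_positions(seq, frame=0):
    """Return (starts, stops): 1-based positions of in-frame start and stop codons."""
    seq = seq.lower()
    starts, stops = [], []
    # Step by 3 from the frame offset so only whole in-frame codons are tested.
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "atg":
            starts.append(i + 1)
        elif codon in STOP_CODONS:
            stops.append(i + 1)
    return starts, stops
```

For sequence a from the example, `codon_positions("atgaatgctaaccccaccgagtaa")` reports a start at position 1 and a stop at position 22; the "atg" beginning at position 5 and the "taa" beginning at position 9 are skipped because they are out of frame.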