fasta

splitting a multiple FASTA file into separate files keeping their original names

我与影子孤独终老i submitted on 2019-12-08 14:12:19
Question: I am trying to work with an AWK script that was posted earlier on this forum. I am trying to split a large FASTA file containing multiple DNA sequences into separate FASTA files. Each sequence needs to go into its own FASTA file, and each new file must be named after the DNA sequence from the original, large multi-FASTA file (all the characters after the >). I tried this script that I found here at stackoverflow: awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT
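
The splitting task described above can be sketched in Python. This is a minimal sketch rather than the poster's AWK approach: the function name `split_fasta` and the `out_dir` parameter are hypothetical, it splits on every line starting with `>` (not only `>chr`), and replacing path separators in the header is an added assumption to keep the file names valid.

```python
import os


def split_fasta(path, out_dir="."):
    """Write each record of a multi-FASTA file to <header>.fa in out_dir.

    Returns the list of files written, in input order.
    """
    written = []
    out = None
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    if out is not None:
                        out.close()
                    # Everything after '>' becomes the file name. Path
                    # separators are replaced (an assumption, for safety).
                    name = line[1:].strip().replace("/", "_").replace(os.sep, "_")
                    target = os.path.join(out_dir, name + ".fa")
                    out = open(target, "w")
                    written.append(target)
                # Lines before the first header are ignored.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Only one output file is open at a time, so the input can be arbitrarily large. Two records with the same header would overwrite each other, which this sketch does not guard against.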

Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

眉间皱痕 submitted on 2019-12-08 13:09:55
Question: I don't know if this is just a quirk with Strawberry Perl, but I can't seem to get it to run. I just need to take a fasta and reverse every sequence in it. -The problem- I have a multifasta file: >seq1 ABCDEFG >seq2 HIJKLMN and the expected output is: >REVseq1 GFEDCBA >REVseq2 NMLKJIH The script is here: $NUM_COL = 80; ## set the column width of output file $infile = shift; ## grab input sequence file name from command line $outfile = "test1.txt"; ## name output file, prepend with “REV” open
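
For comparison, the expected output above can be produced with a short sketch in Python rather than Perl. The names `reverse_fasta` and `format_reversed` are hypothetical. The sequences are reversed character by character (not reverse-complemented), matching the example in the question, and wrapped at 80 columns like the script's `$NUM_COL`.

```python
def format_reversed(header, chunks, width):
    """Return output lines for one record: '>REV<name>' plus the reversed sequence."""
    seq = "".join(chunks)[::-1]
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return [">REV" + header] + wrapped


def reverse_fasta(lines, width=80):
    """Reverse every sequence in an iterable of FASTA lines."""
    out, header, chunks = [], None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                out.extend(format_reversed(header, chunks, width))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        out.extend(format_reversed(header, chunks, width))
    return out
```

Joining all lines of a record before reversing matters: reversing each wrapped line separately would scramble multi-line sequences.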

How to extract FASTA sequences from a file using sequence IDs in a different file?

空扰寡人 submitted on 2019-12-08 11:29:38
Question: I have two files: sequence.fasta - a big file with multiple FASTA sequences; ids.txt - consisting of sequence IDs in a tab-delimited format. I want to extract the sequences from sequence.fasta whose IDs match those in ids.txt into another file. A sample of sequence.fasta >AUP4056.1 MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
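
One way to approach this kind of lookup is sketched below in Python. The helper names `load_ids` and `filter_fasta` are hypothetical. Two assumptions are made because the excerpt does not show ids.txt: the ID is taken from the first tab-separated column of each line, and a FASTA record's ID is the first whitespace-delimited token after `>`.

```python
def load_ids(lines):
    """Collect IDs from the first tab-separated column of each line (an assumption)."""
    ids = set()
    for line in lines:
        first = line.rstrip("\n").split("\t")[0].strip()
        if first:
            ids.add(first)
    return ids


def filter_fasta(fasta_lines, wanted):
    """Yield the lines of every record whose ID is in `wanted`."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            # Decide once per record, then apply to its sequence lines.
            keep = bool(tokens) and tokens[0] in wanted
        if keep:
            yield line
```

Holding the IDs in a set keeps each header check constant-time, and the FASTA file is streamed line by line, so its size is not a concern.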

I need to search for a pattern in a header line of my file and concatenate the next line with Perl

主宰稳场 submitted on 2019-12-08 08:43:11
Question: My multi-fasta file is in this format: >miRNA65 dvex2345 CGATGCTAGATGCTATGACAACGATGCCTCG-G >miRNA60 dvex1234 T-TAA-ACTCATCATCATCATACTCATCATCATCATCAGCATATTAACAAG >miRNA65 dvex2345 T-TAA-ACTTATCATCATCATACTCATCATCATCATCAGCATATTAACAAG I am new to Perl and I need to find identical '>' header lines and concatenate the lines that follow them to join the sequences. I'm expecting the following output for the above file: >miRNA60 dvex1234 T-TAA-ACTCATCATCATCATACTCATCATCATCATCAGCATATTAACAAG >miRNA65 dvex2345 T-TAA
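
The grouping step can be sketched in Python rather than Perl. The function name `merge_by_header` is hypothetical. Because the expected output above is cut off, the ordering here is an assumption: headers are kept in order of first appearance and sequences are joined in the order they occur in the file.

```python
def merge_by_header(lines):
    """Map each distinct header line to the concatenation of all its sequences."""
    merged = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line
            # Create the entry on first sight, reuse it on repeats.
            merged.setdefault(header, [])
        elif header is not None:
            merged[header].append(line)
    return {h: "".join(parts) for h, parts in merged.items()}
```

If a different order is needed (for example sorted headers), sorting the resulting dictionary's items before writing them out would be enough.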

Unable to parse just sequences from FASTA file

二次信任 submitted on 2019-12-08 07:04:22
Question: How can I remove ids like '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n' from sequences? I have this code: with open('sequence.fasta', 'r') as f : while True: line1=f.readline() line2=f.readline() line3=f.readline() if not line3: break fct([line1[i:i+100] for i in range(0, len(line1), 100)]) fct([line2[i:i+100] for i in range(0, len(line2), 100)]) fct([line3[i:i+100] for i in range(0, len(line3), 100)]) Output: ['>gi|2765658|emb|Z78533.1|CIZ78533 C
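
Reading three lines at a time only works if every record has a fixed number of lines. A sketch that drops the header lines regardless of record length might look like the following. The names `sequences_only` and `chunked` are hypothetical, and the question's `fct` is not reproduced because its definition is not shown.

```python
def sequences_only(lines):
    """Yield one concatenated sequence per record, skipping '>' header lines."""
    chunks = []
    seen_header = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header ends the previous record.
            if seen_header:
                yield "".join(chunks)
            seen_header, chunks = True, []
        elif line and seen_header:
            chunks.append(line)
    if seen_header:
        yield "".join(chunks)


def chunked(seq, size=100):
    """Split a sequence into pieces of at most `size` characters."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]
```

Each sequence returned by `sequences_only` can then be passed through `chunked` to get the 100-character slices the original code builds.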

split a fasta file and rename on the basis of first line

こ雲淡風輕ζ submitted on 2019-12-06 12:54:24
Question: I have a huge file with the following content: filename: input.txt >chr1 jdlfnhl dh,ndh dnh. dhjl >chr2 dhfl dhl dh;l >chr3 shgl sgl >chr2_random dgld I need to split this file in such a way that I get four separate files as below: file 1: chr1.fa >chr1 jdlfnhl dh,ndh dnh. dhjl file 2: chr2.fa >chr2 dhfl dhl dh;l file 3: chr3.fa >chr3 shgl sgl file 4: chr2_random.fa >chr2_random dgld I tried csplit in Linux, but could not rename the files by the text immediately after ">". csplit -z input.txt '/>/' '{*

chaos game for DNA sequences

一曲冷凌霜 submitted on 2019-12-04 09:34:59
Question: I have tried the Mathematica code for making the chaos game for DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html which is like this: genome = Import["c:\data\sequence.fasta", "Sequence"]; genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}]; chars = StringCases[genome, "G" | "C" | "T" | "A"]; f[x_, "A"] := x/2; f[x_, "T"] := x/2 + {1/2, 0}; f[x_, "G"] := x/2 + {1/2, 1/2}; f[x_, "C"] := x/2 + {0, 1/2}; pts = FoldList[f, {0.5, 0.5}, chars];
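
The iteration in the Mathematica code can be mirrored in Python as a rough sketch. The names `CORNER_OFFSETS` and `chaos_game_points` are hypothetical. The offsets are copied from the four `f` definitions above, the starting point is included in the result as `FoldList` does, and characters other than uppercase A, T, G and C are skipped as `StringCases` would skip them.

```python
# Offset added after halving the current point, one entry per base,
# taken from the f[x_, ...] definitions in the question.
CORNER_OFFSETS = {
    "A": (0.0, 0.0),
    "T": (0.5, 0.0),
    "G": (0.5, 0.5),
    "C": (0.0, 0.5),
}


def chaos_game_points(sequence, start=(0.5, 0.5)):
    """Return the chaos game trajectory for a DNA string, including the start point."""
    points = [start]
    x, y = start
    for base in sequence:
        if base not in CORNER_OFFSETS:
            continue  # skip N, gaps, whitespace and anything else
        dx, dy = CORNER_OFFSETS[base]
        # Move halfway toward the corner assigned to this base.
        x, y = x / 2 + dx, y / 2 + dy
        points.append((x, y))
    return points
```

The returned list of (x, y) pairs can be handed to any scatter-plot routine to draw the picture.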

Regex to remove new lines up to a specific character

萝らか妹 submitted on 2019-12-04 06:15:17
Question: I have a series of strings in a file of the format: >HEADER_Text1 Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada >HEADER_Text2 Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada >HEADER_Text3 Information here, yada yada yada Some more information here, yada yada yada Even some more information here, yada yada yada I am trying to find a regex
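
The question is cut off before the desired output, so the following Python sketch rests on an assumption taken from the title: every newline is removed except those directly before a `>` and the one at the very end of the text. The function name `join_until_marker` and its `joiner` parameter are hypothetical.

```python
import re

# A newline that is NOT followed by '>' and is NOT the last character.
NEWLINE_INSIDE_RECORD = re.compile(r"\n(?!>|\Z)")


def join_until_marker(text, joiner=" "):
    """Collapse each record onto one line, keeping the breaks before '>'."""
    return NEWLINE_INSIDE_RECORD.sub(joiner, text)
```

The negative lookahead does the work: it inspects the character after each newline without consuming it, so the `>` itself is never touched. Pass `joiner=""` to glue the lines together with no separator.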

Reading in file block by block using specified delimiter in python

冷暖自知 submitted on 2019-12-04 03:42:44
Question: I have an input_file.fa file like this (FASTA format): > header1 description data data data >header2 description more data data data I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1: > header1 description data data data Of course I could just read in the file like this and split: with open("1.fa") as f: for block in f.read().split(">"): pass But I want to avoid reading the whole file into memory, because the
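
A generator is one way to get blocks without loading the whole file. The sketch below is a minimal version under the assumption that each block starts with a line beginning with the delimiter. The name `iter_blocks` is hypothetical.

```python
def iter_blocks(handle, delimiter=">"):
    """Yield one block at a time from an iterable of lines.

    A block is a delimiter line plus every line up to the next delimiter line.
    """
    block = []
    for line in handle:
        # A new delimiter line closes the block collected so far.
        if line.startswith(delimiter) and block:
            yield "".join(block)
            block = []
        block.append(line)
    if block:
        yield "".join(block)
```

Used as `with open("input_file.fa") as f: for block in iter_blocks(f): ...`, only the current block is held in memory, since file objects are themselves iterated line by line.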

How to use matchPattern() to find certain amino acids in a file with many sequences (.fasta) in R

可紊 submitted on 2019-12-03 21:40:32
I have a file (mydata.txt) that contains many exon sequences in FASTA format. I want to find the start ('atg') and stop ('taa','tga','tag') codons for each DNA sequence (considering the frame). I tried using matchPattern (a function from the Biostrings R package) to find these codons. As an example, mydata.txt could be: >a atgaatgctaaccccaccgagtaa >b atgctaaccactgtcatcaatgcctaa >c atggcatgatgccgagaggccagaataggctaa >d atggtgatagctaacgtatgctag >e atgccatgcgaggagccggctgccattgactag file=read.fasta(file="mydata.txt") matchPattern( "atg" , file) Note: read.fasta is a function in the seqinr package
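
The frame requirement is the part a plain pattern search misses, so here is a sketch of the in-frame scan in Python rather than R. The names `STOP_CODONS` and `find_orf` are hypothetical, and it assumes "considering the frame" means the frame set by the first 'atg' in each sequence.

```python
STOP_CODONS = {"taa", "tga", "tag"}


def find_orf(seq, start="atg"):
    """Locate the first start codon and the first stop codon in its frame.

    Returns (start_index, stop_index) with 0-based positions, (start_index, None)
    if no in-frame stop follows, or None if there is no start codon.
    """
    seq = seq.lower()
    begin = seq.find(start)
    if begin == -1:
        return None
    # Step three bases at a time so only in-frame codons are tested.
    for i in range(begin + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return begin, i
    return begin, None
```

For sequence `a` from the example, the 'taa' at position 8 is out of frame and is skipped, while the in-frame 'taa' at position 21 is reported.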