fasta

Remove duplicated fasta sequence (bash or biopython method)

岁酱吖の submitted on 2019-12-18 09:52:00
Question: Hello, I have a FASTA file such as:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence2 [virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence3
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
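The question is cut off before it says which copy should survive, so the following is only a minimal Python sketch under two assumptions: "duplicate" means an identical sequence string, and the first record seen for each sequence is the one to keep. The helper names `read_fasta` and `drop_duplicate_sequences` are hypothetical, and plain Python is used here rather than the bash or Biopython route named in the title.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicate_sequences(records):
    """Keep only the first record seen for each distinct sequence (assumption)."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique
```

A different tie-break, such as preferring headers that contain `_CP`, would need a rule for choosing among the records that share a sequence, which the visible part of the question does not state.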

fasta: delete sequences after n length

大城市里の小女人 submitted on 2019-12-13 19:33:30
Question: I have multiple FASTA files, each containing thousands of sequences of varying length. I would like to keep only the first 200 (n) bases from each sequence. How can I do this in Perl?

Answer 1: It is difficult to understand exactly what you mean without seeing an example, but if you only need the first 200 characters per line, just use cut:
cut -c1-200 file

Answer 2: If the sequence is printed on several physical lines, only print up through the 200th character. A line starting with a wedge is a header line, which
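The idea in the second answer (count characters across the physical lines of a record and reset at each header) can be sketched as follows. This is Python rather than the Perl the question asks for, and the function name `truncate_fasta` and its parameter `n` are hypothetical.

```python
def truncate_fasta(lines, n=200):
    """Yield FASTA lines, keeping at most n sequence characters per record."""
    remaining = n
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            remaining = n  # new record: reset the character budget
            yield line
        elif remaining > 0:
            piece = line[:remaining]
            remaining -= len(piece)
            if piece:
                yield piece
```

Unlike `cut -c1-200`, this handles sequences wrapped over several lines, because the budget is tracked per record rather than per line.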

Add multiple sequences from a FASTA file to a list in python

旧城冷巷雨未停 submitted on 2019-12-13 15:42:46
Question: I'm trying to organize a file with multiple sequences. In doing so, I'm trying to add the names to a list and add the sequences to a separate list that is parallel with the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequences that follow each name into separate list entries. I tried appending the lines of sequence to an empty string, but that appended all the lines of all the sequences into a single string. All the names start with a '>'. def Name
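One way to avoid collecting everything into a single string is to start a new, empty sequence entry each time a '>' line is seen and append later lines to the most recent entry. The sketch below is not the asker's code, which is cut off; the function name `parse_fasta` is hypothetical.

```python
def parse_fasta(lines):
    """Return two parallel lists: record names and their full sequences."""
    names, seqs = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            names.append(line[1:])
            seqs.append("")      # open a fresh sequence for this name
        elif seqs:
            seqs[-1] += line     # extend only the most recent sequence
    return names, seqs
```

With this layout, `names[i]` and `seqs[i]` always refer to the same record.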

Select sequences in a fasta file with more than 300 aa and “C” occurs at least 4 times

故事扮演 submitted on 2019-12-13 14:28:10
Question: I have a FASTA file which contains protein sequences. I'd like to select sequences with more than 300 amino acids in which the amino acid cysteine (C) appears more than 4 times. I've used this command to select sequences with more than 300 aa:
cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'
Some sequence example:
>jgi|Triasp1|216614|CE216613_3477
MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
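The two conditions can be combined in one predicate. The sketch below is plain Python rather than bioawk, and assumes the records have already been parsed into (name, sequence) pairs. The title says "at least 4 times" while the body says "more than 4 times", so the threshold is left as a parameter: the hypothetical `min_cys` defaults to 4, and 5 would give "more than 4".

```python
def select_records(records, min_len_exclusive=300, min_cys=4):
    """Keep (name, seq) pairs longer than min_len_exclusive with >= min_cys 'C'."""
    kept = []
    for name, seq in records:
        if len(seq) > min_len_exclusive and seq.upper().count("C") >= min_cys:
            kept.append((name, seq))
    return kept
```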

How do I get gene features in FASTA nucleotide format from NCBI using Perl?

与世无争的帅哥 submitted on 2019-12-13 13:11:34
Question: I am able to manually download a FASTA file that looks like:
>lcl|CR543861.1_gene_1...
ATGCTTTGGACA...
>lcl|CR543861.1_gene_2...
GTGCGACTAAAA...
by clicking "Send to" and selecting "Gene Features" on this page; FASTA Nucleotide is the only option (which is fine, because that's all I want). With a script like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::EUtilities;
my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch', -db => 'nucleotide', -id => 'CR543861', -rettype =>

Multifasta header trimming

烈酒焚心 submitted on 2019-12-13 09:18:51
Question: I have a multifasta file and I need to delete part of the header of every sequence. For example:
>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA
I need to delete the part after

Automatically rename fasta files with the ID of the first sequence in each file

别来无恙 submitted on 2019-12-13 04:22:26
Question: I have multiple FASTA files in the same directory, each containing a single sequence. I want to rename each file with the header of the sequence it contains. When I run my code, I get "Substitution pattern not terminated at (user-supplied code)". My code:
#!/bin/bash
for i in /home/maryem/files/; do
if [ ! -f $i ]; then echo "skipping $i"; else newname=`head -1 $i | sed 's/^\s*\([a-zA-Z0-9]\+\).*$/\1/'`; [ -n "$newname" ] ; mv -i $i $newname.fasta || echo "error at: $i"; fi;
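For comparison, the same renaming can be sketched in Python with `pathlib`. This is not a fix of the bash script above. It assumes the files match `*.fasta` and that the ID is the first run of letters and digits after the '>' (mirroring the character class in the `sed` expression). The function names are hypothetical.

```python
import re
from pathlib import Path


def header_id(header_line):
    """Return the first run of letters/digits after '>', or None if absent."""
    match = re.match(r">\s*([A-Za-z0-9]+)", header_line)
    return match.group(1) if match else None


def rename_by_first_header(directory, pattern="*.fasta"):
    """Rename each matching file to <ID>.fasta; return (old, new) name pairs."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            seq_id = header_id(handle.readline())
        if seq_id is None:
            continue  # first line is not a FASTA header
        target = path.with_name(seq_id + ".fasta")
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

Skipping existing targets plays the role of `mv -i` in a non-interactive way; two files with the same ID would leave the second one untouched.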

loop over ids from two FASTA files

半城伤御伤魂 submitted on 2019-12-12 22:33:23
Question: I have two FASTA files with multiple sequences:
cat file1.fasta
>1
ACGTCGAT
>2
ACTTTATT
>3
ACGGGG
cat file2.fasta
>1
CCGGAGC
>2
TGTCAGTC
>3
CTACGTCTT
I also have a list of IDs for each FASTA file that I want to use to extract specific sequences by ID, make a two-sequence FASTA, and then perform some operations (align, calculate distance). Lists:
cat file1.list
1
3
cat file2.list
2
1
In reality these FASTA files and lists are thousands of sequences/lines long. I am trying to loop over each line in the
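The question is cut off before it says how the two lists relate, so the sketch below assumes they are paired by position: line i of file1.list goes with line i of file2.list. It also assumes each FASTA file has already been loaded into a dict from ID to sequence. The function name `paired_fasta` is hypothetical; the data are the example values from the question.

```python
def paired_fasta(seqs1, seqs2, ids1, ids2):
    """Yield one two-record FASTA string per (id1, id2) pair, paired by position."""
    for id1, id2 in zip(ids1, ids2):
        yield f">{id1}\n{seqs1[id1]}\n>{id2}\n{seqs2[id2]}\n"


# Example data copied from the question.
file1 = {"1": "ACGTCGAT", "2": "ACTTTATT", "3": "ACGGGG"}
file2 = {"1": "CCGGAGC", "2": "TGTCAGTC", "3": "CTACGTCTT"}
pairs = list(paired_fasta(file1, file2, ["1", "3"], ["2", "1"]))
```

Each yielded string could then be written to a temporary file or piped to an aligner. Loading the sequences into dicts once avoids rescanning the FASTA files for every pair, which matters when the lists are thousands of lines long.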

How to search and isolate attributes of FASTA formatted text in R

☆樱花仙子☆ submitted on 2019-12-12 20:23:39
Question: I have a FASTA-formatted file, which is essentially a special text file containing many entries. One entry looks like the one below, which I have assigned to the name "FASTA" in R. The original file was read and formatted as seen below using the seqinr package in R.
FASTA<- structure(list(`tr|A1Z6G9|A1Z6G9_DROME` = structure(

Filtering a FASTA file based on sequence with BioPython

痞子三分冷 submitted on 2019-12-12 16:30:21
Question: I have a FASTA file. From that file, I need to get only the sequences containing GTACAGTAGG and CAACGGTTTTGCC at the end and/or start of the sequence and put them in a new FASTA file. So here's an example:
>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/2516_3269
***GTACAGTAGG***GTACACACAGAACGCGACAAGGCCAGGCGCTGGAGGAACTCCAGCAGCTAGATGCAAGCGACTA
TCAGAGCGTTGGGTCCAGAACGAAGAACAGTCACTCAAGACTGCTTT***CAACGGTTTTGCC***
>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7
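The "end and/or start" wording is ambiguous, so this sketch makes an assumption: following the example record, GTACAGTAGG is expected at the start and CAACGGTTTTGCC at the end, and a hypothetical `require_both` flag switches between "either" and "both". It is a plain-Python predicate rather than Biopython code; it could be applied to the sequence string of each parsed record.

```python
START_MOTIF = "GTACAGTAGG"
END_MOTIF = "CAACGGTTTTGCC"


def has_flanking_motifs(seq, require_both=False):
    """True if seq starts with START_MOTIF and/or ends with END_MOTIF."""
    seq = seq.upper()
    at_start = seq.startswith(START_MOTIF)
    at_end = seq.endswith(END_MOTIF)
    if require_both:
        return at_start and at_end
    return at_start or at_end
```

If either motif may appear at either end, the checks would need to test both motifs against both ends; the visible part of the question does not settle this.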