Question
I tagged python and perl in this only because that's what I've used thus far. If anyone knows a better way to go about this I'd certainly be willing to try it out. Anyway, my problem:
I need to create an input file for a gene prediction program, in the following format:
seq1 5 15
seq1 20 34
seq2 50 48
seq2 45 36
seq3 17 20
Where seq# is the gene ID and the numbers to the right are the positions of exons within an open reading frame. I have this information in a .gff3 file that also contains a lot of other data. I can open it in Excel and easily delete the columns with non-relevant data. Here's how it's arranged now:
PITG_00002 . gene 2 397 . + . ID=g.1;Name=ORF%
PITG_00002 . mRNA 2 397 . + . ID=m.1;
**PITG_00002** . exon **2 397** . + . ID=m.1.exon1;
PITG_00002 . CDS 2 397 . + . ID=cds.m.1;
PITG_00004 . gene 1 1275 . + . ID=g.3;Name=ORF%20g
PITG_00004 . mRNA 1 1275 . + . ID=m.3;
**PITG_00004** . exon **1 1275** . + . ID=m.3.exon1;P
PITG_00004 . CDS 1 1275 . + . ID=cds.m.3;P
PITG_00004 . gene 1397 1969 . + . ID=g.4;Name=
PITG_00004 . mRNA 1397 1969 . + . ID=m.4;
**PITG_00004** . exon **1397 1969** . + . ID=m.4.exon1;
PITG_00004 . CDS 1397 1969 . + . ID=cds.m.4;
So I need only the data that is in bold. For example,
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
Any help you could give would be greatly appreciated, thanks!
Edit: Well I messed up the formatting. Anything that is between the **'s is what I need lol.
Answer 1:
It looks like your data is tab-separated.
This Perl program will print columns 1, 4 and 5 from all records that have exon
in the third column. You need to change the file name in the open
statement to your actual file name.
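    use strict;
    use warnings;

    # Change 'genes.gff3' to the path of your actual file.
    open my $fh, '<', 'genes.gff3' or die $!;

    while (<$fh>) {
        chomp;
        my @fields = split /\t/;
        # Keep only records with at least five columns and 'exon' in column 3.
        next unless @fields >= 5 and $fields[2] eq 'exon';
        # Print the sequence ID, start and end (columns 1, 4 and 5).
        print join("\t", @fields[0,3,4]), "\n";
    }

Output: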
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
Answer 2:
In Unix:
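    # -E enables extended regular expressions, where + means "one or more".
    # [[:space:]] matches both tabs and spaces, so either separator works.
    grep -E '[[:space:]]exon[[:space:]]' file.gff3 |
      sed -E 's/^([^[:space:]]+)[[:space:]]+[.][[:space:]]+exon[[:space:]]+([0-9]+)[[:space:]]+([0-9]+).*$/\1 \2 \3/'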
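To see what the three capture groups in the sed expression pick out, here is a rough Python equivalent using the re module. It is only a sketch: the name exon_coords is made up for illustration, and the pattern assumes the second column is always "." as in the sample data.

```python
import re

# Roughly the same pattern as the sed expression: sequence ID, a literal ".",
# the word "exon", then the start and end positions.
EXON_LINE = re.compile(r"^(\S+)\s+\.\s+exon\s+(\d+)\s+(\d+)")

def exon_coords(line):
    """Return (seqid, start, end) as strings for an exon line, else None."""
    match = EXON_LINE.match(line)
    return match.groups() if match else None
```

Non-exon records (gene, mRNA, CDS) give None, so they can simply be skipped.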
Answer 3:
For pedestrians:
(this is Python)
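    data_file = 'file.gff3'  # placeholder: path to your GFF3 file

    with open(data_file) as f:
        for line in f:
            tokens = line.split()
            # Need at least five columns, since tokens[4] is read below.
            if len(tokens) > 4 and tokens[2] == 'exon':
                print(tokens[0], tokens[3], tokens[4])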
which prints
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
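Since the goal is an input file for another program, the same loop can write its result to a file instead of printing it. The following is a minimal sketch, not a definitive implementation: the function name extract_exons, its sep parameter and the choice to skip "#" comment lines are assumptions added for illustration.

```python
def extract_exons(gff_path, out_path, sep=" "):
    """Write seqid, start and end of every exon record to out_path.

    Returns the number of lines written.
    """
    written = 0
    with open(gff_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Assumption: lines starting with '#' are GFF3 comments or directives.
            if line.startswith("#"):
                continue
            tokens = line.split()
            # At least five columns are needed to read tokens[4].
            if len(tokens) > 4 and tokens[2] == "exon":
                dst.write(sep.join((tokens[0], tokens[3], tokens[4])) + "\n")
                written += 1
    return written
```

For example, extract_exons('file.gff3', 'data.txt') would create or overwrite data.txt with one exon per line (both file names are placeholders). Pass sep="\t" if the gene prediction program expects tab-separated input.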
Answer 4:
Here's a Perl script option, run as perl scriptName.pl file.gff3:
    use strict;
    use warnings;

    while (<>) {
        print "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/;
    }
Output:
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
Or you could just do the following:
perl -n -e 'print "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/' file.gff3
To save the data to a file:
    use strict;
    use warnings;

    open my $inFH,  '<',  'file.gff3' or die $!;
    # '>>' appends to data.txt; use '>' to overwrite it on each run.
    open my $outFH, '>>', 'data.txt'  or die $!;

    while (<$inFH>) {
        print $outFH "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/;
    }
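One caveat: /exon/ matches the word anywhere on the line, not just in the feature-type column. That works for the sample data, but a record of another type whose attributes happen to mention an exon would also be selected. The Python sketch below shows the difference. The CDS line in it is hypothetical (its Parent attribute is invented for illustration), and both function names are made up.

```python
def is_exon_by_substring(line):
    """Mimics the /exon/ test: true if 'exon' appears anywhere on the line."""
    return "exon" in line

def is_exon_by_column(line):
    """True only if the third whitespace-separated column is exactly 'exon'."""
    fields = line.split()
    return len(fields) > 2 and fields[2] == "exon"

# Hypothetical CDS record whose attribute column mentions an exon ID.
cds_line = "PITG_00002\t.\tCDS\t2\t397\t.\t+\t.\tID=cds.m.1;Parent=m.1.exon1"

# The substring test accepts this CDS line; the column test rejects it.
substring_hit = is_exon_by_substring(cds_line)
column_hit = is_exon_by_column(cds_line)
```

If your file could contain such records, testing the third column is the safer choice.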
Source: https://stackoverflow.com/questions/14286480/extracting-specific-data-from-a-file-and-writing-it-to-another-file