Loop over IDs from two FASTA files

Question

I have two FASTA files, each with multiple sequences:

cat file1.fasta
>1
ACGTCGAT
>2
ACTTTATT
>3
ACGGGG

cat file2.fasta
>1
CCGGAGC
>2
TGTCAGTC
>3
CTACGTCTT

I also have a list of IDs for each FASTA file. I want to use these lists to extract specific sequences by ID, build a two-sequence FASTA file, and then perform some operations on it (align, calculate distance).

Lists:

cat file1.list
1
3
cat file2.list
2
1

In reality, these FASTA files and lists are thousands of sequences/lines long.

I am trying to loop over each line in the lists, extract the FASTA record that matches that particular ID, and then combine the record from each file into a two-sequence FASTA file that can be aligned, etc. Basically, I want a pairwise alignment of each sequence with its "pair".

So based on the example here and the order of the ID lists, I want to pair sequence 1 from file1.fasta with sequence 2 from file2.fasta, then move on to the next pair (sequence 3 from file1.fasta and sequence 1 from file2.fasta), and so on. Extracting FASTA sequences by ID is relatively easy and there are a few ways to do it. One is faOneRecord, which takes the FASTA file you want to extract from and the record/ID you want to find, and returns the header and sequence:

faOneRecord <in.fa> <recordName>

So, after the first iteration of the loop, I would have this file, built from the first line of each ID list:

>1
ACGTCGAT
>2
TGTCAGTC

and so on.

I would think this is relatively easy to do, but I can't seem to get there. Once I have that two-sequence FASTA in each iteration, I want to align it, get distance estimates, write them to a file and move on to the next iteration. That part may take some work and requires specific programs, but for now I just need help producing the two-sequence FASTA for each pair of IDs.

I guess the major question is how to loop over the IDs and pass them as arguments to the faOneRecord command.

This might be too specific, and if so I apologize, but any ideas on how to get started would be helpful and much appreciated.

Answer 1:

Here's an (incomplete) sketch of a Python solution. As I said in the comment, there are two steps:

First, read both files into lists. If you are sure they look exactly like your example (IDs numbered consecutively from 1, one sequence line per record), you can just ignore the >x lines:

fasta1 = ['']  # placeholder, so that the first sequence is saved to fasta1[1], not fasta1[0]
for line in open('file1.fasta'):
    if not line.startswith('>'):
        fasta1.append(line.strip())

The for line in open() just opens the file and iterates over its lines.
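
If the IDs are not consecutive integers starting at 1, or a sequence wraps over several lines, indexing a list by number no longer works. A minimal sketch of a more forgiving reader, assuming the ID is the header text up to the first whitespace; read_fasta is a name made up for this example, not a library function:

```python
def read_fasta(path):
    """Map each record ID to its sequence.

    Assumption: the ID is the header text after '>' up to the first
    whitespace. Sequence lines belonging to one record are joined.
    """
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                parts = line[1:].split()
                current_id = parts[0] if parts else ""
                records[current_id] = ""
            elif current_id is not None:
                # append this line to the record currently being read
                records[current_id] += line
    return records
```

With the file1.fasta from the question this gives {'1': 'ACGTCGAT', '2': 'ACTTTATT', '3': 'ACGGGG'}. The keys are strings, so a line from a list file can be looked up by its stripped text, without an int() conversion.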

Do the same for file2. Then you can read the two list files in parallel, convert each line to a number and print the matching sequences:

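for l1, l2 in zip(open('file1.list'), open('file2.list')):
    print(fasta1[int(l1)])  # sequence from file1.fasta named on this line of file1.list
    print(fasta2[int(l2)])  # its partner from file2.fasta (indexed by l2, not l1)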

zip takes the two files and reads them in parallel, so that on the first iteration l1 and l2 hold the first line of file1.list and file2.list, respectively; on the second iteration they hold the second line of each, and so on.
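
To get from printed sequences to the two-sequence files the question asks for, the same zip loop can write one file per pair. What follows is only a sketch: write_pairs, fetch_record and the pair_N.fasta naming are invented for illustration, and fetch_record assumes faOneRecord is on the PATH and prints the record to standard output, following the usage line quoted in the question.

```python
import subprocess
from pathlib import Path


def fetch_record(fasta_path, record_id):
    """Return one FASTA record as text by running faOneRecord.

    Assumption: faOneRecord is installed, is on the PATH, and writes
    the header and sequence to standard output.
    """
    result = subprocess.run(
        ["faOneRecord", str(fasta_path), record_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def write_pairs(list1, list2, fasta1, fasta2, out_dir, fetch=fetch_record):
    """Write one two-record FASTA file per pair of lines in the ID lists."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(list1) as ids1, open(list2) as ids2:
        # zip pairs line n of list1 with line n of list2
        for n, (id1, id2) in enumerate(zip(ids1, ids2), start=1):
            id1, id2 = id1.strip(), id2.strip()
            if not id1 or not id2:
                continue  # skip the pair if either line is blank
            first = fetch(fasta1, id1).rstrip("\n")
            second = fetch(fasta2, id2).rstrip("\n")
            pair_path = out_dir / f"pair_{n}.fasta"
            pair_path.write_text(first + "\n" + second + "\n")
            written.append(pair_path)
    return written
```

Something like write_pairs('file1.list', 'file2.list', 'file1.fasta', 'file2.fasta', 'pairs') would then leave pair_1.fasta, pair_2.fasta and so on in pairs/, each ready to hand to an aligner. Passing a different fetch function (for example a lookup in an in-memory dict) avoids starting one process per record, which may matter with thousands of IDs.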

Source: https://stackoverflow.com/questions/42334644/loop-over-ids-from-two-fasta-files

Tags

bash

loops

pipe

fasta