How to concatenate files that have the same beginning of a name?

Question

I have a directory with a few hundred *.fasta files, such as:

Bonobo_sp._str01_ABC784267_CDE789456.fasta
Homo_sapiens_cc21_ABC897867_CDE456789.fasta
Homo_sapiens_cc21_ABC893673_CDE753672.fasta 
Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta
Gorilla_gorilla_ghjk6789_ABC627190_CDE891345.fasta
Gorilla_gorilla_ghjk6789_ABC117190_CDE661345.fasta

etc.

I want to concatenate files that belong to the same species, so in this case Homo_sapiens_cc21 and Gorilla_gorilla_ghjk6789.

Almost every species has a different number of files that I need to concatenate.

I know that I could use a simple loop in unix/linux like:

    for f in thesamename.fasta; do
        cat $f >> output.fasta
    done

But I don't know how to make the loop pick up only the files that share the same beginning. Doing this manually makes no sense with hundreds of files.

Does anybody have any idea how I could do that?

Answer 1:

I will assume that the logic behind the naming is that the species are the first three words separated by underscores. I will also assume that there are no blank spaces in the filenames.

A possible strategy is to get a list of all the species prefixes, and then concatenate all the files that share a given prefix into a single file:

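# List only names with at least three underscores, so merged outputs are skipped on a rerun.
for specie in $(ls *_*_*_*.fasta | cut -f1-3 -d_ | sort -u)
do
    # The "_" after the prefix keeps the output file out of its own input glob.
    cat "$specie"_*.fasta > "$specie.fasta"
done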

In this code, you list all the fasta files, cut out the species ID, and generate a unique list of species. Then you traverse this list and, for every species, concatenate all the files that start with that species ID into a single file named after the species.
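To see what the prefix-extraction step produces, the pipeline from the loop above can be run on a few of the file names from the question. This is only an illustration: printf stands in for the ls call, and the variable name species_list is made up for the example.

```shell
# File names copied from the question; printf stands in for `ls *.fasta`.
species_list=$(printf '%s\n' \
    Bonobo_sp._str01_ABC784267_CDE789456.fasta \
    Homo_sapiens_cc21_ABC897867_CDE456789.fasta \
    Homo_sapiens_cc21_ABC893673_CDE753672.fasta \
    Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta \
    Gorilla_gorilla_ghjk6789_ABC627190_CDE891345.fasta |
    cut -f1-3 -d_ | sort -u)   # keep fields 1-3, then drop duplicates

echo "$species_list"
# Bonobo_sp._str01
# Gorilla_gorilla_ghjk6789
# Homo_sapiens_cc21
```

Each line of that output becomes one value of the loop variable, which is why the approach assumes file names without blank spaces.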

More robust solutions can be written using find and avoiding ls, but they are more verbose and potentially less clear:

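# Relies on GNU find, cut and sort for the NUL-delimited options (-print0, -z).
while IFS= read -r -d '' specie
do
    # The "_" after the prefix keeps the output file out of its own input glob.
    cat "$specie"_*.fasta > "$specie.fasta"
done < <(find . -maxdepth 1 -name "*_*_*_*.fasta" -print0 | cut -z -f2 -d/ | cut -z -f1-3 -d_ | sort -zu)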
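A middle ground between the two loops above is to let the shell glob do the listing and to derive the prefix with parameter expansion, which needs neither ls nor the GNU-only -z options. The following is a minimal sketch under the same three-word assumption. The function name merge_by_species, the separate output directory, and the dummy sequences in the demo are all assumptions made for illustration, not part of the original answer.

```shell
# merge_by_species SRC_DIR OUT_DIR   (hypothetical helper name)
# Appends every SRC_DIR/<w1>_<w2>_<w3>_<anything>.fasta to OUT_DIR/<w1>_<w2>_<w3>.fasta
merge_by_species() {
    src=$1
    out=$2
    mkdir -p "$out"
    for f in "$src"/*_*_*_*.fasta; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=${f##*/}                # strip the directory part
        rest=${name#*_*_*_}          # everything after the third underscore
        prefix=${name%"_$rest"}      # the first three words
        cat "$f" >> "$out/$prefix.fasta"
    done
}

# Demo with dummy data in a temporary directory.
demo=$(mktemp -d)
printf '>h1\nACGT\n' > "$demo/Homo_sapiens_cc21_ABC897867_CDE456789.fasta"
printf '>h2\nTTGA\n' > "$demo/Homo_sapiens_cc21_ABC893673_CDE753672.fasta"
printf '>b1\nGGCC\n' > "$demo/Bonobo_sp._str01_ABC784267_CDE789456.fasta"
merge_by_species "$demo" "$demo/merged"
ls "$demo/merged"
```

Writing the merged files into their own directory keeps them from being picked up as input. Because the sketch appends with >>, the output directory should be emptied before a second run.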

Answer 2:

As stated in my comment above, if you know all your basenames and don't mind entering them explicitly, a simple solution would be

for f in Homo_sapiens_cc21_*.fasta; do
    cat "$f" >> Homo_sapiens_cc21.fasta
done

Since this is not the case, you need to find a common pattern by which to group the output. From your examples (EDIT: and your comment), it looks like this could be three words, each followed by an underscore.

Assuming this pattern is correct, this would probably do what you require:

# ">>" appends, so delete earlier merged files before running this again.
for f in *.fasta; do
    cat "$f" >> "$(echo "$f" | awk -F'_' '{print $1"_"$2"_"$3".fasta"}')"
done

Explanation:

List all the *.fasta files
Construct a file name from the prefix. We do this by piping through awk, telling it to split the input by _ (-F'_') and putting it back together ('{print $1"_"$2"_"$3".fasta"}')
Finally we cat the current file and redirect the output to the newly constructed file name
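The file-name construction described above can be checked on its own before running the full loop. The sketch below feeds one name from the question through the same awk command. It also feeds in an already merged name to show what would happen if the loop were run a second time in the same directory. The variable names target and rerun_target are made up for the example.

```shell
# A raw input file: the first three fields become the output name.
f=Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta
target=$(echo "$f" | awk -F'_' '{print $1"_"$2"_"$3".fasta"}')
echo "$target"          # Gorilla_gorilla_ghjk6789.fasta

# An already merged file: the third field is "cc21.fasta", so the suffix doubles.
g=Homo_sapiens_cc21.fasta
rerun_target=$(echo "$g" | awk -F'_' '{print $1"_"$2"_"$3".fasta"}')
echo "$rerun_target"    # Homo_sapiens_cc21.fasta.fasta
```

The second case suggests moving or deleting the merged files before repeating the loop, since `*.fasta` would otherwise match them as well.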

Source: https://stackoverflow.com/questions/53652718/how-to-concatenate-files-that-have-the-same-beginning-of-a-name

Tags

regex

loops

unix

bioinformatics

pattern-recognition