Question
I have a directory with a few hundred *.fasta files, such as:
Bonobo_sp._str01_ABC784267_CDE789456.fasta
Homo_sapiens_cc21_ABC897867_CDE456789.fasta
Homo_sapiens_cc21_ABC893673_CDE753672.fasta
Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta
Gorilla_gorilla_ghjk6789_ABC627190_CDE891345.fasta
Gorilla_gorilla_ghjk6789_ABC117190_CDE661345.fasta
etc.
I want to concatenate files that belong to the same species, so in this case Homo_sapiens_cc21 and Gorilla_gorilla_ghjk6789.
Almost every species has a different number of files that I need to concatenate.
I know that I could use a simple loop in unix/linux like:
for f in thesamename.fasta; do
    cat "$f" >> output.fasta
done
But I don't know how to make the loop pick up only the files that share the same beginning. Doing this manually is not practical with hundreds of files.
Does anybody have an idea how I could do that?
Answer 1:
I will assume that the logic behind the naming is that the species are the first three words separated by underscores. I will also assume that there are no blank spaces in the filenames.
A possible strategy is to build a list of all the species prefixes, and then concatenate all the files that share a given prefix into a single file:
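# Loop over the unique three-word prefixes (assumes no whitespace in file names)
for specie in $(ls *.fasta | cut -f1-3 -d_ | sort -u)
do
    # The trailing underscore keeps the output file out of its own input glob on a rerun
    cat "$specie"_*.fasta > "$specie.fasta"
done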
In this code, you list all the fasta files, cut out the species ID, and generate a unique list of species. Then you traverse this list and, for every species, concatenate all the files that start with that species ID into a single file named after the species.
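Before concatenating anything, it can help to look at the species list on its own. The following is a minimal sketch of just the prefix-extraction step; the helper name species_prefixes is hypothetical, and the file names are the ones from the question.

```shell
# Hypothetical helper: read file names on stdin (one per line) and
# print the unique three-word prefixes, as in the loop above.
species_prefixes() {
    cut -d_ -f1-3 | sort -u
}

printf '%s\n' \
    Bonobo_sp._str01_ABC784267_CDE789456.fasta \
    Homo_sapiens_cc21_ABC897867_CDE456789.fasta \
    Homo_sapiens_cc21_ABC893673_CDE753672.fasta \
    Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta |
    species_prefixes
# prints:
# Bonobo_sp._str01
# Gorilla_gorilla_ghjk6789
# Homo_sapiens_cc21
```

If a prefix in this list looks wrong, the three-word assumption does not hold for that file and the naming rule needs adjusting before any files are merged.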
More robust solutions can be written using find and avoiding ls, but they are more verbose and potentially less clear:
# Relies on the -z options of GNU cut and sort to handle NUL-separated names
while IFS= read -r -d '' specie
do
    cat "$specie"_*.fasta > "$specie.fasta"
done < <(find . -maxdepth 1 -name "*.fasta" -print0 | cut -z -f2 -d/ | cut -z -f1-3 -d_ | sort -zu)
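For comparison, the same grouping can be sketched with shell globs and parameter expansion alone, without ls, find, or the -z options. This is only a sketch under the same three-word assumption; the function name concat_by_species and the idea of writing the merged files into a separate output directory are assumptions added here, not part of the answer.

```shell
# Hypothetical helper: concat_by_species SRC_DIR OUT_DIR
# Merges SRC_DIR/<w1>_<w2>_<w3>_*.fasta into OUT_DIR/<w1>_<w2>_<w3>.fasta.
concat_by_species() {
    local src="$1" out="$2" last="" path name rest species
    mkdir -p "$out" || return 1
    # Only names with at least three underscores can carry a three-word prefix.
    for path in "$src"/*_*_*_*.fasta; do
        [ -e "$path" ] || continue       # the glob matched nothing
        name=${path##*/}                 # strip the directory part
        rest=${name#*_*_*_}              # everything after the third underscore
        species=${name%"_$rest"}         # the first three words
        [ "$species" = "$last" ] && continue   # this group was just written
        # '>' truncates, so running the function again does not duplicate data.
        cat "$src/$species"_*.fasta > "$out/$species.fasta"
        last=$species
    done
}
```

A call such as concat_by_species . merged would leave the original files untouched and put one file per species into merged/. Keeping the output in its own directory also means the merged files can never be picked up as inputs.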
Answer 2:
As stated in my comment above, if you know all your basenames and don't mind entering them explicitly, a simple solution would be
for f in Homo_sapiens_cc21_*.fasta; do
    cat "$f" >> Homo_sapiens_cc21.fasta
done
Since this is not the case, you need to find a common pattern by which to group the output. From your examples (EDIT: and your comment), it looks like this could be three words, each followed by an underscore.
Assuming this pattern is correct, this would probably do what you require:
for f in *.fasta; do
    # Append each file to the output named after its first three words
    cat "$f" >> "$(echo "$f" | awk -F'_' '{print $1"_"$2"_"$3".fasta"}')"
done
Explanation:
- List all the *.fasta files.
- Construct a file name from the prefix. We do this by piping through awk, telling it to split the input by _ (-F'_') and putting it back together ('{print $1"_"$2"_"$3".fasta"}').
- Finally, we cat the current file and redirect the output to the newly constructed file name.
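To check the name construction in isolation, the awk step can be run on a single file name. This is a small sketch; the wrapper function target_name is hypothetical and exists only to make the step easy to call.

```shell
# Hypothetical helper: print the output file name that the loop
# would append to for a given input file name.
target_name() {
    printf '%s\n' "$1" | awk -F'_' '{print $1"_"$2"_"$3".fasta"}'
}

target_name "Gorilla_gorilla_ghjk6789_ABC736522_CDE789456.fasta"
# prints: Gorilla_gorilla_ghjk6789.fasta
target_name "Bonobo_sp._str01_ABC784267_CDE789456.fasta"
# prints: Bonobo_sp._str01.fasta
```

Note that the loop appends with >>, so running it a second time in the same directory would add every sequence again. Removing the merged files before a rerun avoids that.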
Source: https://stackoverflow.com/questions/53652718/how-to-concatenate-files-that-have-the-same-beginning-of-a-name