Question
I'm here again! I would like to optimise my bash script in order to lower the time spent on each loop. Basically what it does is:
- get a piece of information from a tsv
- use that information to look it up with awk in a file
- print the line and export it
My issues are: 1) the files are 60GB compressed files: I need software to uncompress them (I'm actually trying to uncompress one now, not sure I'll have enough space) 2) it takes a long time to look through them anyway
My ideas to improve it:
- 0) as said, if possible I'll decompress the file
- 1) using GNU parallel with
parallel -j 0 ./extract_awk_reads_in_bam.sh ::: reads_id_and_pos.tsv
but I'm unsure it works as expected: it only brings the time per search down from 36 min to 16, so just a factor of 2.5? (I have 16 cores.) I was also thinking (but it may be redundant with GNU parallel?) of splitting my list of reads to look up into several files and launching them in parallel (see the sketch after this list)
- 2) sorting the bam file by read name, and exiting awk after having found 2 matches (it can't be more than 2)
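A minimal sketch of that splitting idea, assuming GNU coreutils split and GNU parallel are available and that extract_awk_reads_in_bam.sh takes the tsv of read IDs as its only argument (the chunk prefix reads_chunk_ is made up for illustration):
#!/bin/bash
# Split the lookup list into 16 line-aligned chunks, one per core.
split -n l/16 reads_id_and_pos.tsv reads_chunk_
# Run the existing script on each chunk, 16 jobs at a time.
parallel -j 16 ./extract_awk_reads_in_bam.sh ::: reads_chunk_*
Note that each chunk would still stream the whole bam once per read it contains, so this only divides the loop across cores, it does not remove the repeated scans.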
Here is the rest of my bash script; I'm really open to ideas to improve it, but I'm not sure I'm a superstar in programming, so maybe keeping it simple would help? :)
My bash script:
#!/bin/bash
while IFS=$'\t' read -r READ_ID_WH POS_HOTSPOT; do
echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}" >> /data/bismark2/reads_done_so_far.txt
echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}"
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v read_id="$READ_ID_WH" -v pos_hotspot="$POS_HOTSPOT" '$1==read_id {printf $0 "\t%s\twh_genome",pos_hotspot}'| head -2 >> /data/bismark2/export_reads_mapped.tsv
done <"$1"
My tsv file has a format like:
READ_ABCDEF\t1200
Thank you a lot ++
Answer 1:
TL;DR
Your new script will be:
#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
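As a usage note, the script expects the tsv as its first argument and writes the annotated reads to stdout; a hypothetical invocation (the script name extract_reads.sh is made up, the file names come from the question) would be:
# reads_id_and_pos.tsv: READ_ID<TAB>POSITION, one pair per line
./extract_reads.sh reads_id_and_pos.tsv > /data/bismark2/export_reads_mapped.tsv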
You are reading the entire file for each of the inputs. Better to look for all of them at the same time. Start by extracting the interesting reads and then, on this subset, apply the second transformation.
samtools view -@ 2 "$bam" | grep -f <(awk -F$'\t' '{print $1}' "$1") > "$sam"
Here you are getting all the reads with samtools and searching for all the terms that appear in the -f parameter of grep. That parameter is a file that contains the first column of the search input file. The output is a sam file with only the reads that are listed in the search input file.
awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {print $0, st_array[$1], "wh_genome"}' "$sam"
Finally, use awk for adding the extra information:
- Open the search input file with awk at the beginning and read its contents into an array (st_array)
- Set the Output Field Separator to the tabulator
- Traverse the sam file and add the extra information from the pre-populated array.
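Putting both steps together, here is a minimal sketch of the grep-then-awk variant; the bam path and output file come from the question, while the intermediate sam path is a made-up name:
#!/bin/bash
# $1 is the search input tsv: READ_ID<TAB>POSITION
bam=/data/bismark2/aligned_on_nDNA/bamfile.bam   # from the question
sam=/data/bismark2/reads_of_interest.sam         # hypothetical intermediate file

# Step 1: stream the bam once, keep only reads whose ID is in column 1 of the tsv.
samtools view -@ 2 "$bam" | grep -f <(awk -F$'\t' '{print $1}' "$1") > "$sam"

# Step 2: annotate the kept reads with their hotspot position.
awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {print $0, st_array[$1], "wh_genome"}' "$sam" > /data/bismark2/export_reads_mapped.tsv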
I'm proposing this schema because I feel like grep is faster than awk for doing the search, but the same result can be obtained with awk alone:
samtools view -@ 2 "$bam" | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
In this case, you only need to add a conditional to identify the interesting reads and get rid of the grep.
In any case, you don't need to re-read the file more than once or to decompress it before working with it.
Source: https://stackoverflow.com/questions/60337568/optimising-my-script-which-lookups-into-a-big-compressed-file