问题
I have two files (recode and reads) that were built and saved with nano command and I want to compare what has on recode to reads and extract the lines in reads that overlaps. I have been trying to create a when loop with the previous logic on mind, but without success so far. The output data is not matching with the pattern specified in the loop while
with grep/recode. The script was supposed to read each line in recode.txt compare to reads.fastq, extract each match line plus one line before and 2 after in the reads.txt and save the output in different files (for all combined match lines per line of the recode.txt). Here are the tables and code:
File recode.txt
:
GTGTCTTA+ATCACGAC
GTGTCTTA+ACAGTGGT
GTGTCTTA+CAGATCCA
GTGTCTTA+ACAAACGG
GTGTCTTA+ACCCAGCA
GTGTCTTA+AACCCCTC
GTGTCTTA+CCCAACCT
ATCACGAC+AAGGTTCA
GTGTCTTA+GAAACCCA
File reads.fastq
:
###################################
@NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
+
AAAAA#EAAEEEEE6EAEAEEEEEEEEEEEEEEEAEEEEEE/EEEEEEEEEE/EEEEEEEEEEEEEEEAEEEEEA
@NB500931:113:HW53WBGX2:1:11101:6116:1049 1:N:0:ACAAACGG+AAGGTTCA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###################################
@NB500931:113:HW53WBGX2:1:11101:6885:1049 1:N:0:ACCCAGCA+ACTTAGCA
GAGGGNGCTGTCCCAGTAATTGGGTTCAGATGACATTTGCTTGATTTTAGGGATGTACGAGATTTTCGTGGATC
+
AAA/A#EAEEEEEAEAEEA///EEAEEEEE///AEEAEE/AA//EAA<EEE/E//AEEEAAA//E/A<6//EEA
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+
Script:
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=96:00:00
while read line
do
echo "working on $line"
grep -A3 "$line" reads.fastq | grep -v "^--$" >> "$line"_sorted.fastq
done<recode.txt
So, both files are in UNIX format and the following script (without a loop) works smooth
According to the script without the looping:
grep -A3 "ATCACGAC+AAGGTTCA" reads.fastq | grep -v "^--$" > sorted_file.fastq
my output should be:
@NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
+
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+
However, my output using the loop while
give me a empty file with the correct name. Can you please help me?
UPDATE: I have tried dos2unix to convert my files and it didn't work. UPDATE: I edited the question to include my expected output
回答1:
Without seeing the expected output it's a guess but it sounds like this is what you're trying to do:
$ awk -F: 'NR==FNR{a[$0];next} $NF in a{c=3} c&&c--' recode.txt reads.fastq
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+
No shell loop required (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice for SOME of the reasons why that matters), just saves the values from recode.txt as array indices and then when reading reads.fastq if the last :-separated field is an index of the array (i.e. existed in recode.txt) then set a counter to 3 and then print every line while the counter is greater than zero, decrementing the counter each time (see printing-with-sed-or-awk-a-line-following-a-matching-pattern for other examples of printing text starting from a match).
To save each found record in a file based on the string name in that final field as it looks like you might be trying to do in your shell loop would be:
awk -F: '
NR==FNR { a[$0]; next }
$NF in a { c=3; close(out); out=$NF"_sorted.fastq" }
c&&c-- { print >> out }
' recode.txt reads.fastq
Note that that just reads "reads.fastq" once total, not once per line of "recode.txt" as your shell loop was doing, so you can expect a vast performance improvement from that aspect alone.
Finally - if recode.txt is just a list of ALL of the final fields that exist in reads.fastq then you simply don't need it, this is all you need to split reads.fastq into separate files of 3 lines per record named based on the value after the last :
on each line that starts with @
:
awk -F: '
/^@/ { c=3; close(out); out=$NF"_sorted.fastq" }
c&&c-- { print >> out }
' reads.fastq
来源:https://stackoverflow.com/questions/45644898/grep-issues-when-using-two-files-ive-tried-everything