How to define the range 'r(n,)' using a variable in match() function with awk

两盒软妹~` 提交于 2019-12-25 01:39:06

问题


I am processing text files with thousands of records per file. Each record is made up of two lines: a header that starts with ">" and followed by a line with a long string of characters "-AGTCNR". Here is how a simple file looks like:

>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----

Now I am trying to search through the second field (line) of each record and only extract records which have up to a certain maximum number of "-" characters (referred to as gaps) at the beginning $start_gaps and end, $end_gaps, of line(field $2).

I have tried a few codes and the following came works well:

read -p "Please enter the muximum number of gaps allowed at start position: " start_gaps &&
read -p "Please enter the maximum number of gaps allowed at the end position: " end_gaps &&
awk -v start_g=$start_gaps -v end_g=$end_gaps 'BEGIN{
RS="\n>"; FS="\n"; ORS="\n"; OFS="\n"; }; (x=start_g+1)(y=end_g+1) { 
if ( match($2, "^-{5,}") && match($2, "-{6,}$") ) {
next} else {print x y ">"$0}}' infile > outfile

But I need to keep using variable numbers without explicitly editing the script every time i am conducting the regex pattern matching. So i tried the following but the regex do not accept variables. What is the best work around to this?

read -p "Please enter the muximum number of gaps allowed at start position: " start_gaps &&
read -p "Please enter the maximum number of gaps allowed at the end position: " end_gaps &&
awk -v start_g=$start_gaps -v end_g=$end_gaps 'BEGIN{
RS="\n>"; FS="\n"; ORS="\n"; OFS="\n"; }; (x=start_g+1)(y=end_g+1) {
if ( match($2, "^-{x,}") && match($2, "-{y,}$") ) {
next} else {print x y ">"$0}}' infile > outfile

Expected results:

>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co  
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA

回答1:


match sets variable RLENGTH to the matched substring's length, make use of it. Also, you don't need a multi-char RS for this.

awk -v start_g="$start_gaps" -v end_g="$end_gaps" '
/^>/ { hdr=$0; next }
match($0,/^-*/) && RLENGTH<=start_g && match($0,/-*$/) && RLENGTH<=end_g { print hdr; print }
' file


来源:https://stackoverflow.com/questions/56276934/how-to-define-the-range-rn-using-a-variable-in-match-function-with-awk

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!