How to remove overlap in numeric ranges (AWK)

Question

I'm trying to remove the overlap within a file.

There are a number of records that start with an 'A' and have a 'start-value' and an 'end-value'.
There are also a number of records that start with a 'B'. These also have a range, and each one shows a possible overlap with a record starting with 'A'. The idea is to remove the overlapping part from the A range so that only non-overlapping ranges remain.

Some of the records in B have the same 'start-value' as a record in A, while others have the same 'end-value'. So, if A has a range of 0 - 100 and B has a range of 0 - 32, then my expected output is: A 33 - 100 and B 0 - 32.
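
The trimming rule described above can be sketched for a single pair of ranges. This is only a minimal illustration, assuming B always shares either its start or its end with A; the helper name `trim_a` and its variable names are hypothetical and are not part of the original post.

```shell
# trim_a (hypothetical helper): print what is left of an A range
# after removing a B range that shares its start or its end with A.
# Arguments: a_start a_end b_start b_end
trim_a() {
    awk -v a_start="$1" -v a_end="$2" -v b_start="$3" -v b_end="$4" 'BEGIN {
        if (b_start == a_start)      # B covers the left edge of A
            a_start = b_end + 1
        else if (b_end == a_end)     # B covers the right edge of A
            a_end = b_start - 1
        print a_start, a_end
    }'
}

trim_a 0 100 0 32            # prints: 33 100
trim_a 1200 1300 1250 1300   # prints: 1200 1249
```

If B shares neither edge with A, this sketch prints the A range unchanged.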

Although I have a lot of files that need to undergo this operation, the individual files are very small.

This is an example file:

A   0       100
A   101     160 
A   200     300
A   500     1100
A   1200    1300
A   1301    1340
A   1810    2000
B   0       32
B   500     540
B   1250    1300
B   1319    1340
B   1920    2000

Expected sample output

A   33      100
A   101     160 
A   200     300
A   541     1100
A   1200    1249
A   1301    1318
A   1810    1919
B   0       32
B   500     540
B   1250    1300
B   1319    1340
B   1920    2000

Thanks for all your help!

Answer 1:

OK, since the OP confirmed that B 501 540 was a typo, here is my answer :)

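awk -v OFS="\t" '
# A records: remember start and end by line number; l is the last A line
/^A/{s[NR]=$2;e[NR]=$3;l=NR}
/^B/{ 
        # find the A range sharing a start or an end with this B range
        for(i=1;i<=l;i++){
                if(s[i]==$2){
                        s[i]=$3+1       # same start: A now begins right after B
                        break
                }else if(e[i]==$3){
                        e[i]=$2-1       # same end: A now stops right before B
                        break
                }
        }
        s[NR] = $2; e[NR]=$3            # B itself is kept unchanged
}
# lines 1..l are A records, the rest are B records
END{for(i=1;i<=NR;i++)print ((i<=l)?"A":"B"),s[i],e[i]}
        ' file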

Test with your file (with the typo fixed):

kent$  awk -v OFS="\t" '/^A/{s[NR]=$2;e[NR]=$3;l=NR}
/^B/{ 
        for(i=1;i<=l;i++){
                if(s[i]==$2){
                        s[i]=$3+1
                        break
                }else if(e[i]==$3){
                        e[i]=$2-1
                        break
                }
        }
        s[NR] = $2; e[NR]=$3
}
END{for(i=1;i<=NR;i++)print ((i<=l)?"A":"B"),s[i],e[i]}
        ' file
    A       33      100
    A       101     160
    A       200     300
    A       541     1100
    A       1200    1249
    A       1301    1318
    A       1810    1919
    B       0       32
    B       500     540
    B       1250    1300
    B       1319    1340
    B       1920    2000

EDIT for 6 columns:

Quick and dirty; please check the example below:

file:

kent$  cat file
A   0       100 1 2 3
A   101     160 4 5 6
A   200     300 7 8 9
A   500     1100 10 11 12
A   1200    1300 13 14 15
A   1301    1340 16 17 18
A   1810    2000 19 20 21
B   0       32  22 23 24
B   500     540 22 23 24
B   1250    1300 22 23 24
B   1319    1340 22 23 24
B   1920    2000 22 23 24

awk:

kent$  awk -v OFS="\t" '{s[NR]=$2;e[NR]=$3}
/^A/{l=NR}
/^B/{ 
        for(i=1;i<=l;i++){
                if(s[i]==$2){
                        s[i]=$3+1
                        break
                }else if(e[i]==$3){
                        e[i]=$2-1
                        break
                }
        }
}
{r[NR]=$4OFS$5OFS$6}
END{for(i=1;i<=NR;i++)print ((i<=l)?"A":"B"),s[i],e[i],r[i]} ' file
A       33      100     1       2       3
A       101     160     4       5       6
A       200     300     7       8       9
A       541     1100    10      11      12
A       1200    1249    13      14      15
A       1301    1318    16      17      18
A       1810    1919    19      20      21
B       0       32      22      23      24
B       500     540     22      23      24
B       1250    1300    22      23      24
B       1319    1340    22      23      24
B       1920    2000    22      23      24

Source: https://stackoverflow.com/questions/16638951/how-to-remove-overlap-in-numeric-ranges-awk

Tags

awk

overlap

genetic-programming