Question
I'm trying to remove overlapping ranges within a file.
- There is a set of records that start with 'A', each with a 'start-value' and an 'end-value'.
- There is also a set of records that start with 'B', each also with a range, which may overlap with records starting with 'A'. The idea is to remove the overlapping range from A so that only non-overlapping ranges remain.
Some of the B records share their 'start-value' with an A record, while others share their 'end-value'. So if A has a range of 0 - 100 and B has a range of 0 - 32, my expected output is A 33 - 100 and B 0 - 32.
Although I have a lot of files that need to undergo this operation, the individual files are very small.
This is an example file:
A 0 100
A 101 160
A 200 300
A 500 1100
A 1200 1300
A 1301 1340
A 1810 2000
B 0 32
B 500 540
B 1250 1300
B 1319 1340
B 1920 2000
Expected sample output:
A 33 100
A 101 160
A 200 300
A 541 1100
A 1200 1249
A 1301 1318
A 1810 1919
B 0 32
B 500 540
B 1250 1300
B 1319 1340
B 1920 2000
Thanks for all your help!
Answer 1:
OK, since the OP confirmed that the line B 501 540 was a typo, here is my answer :)
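awk -v OFS="\t" '
# A records: remember start/end by line number; l tracks the last A line
/^A/{s[NR]=$2;e[NR]=$3;l=NR}
# B records: trim the first A record that shares its start or end value
/^B/{
    for(i=1;i<=l;i++){
        if(s[i]==$2){
            s[i]=$3+1    # same start: A now begins right after B ends
            break
        }else if(e[i]==$3){
            e[i]=$2-1    # same end: A now stops right before B begins
            break
        }
    }
    s[NR] = $2; e[NR]=$3
}
# lines 1..l are printed as A, the rest as B (assumes all A lines precede the B lines)
END{for(i=1;i<=NR;i++)print ((i<=l)?"A":"B"),s[i],e[i]}
' file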
Test with your file (with the typo fixed):
kent$ awk -v OFS="\t" '/^A/{s[NR]=$2;e[NR]=$3;l=NR}
/^B/{
for(i=1;i<=l;i++){
if(s[i]==$2){
s[i]=$3+1
break
}else if(e[i]==$3){
e[i]=$2-1
break
}
}
s[NR] = $2; e[NR]=$3
}
END{for(i=1;i<=NR;i++)print ((i<=l)?"A":"B"),s[i],e[i]}
' file
A 33 100
A 101 160
A 200 300
A 541 1100
A 1200 1249
A 1301 1318
A 1810 1919
B 0 32
B 500 540
B 1250 1300
B 1319 1340
B 1920 2000
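Since the question mentions many small files, here is a rough sketch of how the program above could be applied in a batch. The function names trim_overlap and trim_all and the .trimmed output suffix are made up for illustration; they are not part of the original answer.

```shell
# Hypothetical helper: run the awk program from this answer on one file.
trim_overlap() {
  awk -v OFS="\t" '/^A/{s[NR]=$2;e[NR]=$3;l=NR}
  /^B/{
    for(i=1;i<=l;i++){
      if(s[i]==$2){
        s[i]=$3+1
        break
      }else if(e[i]==$3){
        e[i]=$2-1
        break
      }
    }
    s[NR]=$2; e[NR]=$3
  }
  END{for(i=1;i<=NR;i++)print ((i<=l)?"A":"B"),s[i],e[i]}' "$1"
}

# Hypothetical batch driver: writes NAME.trimmed next to each input file.
trim_all() {
  for f in "$@"; do
    trim_overlap "$f" > "$f.trimmed"
  done
}
```

It could then be called as, for example, trim_all data/*.txt (the path is only a placeholder). Writing to a separate file rather than overwriting the input keeps the originals available for checking.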
EDIT for 6 columns:
Quick and dirty; please check the example below:
file:
kent$ cat file
A 0 100 1 2 3
A 101 160 4 5 6
A 200 300 7 8 9
A 500 1100 10 11 12
A 1200 1300 13 14 15
A 1301 1340 16 17 18
A 1810 2000 19 20 21
B 0 32 22 23 24
B 500 540 22 23 24
B 1250 1300 22 23 24
B 1319 1340 22 23 24
B 1920 2000 22 23 24
awk:
kent$ awk -v OFS="\t" '{s[NR]=$2;e[NR]=$3}
/^A/{l=NR}
/^B/{
for(i=1;i<=l;i++){
if(s[i]==$2){
s[i]=$3+1
break
}else if(e[i]==$3){
e[i]=$2-1
break
}
}
}
{r[NR]=$4OFS$5OFS$6}
END{for(i=1;i<=NR;i++)print ((i<=l)?"A":"B"),s[i],e[i],r[i]} ' file
A 33 100 1 2 3
A 101 160 4 5 6
A 200 300 7 8 9
A 541 1100 10 11 12
A 1200 1249 13 14 15
A 1301 1318 16 17 18
A 1810 1919 19 20 21
B 0 32 22 23 24
B 500 540 22 23 24
B 1250 1300 22 23 24
B 1319 1340 22 23 24
B 1920 2000 22 23 24
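If the number of extra columns varies, or the A and B lines are not grouped, a two-pass variant may be easier to adapt. This is only a sketch under my own assumptions, not the program above: the function name trim_ranges is hypothetical, the file is read twice (first to collect the B ranges, then to trim the A records), and an A range is trimmed only when the B range is strictly shorter, so an A record identical to a B record is left untouched.

```shell
# Hypothetical helper: trim A ranges against B ranges, keeping all other columns.
trim_ranges() {
  awk -v OFS="\t" '
    # Pass 1 (NR == FNR): collect the start/end of every B record.
    NR == FNR {
      if ($1 == "B") { bs[++n] = $2; be[n] = $3 }
      next
    }
    # Pass 2: trim each A record against every collected B range.
    $1 == "A" {
      for (i = 1; i <= n; i++) {
        if (bs[i] == $2 && be[i] < $3) {
          $2 = be[i] + 1      # shared start: A begins right after B ends
        } else if (be[i] == $3 && bs[i] > $2) {
          $3 = bs[i] - 1      # shared end: A stops right before B begins
        }
      }
    }
    # Rebuild every line with OFS so the output is uniformly tab-separated.
    { $1 = $1; print }
  ' "$1" "$1"
}
```

Called as trim_ranges file, it prints the records in their original order. Because the fields are edited in place, any trailing columns are carried along without being named explicitly.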
Source: https://stackoverflow.com/questions/16638951/how-to-remove-overlap-in-numeric-ranges-awk