Question
I have collected the following file:
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA
This is a ;-separated file with 4 columns. However, the combination of columns 2 and 3 must be unique. Since this dataset has millions of rows, I'm looking for an efficient way to keep only the first occurrence of each duplicated combination. I therefore need to match on the combination of columns 2 and 3 alone and then select the first line for each.
The expected outcome should be:
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9 #REMOVED
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2 #REMOVED
20130308;343380;7591;NA #REMOVED
I have made a few attempts myself. The first one is:
grep -oE "\;(.*);" orders_20130304to20140219_v3.txt | uniq
However, this selects only columns 2 and 3 and discards all other data. Furthermore, uniq only collapses adjacent duplicates, so it does not catch a match that occurs later in the file. I could fix that by adding sort, but I prefer not to sort.
Another attempt is:
awk '!x[$0]++' test.txt
This does not require any sorting, but it matches on the complete line.
I think the second attempt is close, but it needs to be changed to look only at the second and third columns instead of the whole line. Does anyone know how to do this?
Answer 1:
Here you go:
awk -F';' '!a[$2 FS $3]++' file
Test with your data:
kent$ awk -F';' '!a[$2 FS $3]++' f
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
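For readers unfamiliar with the idiom, the one-liner can be spelled out. The following is only a sketch of what `!a[$2 FS $3]++` does, not a different method: it builds a key from columns 2 and 3 and prints a line only the first time that key is seen. The names `input`, `result`, `key` and `seen` are made up for this illustration, and the sample data is copied from the question into a temporary file.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA
EOF

# Expanded equivalent of: awk -F';' '!a[$2 FS $3]++'
result=$(awk -F';' '
{
    key = $2 FS $3            # composite key: column 2, the separator, column 3
    if (!(key in seen)) {     # true only the first time this key appears
        seen[key] = 1         # remember the key for later lines
        print                 # print the whole original line unchanged
    }
}' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

On this sample it prints the same nine lines as the test above. Joining the two columns with FS keeps pairs such as 12;3 and 1;23 from collapsing into the same key. No sorting is involved and input order is preserved, but awk keeps one array entry per distinct combination in memory, which is worth bearing in mind with millions of rows.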
Source: https://stackoverflow.com/questions/21929071/grep-only-one-of-partial-duplicates