Grep only one of partial duplicates

早过忘川 提交于 2019-12-23 05:29:34

问题


I have collected the following file:

20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA

This is a ; seperated file with 4 columns. The combination of column 2 and 3 however must be unique. Since this dataset has millions of rows I'm looking for an efficient way to get the first occurence of every duplicate. I therefore need to partial match the combination of column 2 and 3 and then select the first one.

The expected outcome should be:

20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9 #REMOVED
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2 #REMOVED
20130308;343380;7591;NA #$REMOVED

I have made a few attempts myself. The first one is:

grep -oE "\;(.*);" orders_20130304to20140219_v3.txt | uniq 

However this selects only column 2 and 3 and removes all other data. Furthermore it does not take into account a match that occurs later. I can fix that by adding sort, but I prefer not to sort.

Another attempt is:

awk '!x[$0]++' test.txt

This does not require any sorting, but matches the complete line.

I think the second attempt is close, but that needs to be changed in order to only look at the second and third column instead of the whole line. Does anyone know how to incorporate this?


回答1:


here you go:

awk -F';' '!a[$2 FS $3]++' file

test with your data:

kent$  awk -F';' '!a[$2 FS $3]++' f 
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA


来源:https://stackoverflow.com/questions/21929071/grep-only-one-of-partial-duplicates

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!