awk + How do I find duplicates in a column?

孤街浪徒 2021-02-09 09:27

How do I find duplicates in a column?

$ head countries_lat_long_int_code3.csv | cat -n
     1  country,latitude,longitude,name,code
     2  AD,42.546245,1.601554         


        
3 Answers
  • 2021-02-09 09:52

    This is the least memory-hungry approach I can think of:

    $ cat infile
    country,latitude,longitude,name,code
    AD,42.546245,1.601554,Andorra,376
    AE,23.424076,53.847818,United Arab Emirates,971
    AF,33.93911,67.709953,Afghanistan,93
    AG,17.060816,-61.796428,Antigua and Barbuda,1
    AI,18.220554,-63.068615,Anguilla,1
    AL,41.153332,20.168331,Albania,355
    AM,40.069099,45.038189,Armenia,374
    AN,12.226079,-69.060087,Netherlands Antilles,599
    AO,-11.202692,17.873887,Angola,355
    
    $ awk -F\, '$NF in a{if (a[$NF]!=0){print a[$NF];a[$NF]=0}print;next}{a[$NF]=$0}' infile
    AG,17.060816,-61.796428,Antigua and Barbuda,1
    AI,18.220554,-63.068615,Anguilla,1
    AL,41.153332,20.168331,Albania,355
    AO,-11.202692,17.873887,Angola,355
    

    NOTE: I have included another duplicate for testing purposes.
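
For readers who find the one-liner dense, here is a commented expansion of the same idea, as a sketch rather than the answerer's exact code. The function name `dup_rows_by_last_field` and the array name `first` are made up for illustration, and an empty string replaces `0` as the "already printed" marker. It assumes comma-separated input with the value of interest in the last field.

```shell
# Hypothetical helper: prints every row whose last field was already seen,
# preceded (once) by the first row that carried that value.
dup_rows_by_last_field() {
  awk -F, '
    $NF in first {                 # last field seen before: this row is a duplicate
      if (first[$NF] != "") {      # first row with this value not printed yet
        print first[$NF]
        first[$NF] = ""            # mark it as printed; the key stays in the array
      }
      print
      next
    }
    { first[$NF] = $0 }            # first time this value appears: remember the row
  ' "$@"
}
```

It can be called as `dup_rows_by_last_field infile` or fed from a pipe. Only one row per distinct value is held in memory, and it is emptied once printed.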

  • 2021-02-09 09:56

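    If you just want each repeated value printed once, pipe the awk output through sort:

    awk ... ... | sort -u

    That prints each value only once, in sorted order. (`sort | uniq -u` would instead drop any value that the awk stage emits more than once, i.e. any code that occurs three or more times in the file.)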
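
The `sort`/`uniq` variants are easy to mix up, so here is a small sketch of how they differ. The three sample values are invented for illustration and stand in for whatever the awk stage emits.

```shell
# Sample values (made up): 1 appears twice, 355 appears once.
vals=$(printf '%s\n' 1 355 1)

all_once=$(printf '%s\n' "$vals" | sort -u)              # every distinct value, once
only_single=$(printf '%s\n' "$vals" | sort | uniq -u)    # values that occur exactly once
only_repeated=$(printf '%s\n' "$vals" | sort | uniq -d)  # values that occur more than once

printf 'sort -u:        %s\n' "$(printf '%s' "$all_once" | tr '\n' ' ')"   # 1 355
printf 'sort | uniq -u: %s\n' "$only_single"                                # 355
printf 'sort | uniq -d: %s\n' "$only_repeated"                              # 1
```

`uniq` only compares adjacent lines, which is why the input is sorted first.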

  • 2021-02-09 09:56

    This prints a code each time it repeats (second and later occurrences):

    awk -F, 'a[$5]++{print $5}'
    

    If you're only interested in the count of duplicate codes (`count+0` prints 0 rather than an empty line when there are none):

    awk -F, 'a[$5]++{count++} END{print count+0}'
    

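    To print duplicated rows together with the row they repeat, try this (when three or more rows share a code, the middle rows are printed twice):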

    awk -F, '$5 in a{print a[$5]; print} {a[$5]=$0}'
    

    This prints the whole row for the second and later occurrences of a value in column 5 (the first occurrence is not printed):

    awk -F, 'a[$5]++{print $0}'
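
If every row with a repeated code is wanted, including the first occurrence and with no row printed twice, one alternative is a two-pass read of the file. This is a different technique from the commands above, not the answerer's method. The function name `all_dup_rows` is hypothetical, and the sketch assumes a regular file (not a pipe) with the code in column 5.

```shell
# Hypothetical helper: print every row whose 5th field occurs more than once,
# in original order. The file is read twice, so it must be a regular file.
all_dup_rows() {
  awk -F, '
    NR == FNR { count[$5]++; next }   # first pass: count each code
    count[$5] > 1                     # second pass: print rows with a repeated code
  ' "$1" "$1"
}
```

Called as `all_dup_rows infile`, it keeps only one counter per distinct code in memory, at the cost of reading the file twice.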
    