Bash Script: count unique lines in file

有刺的猬 2020-12-12 12:30

Situation:

I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this

3 Answers
  • 2020-12-12 12:57

    To count the total number of unique lines (i.e. the number of distinct lines, with duplicates counted once) we can use uniq or Awk with wc:

    sort ips.txt | uniq | wc -l
    awk '!seen[$0]++' ips.txt | wc -l
    

    Awk's arrays are associative (hash tables), so it skips the sort entirely and can run a little faster.
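
    An equivalent shorthand, worth mentioning alongside the above, is sort's standard -u flag, which folds the deduplication into the sort itself and drops the uniq stage:

    sort -u ips.txt | wc -l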

    Generating a test file:

    $ for i in {1..100000}; do echo $RANDOM; done > random.txt
    $ time sort random.txt | uniq | wc -l
    31175
    
    real    0m1.193s
    user    0m0.701s
    sys     0m0.388s
    
    $ time awk '!seen[$0]++' random.txt | wc -l
    31175
    
    real    0m0.675s
    user    0m0.108s
    sys     0m0.171s
    
  • 2020-12-12 13:09

    This is a fast way to get the count of each repeated line and have the results nicely printed, sorted from least frequent to most frequent:

    awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' ips.txt | sort -n
    
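
    For illustration, here is the same pipeline run on a few made-up sample lines (the addresses are hypothetical):

    $ printf '1.1.1.1:80\n8.8.8.8:53\n1.1.1.1:80\n' | awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' | sort -n
    1 8.8.8.8:53
    2 1.1.1.1:80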

    If you don't care about performance and want something easier to remember, simply run:

    sort ips.txt | uniq -c | sort -n
    

    PS:

    sort -n parses the field as a number, which is what we want since we're sorting on the counts.
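
    Without -n, sort compares the counts as strings, so a count of 10 would sort before 9. A quick demonstration with dummy data:

    $ printf '9 foo\n10 bar\n' | sort
    10 bar
    9 foo
    $ printf '9 foo\n10 bar\n' | sort -n
    9 foo
    10 bar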

  • 2020-12-12 13:15

    You can use the uniq command to get counts of repeated lines; uniq only collapses adjacent duplicates, which is why the input is sorted first:

    sort ips.txt | uniq -c
    

    To get the most frequent results at top (thanks to Peter Jaric):

    sort ips.txt | uniq -c | sort -bgr
    
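    Here -b ignores the leading blanks that uniq -c pads its counts with, -g compares by general numeric value, and -r reverses the order so the biggest counts come first. A tiny demonstration with made-up input:

    $ printf 'a\nb\na\na\nc\nb\n' | sort | uniq -c | sort -bgr
          3 a
          2 b
          1 c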