Bash Script: count unique lines in file

有刺的猬 2020-12-12 12:30

Situation:

I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this

3 Answers
  • 2020-12-12 12:57

    To count the total number of unique lines (i.e. the number of distinct lines, with duplicates counted once) we can use uniq or Awk with wc:

    sort ips.txt | uniq | wc -l
    awk '!seen[$0]++' ips.txt | wc -l
    

    Awk's arrays are associative (hash tables), so it skips the sort entirely and can run a little faster.
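
    An equivalent shorthand, worth mentioning alongside the above, is sort's standard -u flag, which folds the deduplication into the sort itself and drops the uniq stage:

    sort -u ips.txt | wc -l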

    Generating a test file:

    $ for i in {1..100000}; do echo $RANDOM; done > random.txt
    $ time sort random.txt | uniq | wc -l
    31175
    
    real    0m1.193s
    user    0m0.701s
    sys     0m0.388s
    
    $ time awk '!seen[$0]++' random.txt | wc -l
    31175
    
    real    0m0.675s
    user    0m0.108s
    sys     0m0.171s
    
  • 2020-12-12 13:09

    This is a fast way to get the count of each repeated line and have the results nicely printed, sorted from least frequent to most frequent:

    awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' ips.txt | sort -n
    
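
    For illustration, here is the same pipeline run on a few made-up sample lines (the addresses are hypothetical):

    $ printf '1.1.1.1:80\n8.8.8.8:53\n1.1.1.1:80\n' | awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' | sort -n
    1 8.8.8.8:53
    2 1.1.1.1:80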

    If you don't care about performance and want something easier to remember, simply run:

    sort ips.txt | uniq -c | sort -n
    

    PS:

    sort -n parses the field as a number, which is what we want since we're sorting on the counts.
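
    Without -n, sort compares the counts as strings, so a count of 10 would sort before 9. A quick demonstration with dummy data:

    $ printf '9 foo\n10 bar\n' | sort
    10 bar
    9 foo
    $ printf '9 foo\n10 bar\n' | sort -n
    9 foo
    10 bar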

  • 2020-12-12 13:15

    You can use the uniq command to get counts of repeated lines; uniq only collapses adjacent duplicates, which is why the input is sorted first:

    sort ips.txt | uniq -c
    

    To get the most frequent results at top (thanks to Peter Jaric):

    sort ips.txt | uniq -c | sort -bgr
    
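    Here -b ignores the leading blanks that uniq -c pads its counts with, -g compares by general numeric value, and -r reverses the order so the biggest counts come first. A tiny demonstration with made-up input:

    $ printf 'a\nb\na\na\nc\nb\n' | sort | uniq -c | sort -bgr
          3 a
          2 b
          1 c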