Best way to simulate “group by” from bash?

Asked by 半阙折子戏 on 2020-11-29 15:03 · 14 answers · 1,012 views

Suppose you have a file that contains IP addresses, one address in each line:

10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1

You need a shell script that simulates SQL's "group by": print each distinct address together with the number of times it appears.

14 Answers
  • 2020-11-29 15:46

    It seems that you either have to use a fair amount of code to simulate hashes in bash and get linear behavior, or stick with the superlinear (quadratic) versions.

    Among those versions, saua's solution is the best (and simplest):

    sort -n ip_addresses.txt | uniq -c
    

    I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
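With bash 4 or newer you no longer need that trick: associative arrays give the linear-time hash behavior directly. A minimal sketch (assumes the input file is named `ip_addresses`, as elsewhere in this thread):

```shell
#!/usr/bin/env bash
# Count duplicate lines with a bash 4+ associative array (one pass, linear time).
declare -A count
while read -r addr; do
    count[$addr]=$(( ${count[$addr]:-0} + 1 ))   # default missing keys to 0
done < ip_addresses

# Print "address count" for every distinct address (order is unspecified).
for addr in "${!count[@]}"; do
    printf '%s %d\n' "$addr" "${count[$addr]}"
done
```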

  • 2020-11-29 15:46

    I'd have done it like this:

    perl -e 'while (<>) { chomp; $h{$_}++ } for $k (keys %h) { print "$k $h{$k}\n" }' ip_addresses
    

    but uniq might work for you.
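The same hash-of-counts idea can be written in awk, which is usually shorter for this kind of job (a sketch, equivalent to the Perl one-liner above):

```shell
# Tally each line in a hash, then print "address count" at the end.
awk '{ count[$0]++ } END { for (addr in count) print addr, count[addr] }' ip_addresses
```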

  • 2020-11-29 15:47

    The quick and dirty method is as follows:

    sort -n ip_addresses | uniq -c

    If you need to use the values in bash, you can capture the command's output with command substitution and then loop through the results.
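That capture-and-loop step could look like this (a sketch; uses a bash here-string, and assumes the input file is named `ip_addresses`):

```shell
#!/usr/bin/env bash
# Run the pipeline once, then iterate over its "count address" lines.
results=$(sort ip_addresses | uniq -c)
while read -r count addr; do
    echo "$addr seen $count times"
done <<< "$results"
```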

    PS

    If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.
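You can see this directly: uniq only collapses adjacent duplicates, so an unsorted input with a repeated value scattered through it is counted wrong.

```shell
printf 'a\nb\na\n' | uniq -c          # 3 lines out: the two a's are counted separately
printf 'a\nb\na\n' | sort | uniq -c   # 2 lines out: "2 a" and "1 b"
```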

  • 2020-11-29 15:47

    You could even use the file system itself as a hash table, keeping one counter file per address (run this in an empty scratch directory; note the first occurrence must write 1, not 0):

    while read -r addr; do
      if [ ! -f "$addr" ]; then
        echo 1 > "$addr"              # first occurrence of this address
      else
        n=$(cat "$addr")              # read the current count
        echo $(( n + 1 )) > "$addr"   # increment and write it back
      fi
    done < ip_addresses
    

    In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
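That final readout could be sketched like this (assumes the counter files from the loop above are the only files in the current scratch directory):

```shell
# Print "address count" for each counter file in the current directory.
for f in *; do
    [ -f "$f" ] && printf '%s %s\n' "$f" "$(cat "$f")"
done
```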

  • 2020-11-29 15:50

    The canonical solution is the one mentioned by another respondent:

    sort | uniq -c
    

    It is shorter and more concise than what can be written in Perl or awk.

    You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
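GNU sort even lets you tune this behavior explicitly (assumption: GNU coreutils sort; `-S` caps the in-memory buffer and `-T` chooses where the spilled temporary chunks go):

```shell
# Sort a huge file using at most ~64 MB of RAM, spilling chunks to /tmp.
sort -S 64M -T /tmp ip_addresses | uniq -c
```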

  • 2020-11-29 15:54
    sort ip_addresses | uniq -c
    

    This will print the count first, but other than that it should be exactly what you want.
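If the count-first layout or the ordering matters, two common follow-ups (a sketch, again assuming the input file `ip_addresses`):

```shell
# Most frequent addresses first:
sort ip_addresses | uniq -c | sort -rn

# Address first, count second:
sort ip_addresses | uniq -c | awk '{ print $2, $1 }'
```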
