Best way to simulate “group by” from bash?

Asked by 半阙折子戏 on 2020-11-29 15:03 · 14 answers · 1,012 views

Suppose you have a file that contains IP addresses, one address in each line:

10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1

You need a shell script that simulates SQL's "group by": print each distinct address together with the number of times it appears.

14 Answers
  • 2020-11-29 15:46

    It seems that you either have to use a fair amount of code to simulate hashes in bash and get linear behavior, or stick with the superlinear (quadratic) versions.

    Among those versions, saua's solution is the best (and simplest):

    sort -n ip_addresses.txt | uniq -c
    

    I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
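With bash 4 or newer you no longer need that trick: associative arrays give the linear-time hash behavior directly. A minimal sketch (assumes the input file is named `ip_addresses`, as elsewhere in this thread):

```shell
#!/usr/bin/env bash
# Count duplicate lines with a bash 4+ associative array (one pass, linear time).
declare -A count
while read -r addr; do
    count[$addr]=$(( ${count[$addr]:-0} + 1 ))   # default missing keys to 0
done < ip_addresses

# Print "address count" for every distinct address (order is unspecified).
for addr in "${!count[@]}"; do
    printf '%s %d\n' "$addr" "${count[$addr]}"
done
```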

  • 2020-11-29 15:46

    I'd have done it like this:

    perl -e 'while (<>) { chomp; $h{$_}++ } for $k (keys %h) { print "$k $h{$k}\n" }' ip_addresses
    

    but uniq might work for you.
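The same hash-of-counts idea can be written in awk, which is usually shorter for this kind of job (a sketch, equivalent to the Perl one-liner above):

```shell
# Tally each line in a hash, then print "address count" at the end.
awk '{ count[$0]++ } END { for (addr in count) print addr, count[addr] }' ip_addresses
```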

  • 2020-11-29 15:47

    The quick and dirty method is as follows:

    sort -n ip_addresses | uniq -c

    If you need to use the values in bash, you can capture the command's output with command substitution and then loop through the results.
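That capture-and-loop step could look like this (a sketch; uses a bash here-string, and assumes the input file is named `ip_addresses`):

```shell
#!/usr/bin/env bash
# Run the pipeline once, then iterate over its "count address" lines.
results=$(sort ip_addresses | uniq -c)
while read -r count addr; do
    echo "$addr seen $count times"
done <<< "$results"
```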

    PS

    If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.
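You can see this directly: uniq only collapses adjacent duplicates, so an unsorted input with a repeated value scattered through it is counted wrong.

```shell
printf 'a\nb\na\n' | uniq -c          # 3 lines out: the two a's are counted separately
printf 'a\nb\na\n' | sort | uniq -c   # 2 lines out: "2 a" and "1 b"
```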

  • 2020-11-29 15:47

    You could even use the file system itself as a hash table, keeping one counter file per address (run this in an empty scratch directory; note the first occurrence must write 1, not 0):

    while read -r addr; do
      if [ ! -f "$addr" ]; then
        echo 1 > "$addr"              # first occurrence of this address
      else
        n=$(cat "$addr")              # read the current count
        echo $(( n + 1 )) > "$addr"   # increment and write it back
      fi
    done < ip_addresses
    

    In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
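That final readout could be sketched like this (assumes the counter files from the loop above are the only files in the current scratch directory):

```shell
# Print "address count" for each counter file in the current directory.
for f in *; do
    [ -f "$f" ] && printf '%s %s\n' "$f" "$(cat "$f")"
done
```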

  • 2020-11-29 15:50

    The canonical solution is the one mentioned by another respondent:

    sort | uniq -c
    

    It is shorter and more concise than what can be written in Perl or awk.

    You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
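GNU sort even lets you tune this behavior explicitly (assumption: GNU coreutils sort; `-S` caps the in-memory buffer and `-T` chooses where the spilled temporary chunks go):

```shell
# Sort a huge file using at most ~64 MB of RAM, spilling chunks to /tmp.
sort -S 64M -T /tmp ip_addresses | uniq -c
```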

  • 2020-11-29 15:54
    sort ip_addresses | uniq -c
    

    This will print the count first, but other than that it should be exactly what you want.
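If the count-first layout or the ordering matters, two common follow-ups (a sketch, again assuming the input file `ip_addresses`):

```shell
# Most frequent addresses first:
sort ip_addresses | uniq -c | sort -rn

# Address first, count second:
sort ip_addresses | uniq -c | awk '{ print $2, $1 }'
```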
