Best way to simulate “group by” from bash?

半阙折子戏 2020-11-29 15:03

Suppose you have a file that contains IP addresses, one address in each line:

10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1

You need a

14 answers
  • 2020-11-29 16:06

    Most of the other solutions only count duplicates. If you really need to group key-value pairs, try this:

    Here is my example data:

    find . | xargs md5sum
    fe4ab8e15432161f452e345ff30c68b0 a.txt
    30c68b02161e15435ff52e34f4fe4ab8 b.txt
    30c68b02161e15435ff52e34f4fe4ab8 c.txt
    fe4ab8e15432161f452e345ff30c68b0 d.txt
    fe4ab8e15432161f452e345ff30c68b0 e.txt
    

    This prints the key-value pairs grouped by MD5 checksum (the listing above is assumed to be saved as table.txt):

    awk '{print $1}' table.txt | sort -u | xargs -I {} grep {} table.txt
    30c68b02161e15435ff52e34f4fe4ab8 b.txt
    30c68b02161e15435ff52e34f4fe4ab8 c.txt
    fe4ab8e15432161f452e345ff30c68b0 a.txt
    fe4ab8e15432161f452e345ff30c68b0 d.txt
    fe4ab8e15432161f452e345ff30c68b0 e.txt
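
The pipeline above runs one grep per distinct key. As an alternative sketch (not part of the original answer), a single awk pass can collect every value under its key. The function name group_by_key is hypothetical, and the sketch assumes two whitespace-separated fields per line with no spaces inside the value.

```shell
#!/usr/bin/env bash
# group_by_key (hypothetical helper): reads "key value" lines on stdin and
# prints each key once, followed by all of its values, in first-seen order.
group_by_key() {
    awk '
        # First time a key is seen: remember its position and its first value.
        !($1 in groups) { order[++n] = $1; groups[$1] = $2; next }
        # Key already known: append the value to its list.
        { groups[$1] = groups[$1] " " $2 }
        # Print the groups in the order their keys first appeared.
        END { for (i = 1; i <= n; i++) print order[i] ": " groups[order[i]] }
    '
}

# Usage with the example data from the answer:
printf '%s\n' \
    'fe4ab8e15432161f452e345ff30c68b0 a.txt' \
    '30c68b02161e15435ff52e34f4fe4ab8 b.txt' \
    '30c68b02161e15435ff52e34f4fe4ab8 c.txt' \
    'fe4ab8e15432161f452e345ff30c68b0 d.txt' \
    'fe4ab8e15432161f452e345ff30c68b0 e.txt' | group_by_key
```

Keeping the first-seen order in a separate array makes the output deterministic, since awk does not guarantee any iteration order for `for (k in array)`.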
    
  • 2020-11-29 16:07

    Pure bash (no fork!)

    There is a way to do this with a bash function. It is very quick because nothing is forked, as long as the number of IP addresses stays small!

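    countIp () {
        local -a _ips=(); local _a
        # Split each input line on "." into the four octets of the address.
        while IFS=. read -r -a _a; do
            # Pack the octets into one integer and use it as the array index.
            ((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
        done
        # Indexed arrays list their keys in ascending order, so output is sorted.
        for _a in "${!_ips[@]}"; do
            printf "%.16s %4d\n" \
              $(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
        done
    }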
    

    Note: each IP address is converted to a 32-bit unsigned integer, which is used as the array index. This uses plain bash indexed arrays, not associative arrays (which are more expensive)!
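
To make the conversion concrete, here is a small sketch of the same arithmetic applied to a single address. The helper names ip_to_int and int_to_ip are hypothetical and do not appear in the answer. The `10#` prefix is an addition in this sketch so that an octet written with a leading zero is not parsed as octal.

```shell
#!/usr/bin/env bash
# ip_to_int (hypothetical helper): pack a dotted-quad address into one integer.
ip_to_int() {
    local a b c d
    # Split the argument on "." into its four octets.
    IFS=. read -r a b c d <<< "$1"
    # Each octet occupies 8 bits; 10# forces base 10 for every octet.
    echo $(( (10#$a << 24) | (10#$b << 16) | (10#$c << 8) | 10#$d ))
}

# int_to_ip (hypothetical helper): unpack the integer back into dotted-quad form.
int_to_ip() {
    local n=$1
    # Shift each octet down to the low 8 bits and mask it with 255.
    echo "$(( n >> 24 & 255 )).$(( n >> 16 & 255 )).$(( n >> 8 & 255 )).$(( n & 255 ))"
}

ip_to_int 10.0.10.1    # prints 167774721  (10*2^24 + 0*2^16 + 10*2^8 + 1)
int_to_ip 167774721    # prints 10.0.10.1
```

Because the index grows with the numeric value of the address, walking the array keys in order also yields the addresses in numeric order, which is why the function's output comes out sorted without calling sort.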

    time countIp < ip_addresses 
    10.0.10.1    3
    10.0.10.2    1
    10.0.10.3    1
    real    0m0.001s
    user    0m0.004s
    sys     0m0.000s
    
    time sort ip_addresses | uniq -c
          3 10.0.10.1
          1 10.0.10.2
          1 10.0.10.3
    real    0m0.010s
    user    0m0.000s
    sys     0m0.000s
    

    On my host, this is a lot quicker than forking for up to roughly 1'000 addresses, but it takes about one full second when I try to sort and count 10'000 addresses.
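
The crossover point will differ from machine to machine. A rough way to check it yourself is sketched below: it repeats countIp from above so the script is self-contained, and adds a hypothetical generator, make_ips, that writes pseudo-random addresses in 10.0.x.y. The input sizes are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# countIp as in the answer, repeated here so the sketch runs on its own.
countIp () {
    local -a _ips=(); local _a
    while IFS=. read -r -a _a; do
        ((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
    done
    for _a in "${!_ips[@]}"; do
        printf "%.16s %4d\n" \
          $(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
    done
}

# make_ips (hypothetical helper): print $1 pseudo-random addresses in 10.0.x.y.
make_ips() {
    local i
    for ((i = 0; i < $1; i++)); do
        echo "10.0.$((RANDOM % 256)).$((RANDOM % 256))"
    done
}

# Time both approaches on growing inputs; output is discarded, only timing matters.
for n in 100 1000 10000; do
    f=$(mktemp)
    make_ips "$n" > "$f"
    echo "== $n addresses =="
    time countIp < "$f" > /dev/null
    time sort "$f" | uniq -c > /dev/null
    rm -f "$f"
done
```

The bash `time` keyword times the whole `sort | uniq -c` pipeline, so the two measurements are comparable.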
