Add frequency (number of occurrences) to my table of text through awk

陌清茗 2021-01-28 19:07

Given this input table:

pac1 xxx
pac1 yyy
pac1 zzz
pac2 xxx
pac2 uuu
pac3 zzz
pac3 uuu
pac4 zzz

I need to add frequencies as a third column, like:

2 answers
  • 2021-01-28 19:37

    If you want to use awk, run through every line once and collect information in three associative arrays: one to store the raw lines, one to count the occurrences of each column 1 value, and one to count the occurrences of each column 2 value. Then, in an END block, loop over the stored lines in input order, split each one into fields to get the indices for the two count arrays, and print the line with the appropriate frequencies. Something like:

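    awk '{ data[num++] = $0;   # keep every line, in input order
           col1[$1]++;         # occurrences of each column 1 value
           col2[$2]++          # occurrences of each column 2 value
         }
       END { for (i = 0; i < num; i++) {
            split(data[i], field)   # recover the fields of the stored line
            printf "%s %d/%d\n", data[i], col2[field[2]], col1[field[1]]
           }
        }' < input.file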
    

    This only requires reading the file once (at the cost of holding every line in memory), and can be extended for other columns and counts. The for loop causes the data to be displayed in the same order in which it was collected.

    Look at man awk for information on associative arrays, splitting a string, and for loops.
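
Because the single-pass version never reopens its input, it should also work when the data arrives on a pipe, where reading the file twice is not an option. A minimal sketch, assuming a shortened four-line sample and a hypothetical variable name one_pass:

```shell
# Feed a shortened sample through a pipe instead of a file.
# The variable name and the four rows are illustrative assumptions.
one_pass=$(printf 'pac1 xxx\npac1 yyy\npac1 zzz\npac2 xxx\n' |
  awk '{ data[n++] = $0; col1[$1]++; col2[$2]++ }
       END { for (i = 0; i < n; i++) {
               split(data[i], f)
               printf "%s %d/%d\n", data[i], col2[f[2]], col1[f[1]]
             }
           }')
printf '%s\n' "$one_pass"
```

With this shortened input, pac1 appears three times and xxx twice, so the first line printed is "pac1 xxx 2/3".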

  • 2021-01-28 19:38

    Just read the file twice: first to count the values and store them in an array, then to print its values:

    $ awk 'FNR==NR {col1[$1]++; col2[$2]++; next} {print $0, col2[$2] "/" col1[$1]}' file file
    pac1 xxx 2/3
    pac1 yyy 1/3
    pac1 zzz 3/3
    pac2 xxx 2/2
    pac2 uuu 2/2
    pac3 zzz 3/2
    pac3 uuu 2/2
    pac4 zzz 3/1
    

    The FNR==NR {things; next} is a trick to do things only while reading the first file. It relies on FNR and NR: FNR is the number of the current record within the current input file, while NR is the total number of records read so far across all files. FNR==NR is therefore true only while the first file is being read (provided that file is not empty). The next statement stops processing the current line and moves on to the next one, so the second block runs only during the second pass.
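
To see the two counters side by side, one can print both while reading the same file twice. This is only an illustrative sketch: the temporary file, its two lines, and the variable name fnr_demo are assumptions, not part of the question.

```shell
# Build a tiny two-line sample file (hypothetical contents).
tmp=$(mktemp)
printf 'a\nb\n' > "$tmp"

# Print FNR, NR and the line itself for both passes over the file.
fnr_demo=$(awk '{ print FNR, NR, $0 }' "$tmp" "$tmp")
printf '%s\n' "$fnr_demo"

rm -f "$tmp"
```

On the first pass the two numbers are equal (1 1, 2 2); on the second pass FNR restarts at 1 while NR keeps counting (1 3, 2 4).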

    Find more info in Idiomatic awk.


    Regarding your update: if you want the last item to contain the count of different values in the first column, just check the length of the array that was created. This will tell you how many different indexes it contains, and hence the value you want:

    $ awk 'FNR==NR {col1[$1]++; col2[$2]++; next} {print $0, col2[$2] "/" length(col1)}' file file
    pac1 xxx 2/4
    pac1 yyy 1/4
    pac1 zzz 3/4
    pac2 xxx 2/4
    pac2 uuu 2/4
    pac3 zzz 3/4
    pac3 uuu 2/4
    pac4 zzz 3/4
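
If the awk at hand does not accept an array as the argument of length, the same number can be obtained by counting each first-column value the first time it is seen. A sketch under that assumption; the counter name distinct, the variable distinct_demo, and the three-line sample are illustrative:

```shell
# Shortened sample file (hypothetical contents).
tmp=$(mktemp)
printf 'pac1 xxx\npac1 yyy\npac2 xxx\n' > "$tmp"

# First pass: bump "distinct" only when a column 1 value is new,
# then count as before. Second pass: print using the counter.
distinct_demo=$(awk 'FNR==NR { if (!($1 in col1)) distinct++
                              col1[$1]++; col2[$2]++; next }
                     { print $0, col2[$2] "/" distinct }' "$tmp" "$tmp")
printf '%s\n' "$distinct_demo"

rm -f "$tmp"
```

The test `$1 in col1` checks for the index without creating it, so the counter is incremented exactly once per distinct value.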
    