Merging word counts with Bash and Unix

Posted by 旧巷老猫 on 2021-02-10 19:51:53

Question


I made a Bash script that extracts words from a text file with grep and sed, sorts them with sort, counts the repetitions with wc, and then sorts them again by frequency. The example output looks like this:

12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy

Now I'd like to merge all words with the same frequency into one line, like this:

12 the
 7 code with add
 5 quite
 3 do well
 1 quick can pick easy

Is there any way to do that with Bash and the standard Unix toolset? Or would I have to write a script or program in a more sophisticated scripting language?
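For context, the asker's script is not shown. A pipeline that produces a frequency list of this shape might look like the sketch below. The function name word_freq is hypothetical, and uniq -c is an assumption: the question mentions wc, but uniq -c is the usual tool for per-word counts.

```shell
# Hypothetical helper, not taken from the question.
word_freq() {
    tr -cs '[:alpha:]' '\n' |      # turn every run of non-letters into one newline
    tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" count together
    sed '/^$/d' |                  # drop the empty line left by leading punctuation
    sort |                         # put identical words next to each other
    uniq -c |                      # prefix each distinct word with its count
    sort -rn                       # highest count first
}

printf 'The cat and the dog and the bird\n' | word_freq
```

The width of the count column depends on the uniq implementation, so the padding may differ from the example in the question.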


Answer 1:


With awk:

$ echo "12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

You can do something similar with Bash 4 associative arrays, but awk is easier and is specified by POSIX, so use that.
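For comparison, here is a minimal sketch of the Bash 4 associative-array variant mentioned above. The function name merge_counts is hypothetical, and the sketch assumes each input line is a count followed by a single word.

```shell
#!/usr/bin/env bash
# Requires Bash 4 or later for associative arrays (local -A / declare -A).
merge_counts() {
    local -A words=()                 # maps count -> space-separated words
    local count word
    while read -r count word; do      # read trims the leading padding
        [[ -n $count ]] || continue   # skip blank lines
        # Append with a separating space only if the entry is already non-empty.
        words[$count]="${words[$count]:+${words[$count]} }$word"
    done
    # Key order of an associative array is unspecified, so sort afterwards.
    for count in "${!words[@]}"; do
        printf '%s %s\n' "$count" "${words[$count]}"
    done | sort -rn
}

printf '12 the\n 7 code\n 7 with\n 7 add\n 5 quite\n' | merge_counts
```

As with the awk version, the final sort is needed because the array has no defined iteration order.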


Explanation:

  1. awk splits each line into fields on the separator in FS, in this case the default (runs of spaces and tabs);
  2. $1 is the first field, the count; it is used as the key of the associative array cnt, so cnt[$1] collects the words that share a count;
  3. cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2 is a ternary assignment: if cnt[$1] has no value yet, it is assigned the second field $2 (the right-hand side of the :); if it already has a value, $2 is appended to it, separated by the value of OFS (the left-hand side of the :);
  4. In the END block, each key of the associative array is printed together with its value.
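The steps above can also be written with an explicit if/else, which some readers may find easier to follow than the ternary. This is a sketch rather than the answer's own code; the wrapper name merge_by_count is hypothetical.

```shell
# Hypothetical wrapper around an expanded form of the one-liner above.
merge_by_count() {
    awk '
    {
        if ($1 in cnt)
            cnt[$1] = cnt[$1] OFS $2   # count seen before: append the word
        else
            cnt[$1] = $2               # first word for this count
    }
    END {
        for (e in cnt)                 # traversal order is unspecified
            print e, cnt[e]
    }' "$@" | sort -nr
}

printf '12 the\n 7 code\n 7 with\n 3 do\n 3 well\n' | merge_by_count
```

Testing membership with in, instead of the truthiness of cnt[$1], should also avoid an edge case in which a stored word that looks like the number 0 is treated as empty.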

Since awk associative arrays are unordered, you need to sort again by the numeric value of the first column. gawk can sort internally, but it is just as easy to call sort. The input to awk does not need to be sorted, so you can eliminate that part of the pipeline.


If you want the digits to be right-justified (as you have in your example):

$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} 
     END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '

If you want gawk to sort numerically by descending values, you can add PROCINFO["sorted_in"]="@ind_num_desc" prior to traversing the array:

$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} 
            END {PROCINFO["sorted_in"]="@ind_num_desc"
               for (e in cnt) printf "%3s %s\n", e, cnt[e]} '



Answer 2:


With a single GNU awk expression and no sort pipeline (PROCINFO["sorted_in"] is gawk-only, so awk here must be GNU awk):

awk 'BEGIN{ PROCINFO["sorted_in"]="@ind_num_desc" }
     { a[$1]=(a[$1])? a[$1]" "$2:$2 }END{ for(i in a) print i,a[i]}' file

The output:

12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

Bonus: an alternative solution using the GNU datamash tool:

datamash -W -g1 collapse 2 <file

The output (comma-separated collapsed fields):

12  the
7   code,with,add
5   quite
3   do,well
1   quick,can,pick,easy



Answer 3:


awk:

awk '{a[$1]=a[$1] FS $2}!b[$1]++{d[++c]=$1}END{while(i++<c)print d[i],a[d[i]]}' file

sed (GNU sed, since \b and \s are GNU extensions):

sed -r ':a;N;s/(\b([0-9]+).*)\n\s*\2/\1/;ta;P;D'
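The awk one-liner in this answer is terse, so here is a commented expansion of the same idea: remember the order in which counts first appear, so no sort is needed afterwards. It is a sketch, not the answer's exact code. The names merge_keep_order, words, order and n are hypothetical, and unlike the original it prints a single space between the count and the first word.

```shell
# Hypothetical wrapper; keeps the groups in input order.
merge_keep_order() {
    awk '
    {
        if (!($1 in words))            # first time this count appears:
            order[++n] = $1            # remember its position
        words[$1] = words[$1] FS $2    # append the word with a leading separator
    }
    END {
        for (i = 1; i <= n; i++)       # replay the counts in first-seen order
            print order[i] words[order[i]]   # words[] already starts with FS
    }' "$@"
}

printf '12 the\n 7 code\n 7 with\n 7 add\n 5 quite\n' | merge_keep_order
```

Because the order comes from the input, this variant relies on the list already being sorted by frequency, as it is in the question.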



Answer 4:


You start with sorted data, so you only need a new line when the first field changes.

echo "12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy" |
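awk '
   {
      if ($1==last) {
         printf(" %s",$2)                   # same count as the previous line: append the word
      } else {
         last=$1
         printf("%s%s",(NR>1?"\n":""),$0)   # new count: end the previous line, start a new one
      }
   }
   # Terminate the final line. A bare print here would output $0 again,
   # because POSIX awk keeps the last record available inside END.
   END { if (NR) print "" }'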



Answer 5:


Next time you find yourself trying to manipulate text with a combination of grep, sed, shell and so on, stop and just use awk instead: the end result will be clearer, simpler, more efficient and more portable.

$ cat file
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness.


$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
    for (i=1; i<=NF; i++) {
        # leading or trailing punctuation yields an empty field; skip it
        if ($i != "") word2cnt[tolower($i)]++
    }
}
END {
    for (word in word2cnt) {
        cnt = word2cnt[word]
        cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
        printf "%3d %s\n", cnt, word
    }
    for (cnt in cnt2words) {
        words = cnt2words[cnt]
        # printf "%3d %s\n", cnt, words
    }
}
$
$ awk -f tst.awk file | sort -rn
  4 was
  4 the
  4 of
  4 it
  2 times
  2 age
  1 worst
  1 wisdom
  1 foolishness
  1 best

With the other printf line enabled instead:

$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
    for (i=1; i<=NF; i++) {
        # leading or trailing punctuation yields an empty field; skip it
        if ($i != "") word2cnt[tolower($i)]++
    }
}
END {
    for (word in word2cnt) {
        cnt = word2cnt[word]
        cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
        # printf "%3d %s\n", cnt, word
    }
    for (cnt in cnt2words) {
        words = cnt2words[cnt]
        printf "%3d %s\n", cnt, words
    }
}
$
$ awk -f tst.awk file | sort -rn
  4 it was of the
  2 age times
  1 best worst wisdom foolishness

Uncomment whichever printf line gives the type of output you want. The script uses only POSIX awk features, so it should work in any POSIX-compliant awk.




Answer 6:


Using Miller's nest verb:

mlr -p nest --implode --values --across-records -f 2 --nested-fs ' ' file

Output:

12 the
7 code with add
5 quite
3 do well
1 quick can pick easy


Source: https://stackoverflow.com/questions/46027733/merging-word-counts-with-bash-and-unix
