问题
I made a Bash script that extracts words from a text file with grep and sed and then sorts them with sort and counts the repetitions with wc, then sort again by frequency. The example output looks like this:
12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy
Now I'd like to merge all words with the same frequency into one line, like this:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Is there any way to do that with Bash and standard Unix toolset? Or I would have to write a script / program in some more sophisticated scripting language?
回答1:
With awk
:
$ echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
You can do something similar with Bash 4 associative arrays. awk
is easier and POSIX though. Use that.
Explanation:
awk
splits the line apart by the separator in FS, in this case the default of horizontal whitespace;$1
is the first field of the count - use that to collect items with the same count in an associative array keyed by the count withcnt[$1]
;cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2
is a ternary assignment - ifcnt[$1]
has no value, just assign the second field$2
to it (The RH of:
). If it does have a previous value, concatenate$2
separated by the value ofOFS
(the LH of:
);- At the end, print out the value of the associative array.
Since awk associative arrays are unordered, you need to sort again by the numeric value of the first column. gawk
can sort internally, but it is just as easy to call sort
. The input to awk does not need to be sorted, so you can eliminate that part of the pipeline.
If you want the digits to be right justified (as your have in your example):
$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
If you want gawk
to sort numerically by descending values, you can add PROCINFO["sorted_in"]="@ind_num_desc"
prior to traversing the array:
$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {PROCINFO["sorted_in"]="@ind_num_desc"
for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
回答2:
With single GNU awk expression (without sort
pipeline):
awk 'BEGIN{ PROCINFO["sorted_in"]="@ind_num_desc" }
{ a[$1]=(a[$1])? a[$1]" "$2:$2 }END{ for(i in a) print i,a[i]}' file
The output:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Bonus alternative solution using GNU datamash
tool:
datamash -W -g1 collapse 2 <file
The output (comma-separated collapsed fields):
12 the
7 code,with,add
5 quite
3 do,well
1 quick,can,pick,easy
回答3:
awk:
awk '{a[$1]=a[$1] FS $2}!b[$1]++{d[++c]=$1}END{while(i++<c)print d[i],a[d[i]]}' file
sed:
sed -r ':a;N;s/(\b([0-9]+).*)\n\s*\2/\1/;ta;P;D'
回答4:
You start with sorted data, so you only need a new line when the first field changes.
echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" |
awk '
{
if ($1==last) {
printf(" %s",$2)
} else {
last=$1;
printf("%s%s",(NR>1?"\n":""),$0)
}
}; END {print}'
回答5:
next time you find yourself trying to manipulate text with a combination of grep and sed and shell and..., stop and just use awk instead - the end result will be clearer, simpler, more efficient, more portable, etc...
$ cat file
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness.
.
$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
for (i=1; i<NF; i++) {
word2cnt[tolower($i)]++
}
}
END {
for (word in word2cnt) {
cnt = word2cnt[word]
cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
printf "%3d %s\n", cnt, word
}
for (cnt in cnt2words) {
words = cnt2words[cnt]
# printf "%3d %s\n", cnt, words
}
}
$
$ awk -f tst.awk file | sort -rn
4 was
4 the
4 of
4 it
2 times
2 age
1 worst
1 wisdom
1 foolishness
1 best
.
$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
for (i=1; i<NF; i++) {
word2cnt[tolower($i)]++
}
}
END {
for (word in word2cnt) {
cnt = word2cnt[word]
cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
# printf "%3d %s\n", cnt, word
}
for (cnt in cnt2words) {
words = cnt2words[cnt]
printf "%3d %s\n", cnt, words
}
}
$
$ awk -f tst.awk file | sort -rn
4 it was of the
2 age times
1 best worst wisdom foolishness
Just uncomment whichever printf
line you like in the above script to get whichever type of output you want. The above will work in any awk on any UNIX system.
回答6:
Using miller's nest
verb:
mlr -p nest --implode --values --across-records -f 2 --nested-fs ' ' file
Output:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
来源:https://stackoverflow.com/questions/46027733/merging-word-counts-with-bash-and-unix