How can I count the most frequently occurring sequence of 3 letters within a word with a bash script?

挽巷 2021-01-14 07:31

I have a sample file like this:

XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant

Here I need to grep

3 Answers
  •  执笔经年
    2021-01-14 08:06

    This is an alternative to Ed Morton's solution. It loops less but needs a bit more memory. The idea is to ignore spaces and other non-alphabetic characters while counting, and to filter out every sequence that contains one at the end.

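    awk -v n=3 '# count every n-character window of each line, lowercased
                { for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
                # print only the windows made up entirely of letters
                END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file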
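
Since the question asks for the most frequent sequence, the counts still need to be ranked. The following is a minimal sketch of that step, not part of the original answer: `top_ngrams` is a hypothetical helper name, its three arguments (file, sequence length, number of results) are assumptions, and ties are broken alphabetically.

```shell
# Hypothetical helper: print the most frequent n-letter sequences in a file.
# Usage: top_ngrams FILE [N] [TOP]   (N defaults to 3, TOP to 5)
top_ngrams() {
    local file=$1 n=${2:-3} top=${3:-5}
    awk -v n="$n" '
        # count every n-character window of each line, lowercased
        { for (i = length($0) - n + 1; i > 0; --i) a[tolower(substr($0, i, n))]++ }
        # keep only the windows made up entirely of letters
        END { for (s in a) if (s !~ /[^a-z]/) print s, a[s] }
    ' "$file" |
        # highest count first, then alphabetical order for equal counts
        sort -k2,2nr -k1,1 |
        head -n "$top"
}
```

On the six sample lines from the question, `top_ngrams file 3 2` should print `acc 5` and `cou 5`, so the top spot is shared by two sequences.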
    

    With GNU awk, you can optimize this by making each word its own record, using a regular expression as the record separator. The final filtering step is then unnecessary, as long as the input contains only letters and whitespace:

    awk -v n=3 -v RS='[[:space:]]' '
        # each record is one word; words shorter than n are skipped
        (length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
        END {for(s in a) print s,a[s] }' file
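
If GNU awk is not available, a similar per-word count can be sketched by looping over the whitespace-separated fields of each line instead of changing `RS`. This is an assumption-laden variant, not the answer's own code: `count_word_ngrams` is a hypothetical name, and like the GNU awk version it does not filter out digits or punctuation.

```shell
# Hypothetical helper: count n-character windows inside each word of a file.
# Usage: count_word_ngrams FILE [N]   (N defaults to 3)
count_word_ngrams() {
    awk -v n="${2:-3}" '
        {
            # default field splitting already separates words on whitespace
            for (f = 1; f <= NF; f++) {
                w = tolower($f)
                # slide an n-character window over this word only
                for (i = length(w) - n + 1; i > 0; --i) a[substr(w, i, n)]++
            }
        }
        END { for (s in a) print s, a[s] }
    ' "$1"
}
```

Because the window never crosses a word boundary, no sequence containing a space is ever counted, which is the same effect the record separator achieves above.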
    
