How can I count the most frequently occurring sequence of 3 letters within a word with a bash script?

挽巷 2021-01-14 07:31

I have a sample file like

XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant

Here I need to grep

3 Answers
  • 2021-01-14 07:56

    Here's how to get started with what I THINK you're trying to do:

    $ cat tst.awk
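    BEGIN { stringLgth = 3 }    # length of the letter sequences to count
    {
        # Process each whitespace-separated word on the line separately
        for (fldNr=1; fldNr<=NF; fldNr++) {
            field = $fldNr
            fieldLgth = length(field)
            if ( fieldLgth >= stringLgth ) {
                # Last position at which a full-length substring can start
                maxBegPos = fieldLgth - (stringLgth - 1)
                for (begPos=1; begPos<=maxBegPos; begPos++) {
                    # Count case-insensitively by lowercasing each substring
                    string = tolower(substr(field,begPos,stringLgth))
                    cnt[string]++
                }
            }
        }
    }
    END {
        # Print every substring with its count (order is unspecified)
        for (string in cnt) {
            print string, cnt[string]
        }
    }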
    

    Running it on the sample file and sorting by count, highest first:

    $ awk -f tst.awk file | sort -k2,2nr
    acc 5
    cou 5
    cco 4
    ing 4
    nti 4
    oun 4
    tin 4
    unt 4
    aco 3
    abc 1
    ant 1
    any 1
    bca 1
    cac 1
    cal 1
    com 1
    con 1
    fir 1
    ica 1
    irm 1
    lta 1
    mpa 1
    nsu 1
    omp 1
    ons 1
    ous 1
    pan 1
    sti 1
    sul 1
    tan 1
    tic 1
    ult 1
    ust 1
    xyz 1
    yza 1
    zac 1
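
    Since the question asks for the most frequent sequence only, the counting above can be reduced to the top entry. The following is a minimal sketch, assuming that tied sequences should all be reported. The function name `top_ngrams` and its optional length argument are hypothetical and not part of the answer above:

```shell
# Hypothetical helper: print the most frequent n-letter string(s) in a file.
# Usage: top_ngrams FILE [N]    (N defaults to 3)
top_ngrams() {
    awk -v n="${2:-3}" '
        BEGIN { max = 0 }
        {
            # Slide an n-character window over every word on the line
            for (f = 1; f <= NF; f++) {
                for (i = length($f) - n + 1; i > 0; i--) {
                    s = tolower(substr($f, i, n))
                    # Track the highest count seen so far
                    if (++cnt[s] > max) max = cnt[s]
                }
            }
        }
        END {
            # Report every string that reaches the maximum (ties included)
            for (s in cnt) if (cnt[s] == max) print s, cnt[s]
        }
    ' "$1" | sort
}
```

    For the sample file, `top_ngrams file` prints `acc 5` and `cou 5`, which matches the two top rows of the sorted output above.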
    
  • 2021-01-14 08:06

    This is an alternative to Ed Morton's solution. It loops less but needs a bit more memory. The idea is to ignore word boundaries while counting: every 3-character window of the line is counted, including windows that contain spaces or other non-alphabetic characters, and those are filtered out at the end.

    awk -v n=3 '{ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
                END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file
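
    To see why the filter in the END block is needed, it can help to list the windows it throws away. This is only an illustrative sketch: the helper name `rejected_windows` is hypothetical, and the brackets are added just to make the spaces visible.

```shell
# Hypothetical helper: show the 3-character windows of a line that contain
# something other than a lowercase letter, i.e. what the END filter drops.
rejected_windows() {
    printf '%s\n' "$1" |
    awk -v n=3 '
        { for (i = length($0) - n + 1; i > 0; --i) a[tolower(substr($0, i, n))]++ }
        END { for (s in a) if (s ~ /[^a-z]/) print "[" s "]" }' |
    LC_ALL=C sort    # byte order, so the result does not depend on the locale
}
```

    For the line `Accounting firm`, `rejected_windows 'Accounting firm'` lists `[ fi]`, `[g f]` and `[ng ]`: the three windows that straddle the space between the two words.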
    

    With GNU awk you can instead make each word its own record by setting RS to a regular expression. Substrings then never span whitespace, so the filter in the END block is no longer needed:

    awk -v n=3 -v RS='[[:space:]]' '
        (length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
        END {for(s in a) print s,a[s] }' file
    
  • 2021-01-14 08:10

    This might work for you (GNU sed, sort and uniq):

    sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file |
    sort |
    uniq -c |
    sort -s -k1,1rn |
    sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
    

    Use the first sed invocation to output every three-letter sequence in lowercase, one per line.

    Sort the words.

    Count the duplicates.

    Sort the counts in reverse numerical order; the stable sort (-s) keeps equal counts in alphabetical order.

    Use the second sed invocation to manipulate the results into the desired format.
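
    The first step is the least obvious one, so here is a small sketch that runs it on a single word. It assumes GNU sed, since `\L`, `\S` and `\n` in the replacement are GNU extensions; the wrapper name `sliding_windows` is hypothetical.

```shell
# Hypothetical wrapper around the first sed stage (GNU sed assumed).
# s/.(..)/\L&\n\1/  lowercases the first three characters and re-inserts the
#                   last two of them after a newline, so the window overlaps
# /^\S{3}/P         prints the window only if it contains no whitespace
# D                 deletes up to the newline and restarts on the remainder
sliding_windows() {
    printf '%s\n' "$1" | sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D'
}
```

    Calling `sliding_windows 'XYZAcc'` prints `xyz`, `yza`, `zac` and `acc` on separate lines, which are the strings the later `sort | uniq -c` stages count.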


    If you only want sequences that occur more than once, listed in alphabetical order and matched case-sensitively, use:

    sed -E 's/.(..)/&\n\1/;/^\S{3}/P;D' file |
    sort |
    uniq -cd |
    sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
    