Counting palindromes in a text file

匆匆过客 提交于 2019-12-08 09:01:30

问题


Having followed this thread BASH Finding palindromes in a .txt file I can't figure out what am I doing wrong with my script.

#!/bin/bash
search() {
tr -d '[[:punct:][:digit:]@]' \
| sed -E -e '/^(.)\1+$/d'      \
| tr -s '[[:space:]]'           \
| tr '[[:space:]]' '\n'
}

search "$1"

paste <(search <"$1") <(search < "$1" | rev)     \
| awk '$1 == $2 && (length($1) >=3) { print $1 }' \
| sort | uniq -c

All im getting from this script is output of the whole text file. I want to only output palindromes >=3 and count them such as

425 did

120 non

etc. My textfile is called sample.txt and everytime i run the script with: cat sample.txt | source palindrome I get message 'bash: : No such file or directory'.


回答1:


Using awk and sed

awk 'function palindrome(str) {len=length(str); for(k=1; k<=len/2+len%2; k++) { if(substr(str,k,1)!=substr(str,len+1-k,1)) return 0 } return 1 } {for(i=1; i<=NF; i++) {if(length($i)>=3){ gsub(/[^a-zA-Z]/,"",$i); if(length($i)>=3) {$i=tolower($i); if(palindrome($i)) arr[$i]++ }} } } END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'

Tested on 1.2GB file and execution time was ~4m 40s (i5-6440HQ @ 2.60GHz/4 cores/16GB)

Explanation :

awk '
    function palindrome(str)               # Function to check Palindrome
    {
        len=length(str); 
        for(k=1; k<=len/2+len%2; k++) 
        { 
            if(substr(str,k,1)!=substr(str,len+1-k,1)) 
            return 0 
        } 
        return 1 
    } 

    {
        for(i=1; i<=NF; i++)               # For Each field in a record
        {
            if(length($i)>=3)              # if length>=3
            { 
                gsub(/[^a-zA-Z]/,"",$i);   # remove non-alpha character from it
                if(length($i)>=3)          # Check length again after removal
                {
                    $i=tolower($i);        # Covert to lowercase
                    if(palindrome($i))     # Check if it's palindrome
                        arr[$i]++          # and store it in array
                }
            }
        } 
    } 

    END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d' 

sed -E '/^[0-9]+ (.)\1+$/d' : From the final result check which strings are composed of just repeated chracters like AAA, BBB etc and remove them.


Old Answer (Before EDIT)

You can try below steps if you want to :

Step 1 : Pre-processing
Remove all unnecessary chars and store the result in temp file

tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp

tr -dc 'a-zA-Z\n\t ' This will remove all except letters,\n,\t, space

tr ' ' '\n' This will convert space to \n to separate each word in newlines

Step-2: Processing

grep -wof temp <(rev temp)  | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'

grep -wof temp <(rev temp) This will give you all palindromes
-w : Select only those lines containing matches that form whole words. For example : level won't match with levelAAA
-o : Print only the matched group
-f : To use each string in temp file as pattern to search in <(rev temp)

sed -E -e '/^(.)\1+$/d': This will remove words formed of same letters like AAA, BBBBB

awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }' : This will filter words having length>=3 and counts their frequency and finally prints the result

Example :

Input File :

$ cat file
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done

Output:

$ tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
$ grep -wof temp <(rev temp)  | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }' 
3 dad
3 kayak
3 bob



回答2:


Just a quick Perl alternative:

perl -0nE 'for( /(\w{3,})/g ){ $a{$_}++ if $_ eq reverse($_)}
           END {say "$_ $a{$_}" for keys %a}'
  • in Perl, $_ should be read as "it".
  • for( /(\w{3,})/g ) ... for all relevant words (may need some work to reject false positives like "12a21")
  • if $_ eq reverse($_) ... if it is palindrome
  • END {say "$_ $a{$_}" for...} ... tell us all the its and its number

\thanks{sokowi,batMan}




回答3:


Running the Script

The script expects that the file is given as an argument. The script does not read stdin.

Remove the line search "$1" in the middle of the script. It is not part of the linked answer.

Make the script executable using chmod u+x path/to/palindrome.

Call the script using path/to/palindrome path/to/sample.txt. If all the files are in the current working directory, then the command is

./palindrome sample.txt

Alternative Script

Sometimes the linked script works and sometimes it doesn't. I haven't found out why. However, I wrote an alternative script which does the same and is also a bit cleaner:

#! /bin/bash
grep -Po '\w{3,}' "$1" |  grep -Evw '(.)\1*' | sort > tmp-words
grep -Fwf <(rev tmp-words) tmp-words | uniq -c
rm tmp-words

Save the script, make it executable, and call it with a file as its first argument.



来源:https://stackoverflow.com/questions/46865316/counting-palindromes-in-a-text-file

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!