Find all words containing characters in UNIX

误落风尘 2021-02-14 15:18

Given a word W, I want to find all words containing the letters in W from /usr/dict/words. For example, "bat" should return "bat" and "tab" (but not "table").


6 Answers
  • 2021-02-14 15:23

    Here's a shell solution. The best algorithm seems to be #4. It first filters out all words of the wrong length. Then it sums the letters of each remaining word using a simple substitution cipher (a=1, b=2, ..., A=27, ...). Only if the sums match does it actually do the original sort and compare. On my system, it can churn through ~235k words looking for "bat" in just under half a second. I'm providing all of my solutions so you can see the different approaches.
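To see why the letter sum can only act as a pre-filter and not as the final test, here is a minimal sketch of the weighting described above (a=1 ... z=26, A=27 ... Z=52). The name letter_sum is made up for this illustration and is not part of the script in this answer; it favours brevity over speed by looking each letter up in a string.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): print the sum of the
# letter weights of $1, using a=1 ... z=26, A=27 ... Z=52.
# Prints 0 if the word contains anything other than ASCII letters.
letter_sum() {
  local word=$1
  local alphabet=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
  local sum=0 i ch prefix
  for ((i = 0; i < ${#word}; i++)); do
    ch=${word:i:1}
    case $alphabet in
      *"$ch"*) ;;              # known letter, keep going
      *) echo 0; return ;;     # anything else: give up on this word
    esac
    # Everything before the first occurrence of $ch; its length is the
    # zero-based position of the letter in the alphabet string.
    prefix=${alphabet%%"$ch"*}
    sum=$((sum + ${#prefix} + 1))
  done
  echo "$sum"
}

letter_sum bat   # same sum as "tab": the two words are anagrams
letter_sum tab
letter_sum ad    # same sum as "bc", although they are not anagrams
letter_sum bc
```

Equal sums are necessary but not sufficient: "ad" and "bc" have the same length and the same sum without sharing a letter, which is why the sort-and-compare step still has to run on the words that survive the sum check.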

    Update: not shown, but I also tried storing the sum in the first bin of the histogram approach. I thought it would act as a short circuit, but it turned out even slower than the histograms without it.

    Update 2: I tried the awk solution and it runs in roughly a quarter of the time of my best shell solution: ~0.126s versus ~0.490s. The Perl solution runs in ~1.1s.

    #!/bin/bash
    
    word=$1
    #dict=words
    dict=/usr/share/dict/words
    #dict=/usr/dict/words
    
    alg1() {
      sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
    
      while read line
      do
        sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
        if [ "$sortedWord" == "$sortedLine" ]
        then
          echo $line
        fi
      done < $dict
    }
    
    check_sorted_versus_not() {
        local word=$1
        local line=`echo $2 | grep -o . | sort | tr -d '\n'`
        if [ "$word" == "$line" ]
        then
            echo $2
        fi
    }
    
    # Filter out all words of incorrect length
    alg2() {
      sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
      grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
    
      grep "$grep_string" "$dict" | \
      while read line
      do
        sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
        if [ "$sortedWord" == "$sortedLine" ]
        then
          echo $line
        fi
      done
    }
    
    
    # Create a lot of variables like this:
    # _a=1, _b=2, ... _z=26, _A=27, _B=28, ... _Z=52
    gen_chars() {
    #  [ -n "$GEN_CHARS" ] && return
      GEN_CHARS=1
      local alpha="abcdefghijklmnopqrstuvwxyz"
      local upperalpha=`echo -n $alpha | tr 'a-z' 'A-Z'`
      local both="$alpha$upperalpha"
      for ((i=0; i < ${#both}; i++))
      do
        ACHAR=${both:i:1}
        eval "_$ACHAR=$((i+1))"
      done
    }
    
    # I think it's faster to return the value in a variable than to echo it in a subprocess.
    # Try summing the word one char at a time by building an arithmetic expression
    # and then evaluate that expression.
    # Requires: gen_chars
    sum_word() {
      SUM=0
      local s=""
      # parsing input one character at a time
      for ((i=0; i < ${#1}; i++))
      do
        ACHAR=${1:i:1}
        s="$s\$_$ACHAR+"
      done
    
      SUM=$(( $(eval echo -n ${s}0) ))
    }
    
    # I think it's faster to return the value in a variable than to echo it in a subprocess.
    # Try summing the word one char at a time using a case statement.
    sum_word2() {
      SUM=0
      local s=""
      # parsing input one character at a time
      for ((i=0; i < ${#1}; i++))
      do
        ACHAR=${1:i:1}
        case $ACHAR in
        a) SUM=$((SUM+  1));;
        b) SUM=$((SUM+  2));;
        c) SUM=$((SUM+  3));;
        d) SUM=$((SUM+  4));;
        e) SUM=$((SUM+  5));;
        f) SUM=$((SUM+  6));;
        g) SUM=$((SUM+  7));;
        h) SUM=$((SUM+  8));;
        i) SUM=$((SUM+  9));;
        j) SUM=$((SUM+ 10));;
        k) SUM=$((SUM+ 11));;
        l) SUM=$((SUM+ 12));;
        m) SUM=$((SUM+ 13));;
        n) SUM=$((SUM+ 14));;
        o) SUM=$((SUM+ 15));;
        p) SUM=$((SUM+ 16));;
        q) SUM=$((SUM+ 17));;
        r) SUM=$((SUM+ 18));;
        s) SUM=$((SUM+ 19));;
        t) SUM=$((SUM+ 20));;
        u) SUM=$((SUM+ 21));;
        v) SUM=$((SUM+ 22));;
        w) SUM=$((SUM+ 23));;
        x) SUM=$((SUM+ 24));;
        y) SUM=$((SUM+ 25));;
        z) SUM=$((SUM+ 26));;
        A) SUM=$((SUM+ 27));;
        B) SUM=$((SUM+ 28));;
        C) SUM=$((SUM+ 29));;
        D) SUM=$((SUM+ 30));;
        E) SUM=$((SUM+ 31));;
        F) SUM=$((SUM+ 32));;
        G) SUM=$((SUM+ 33));;
        H) SUM=$((SUM+ 34));;
        I) SUM=$((SUM+ 35));;
        J) SUM=$((SUM+ 36));;
        K) SUM=$((SUM+ 37));;
        L) SUM=$((SUM+ 38));;
        M) SUM=$((SUM+ 39));;
        N) SUM=$((SUM+ 40));;
        O) SUM=$((SUM+ 41));;
        P) SUM=$((SUM+ 42));;
        Q) SUM=$((SUM+ 43));;
        R) SUM=$((SUM+ 44));;
        S) SUM=$((SUM+ 45));;
        T) SUM=$((SUM+ 46));;
        U) SUM=$((SUM+ 47));;
        V) SUM=$((SUM+ 48));;
        W) SUM=$((SUM+ 49));;
        X) SUM=$((SUM+ 50));;
        Y) SUM=$((SUM+ 51));;
        Z) SUM=$((SUM+ 52));;
        *) SUM=0; return;;
        esac
      done
    }
    
    # I think it's faster to return the value in a variable than to echo it in a subprocess.
    # Try summing the word by building an arithmetic expression using sed and then evaluating
    # the expression.
    # Requires: gen_chars
    sum_word3() {
      SUM=$(( $(eval echo -n `echo -n $1 | sed -E -ne 's,.,$_&+,pg'`) 0))
      #echo "SUM($1)=$SUM"
    }
    
    # Filter out all words of incorrect length
    # Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
    alg3() {
      gen_chars
      sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
      sum_word $word
      word_sum=$SUM
      grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
    
      grep "$grep_string" "$dict" | \
      while read line
      do
        sum_word $line
        line_sum=$SUM
        if [ $word_sum == $line_sum ]
        then
          check_sorted_versus_not $sortedWord $line
        fi
      done
    }
    
    # Filter out all words of incorrect length
    # Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
    # Use sum_word2
    alg4() {
      sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
      sum_word2 $word
      word_sum=$SUM
      grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
    
      grep "$grep_string" "$dict" | \
      while read line
      do
        sum_word2 $line
        line_sum=$SUM
        if [ $word_sum == $line_sum ]
        then
          check_sorted_versus_not $sortedWord $line
        fi
      done
    }
    
    # Filter out all words of incorrect length
    # Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
    # Use sum_word3
    alg5() {
      gen_chars
      sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
      sum_word3 $word
      word_sum=$SUM
      grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
    
      grep "$grep_string" "$dict" | \
      while read line
      do
        sum_word3 $line
        line_sum=$SUM
        if [ $word_sum == $line_sum ]
        then
          check_sorted_versus_not $sortedWord $line
        fi
      done
    }
    
    
    # I think it's faster to return the value in a variable than to echo it in a subprocess.
    # Try summing the word one char at a time using a case statement.
    # Place results in a histogram
    sum_word4() {
      SUM=(0 0 0 0 0 0 0 0 0 0
           0 0 0 0 0 0 0 0 0 0
           0 0 0 0 0 0 
           0 0 0 0 0 0 0 0 0 0
           0 0 0 0 0 0 0 0 0 0
           0 0 0 0 0 0 
           0)
      # parsing input one character at a time
      for ((i=0; i < ${#1}; i++))
      do
        ACHAR=${1:i:1}
        case $ACHAR in
        a) SUM[1]=$((SUM[ 1] + 1));;
        b) SUM[2]=$((SUM[ 2] + 1));;
        c) SUM[3]=$((SUM[ 3] + 1));;
        d) SUM[4]=$((SUM[ 4] + 1));;
        e) SUM[5]=$((SUM[ 5] + 1));;
        f) SUM[6]=$((SUM[ 6] + 1));;
        g) SUM[7]=$((SUM[ 7] + 1));;
        h) SUM[8]=$((SUM[ 8] + 1));;
        i) SUM[9]=$((SUM[ 9] + 1));;
        j) SUM[10]=$((SUM[10] + 1));;
        k) SUM[11]=$((SUM[11] + 1));;
        l) SUM[12]=$((SUM[12] + 1));;
        m) SUM[13]=$((SUM[13] + 1));;
        n) SUM[14]=$((SUM[14] + 1));;
        o) SUM[15]=$((SUM[15] + 1));;
        p) SUM[16]=$((SUM[16] + 1));;
        q) SUM[17]=$((SUM[17] + 1));;
        r) SUM[18]=$((SUM[18] + 1));;
        s) SUM[19]=$((SUM[19] + 1));;
        t) SUM[20]=$((SUM[20] + 1));;
        u) SUM[21]=$((SUM[21] + 1));;
        v) SUM[22]=$((SUM[22] + 1));;
        w) SUM[23]=$((SUM[23] + 1));;
        x) SUM[24]=$((SUM[24] + 1));;
        y) SUM[25]=$((SUM[25] + 1));;
        z) SUM[26]=$((SUM[26] + 1));;
        A) SUM[27]=$((SUM[27] + 1));;
        B) SUM[28]=$((SUM[28] + 1));;
        C) SUM[29]=$((SUM[29] + 1));;
        D) SUM[30]=$((SUM[30] + 1));;
        E) SUM[31]=$((SUM[31] + 1));;
        F) SUM[32]=$((SUM[32] + 1));;
        G) SUM[33]=$((SUM[33] + 1));;
        H) SUM[34]=$((SUM[34] + 1));;
        I) SUM[35]=$((SUM[35] + 1));;
        J) SUM[36]=$((SUM[36] + 1));;
        K) SUM[37]=$((SUM[37] + 1));;
        L) SUM[38]=$((SUM[38] + 1));;
        M) SUM[39]=$((SUM[39] + 1));;
        N) SUM[40]=$((SUM[40] + 1));;
        O) SUM[41]=$((SUM[41] + 1));;
        P) SUM[42]=$((SUM[42] + 1));;
        Q) SUM[43]=$((SUM[43] + 1));;
        R) SUM[44]=$((SUM[44] + 1));;
        S) SUM[45]=$((SUM[45] + 1));;
        T) SUM[46]=$((SUM[46] + 1));;
        U) SUM[47]=$((SUM[47] + 1));;
        V) SUM[48]=$((SUM[48] + 1));;
        W) SUM[49]=$((SUM[49] + 1));;
        X) SUM[50]=$((SUM[50] + 1));;
        Y) SUM[51]=$((SUM[51] + 1));;
        Z) SUM[52]=$((SUM[52] + 1));;
        *) SUM[53]=-1; return;;
        esac
      done
    
     #echo ${SUM[*]}
    }
    
    # Check if two histograms are equal
    hist_are_equal() {
      # Array sizes differ?
      [ ${#_h1[*]} != ${#SUM[*]} ] && return 1
    
      # parsing input one index at a time
      for ((i=0; i < ${#_h1[*]}; i++))
      do
        [ ${_h1[i]} != ${SUM[i]} ] && return 1
      done
    
      return 0
    }
    
    # Check if two histograms are equal
    hist_are_equal2() {
      # Array sizes differ?
      local size=${#_h1[*]}
      [ $size != ${#SUM[*]} ] && return 1
    
      # parsing input one index at a time
      for ((i=0; i < $size; i++))
      do
        [ ${_h1[i]} != ${SUM[i]} ] && return 1
      done
    
      return 0
    }
    
    # Filter out all words of incorrect length
    # Use sum_word4 which generates a histogram of character frequency
    alg6() {
      sum_word4 $word
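      # Copy the histogram as an array; a plain assignment would flatten it into
      # a single string and the size check in hist_are_equal would always fail.
      _h1=("${SUM[@]}")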
      grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
    
      grep "$grep_string" "$dict" | \
      while read line
      do
        sum_word4 $line
        if hist_are_equal
        then
          echo $line
        fi
      done
    }
    
    # Filter out all words of incorrect length
    # Use sum_word4 which generates a histogram of character frequency
    alg7() {
      sum_word4 $word
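      # Copy the histogram as an array; a plain assignment would flatten it into
      # a single string and the size check in hist_are_equal2 would always fail.
      _h1=("${SUM[@]}")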
      grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"
    
      grep "$grep_string" "$dict" | \
      while read line
      do
        sum_word4 $line
        if hist_are_equal2
        then
          echo $line
        fi
      done
    }
    
    run_test() {
      echo alg$1
      eval time alg$1
    }
    
    #run_test 1
    #run_test 2
    #run_test 3
    run_test 4
    #run_test 5
    run_test 6
    #run_test 7
    
  • 2021-02-14 15:32
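    #!/usr/bin/perl
    use strict;
    use warnings;

    # Canonical form of the search word: its letters in sorted order.
    my $myword = join("", sort split (//, shift @ARGV));
    while (<>) {
        chomp;
        # A line is an anagram if its sorted letters equal the search word's.
        print "$_\n" if (join("", sort split (//)) eq $myword);
    }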
    

    Use it like this: bla.pl < /usr/dict/words searchword

  • 2021-02-14 15:34

    Here's an awk implementation. It finds the dictionary words made up of exactly the letters in "W".

    dict="/usr/share/dict/words"
    word=$1
    awk -vw="$word" 'BEGIN{
      m=split(w,c,"")
      for(p=1;p<=m;p++){ chars[c[p]]++ }
    }
    length($0)==length(w){
      f=0;g=0
      n=split($0,t,"")
      for(o=1;o<=n;o++){
        if (!( t[o] in chars) ){
           f=1; break
        }else{ st[t[o]]++ }
      }
      if (!f || $0==w){
          for(z in st){
            if ( st[z] != chars[z] ) { g=1 ;break}
          }
          if(!g){ print "found: "$0 }
      }
      delete st
    }' $dict
    

    output

    $ wc -l < /usr/share/dict/words
    479829
    
    $ time ./shell.sh look
    found: kolo
    found: look
    
    real    0m1.361s
    user    0m1.074s
    sys     0m0.015s
    

    Update: change of algorithm, using sorting (asort is a gawk extension, so this version needs gawk)

    dict="/usr/share/dict/words"
    awk 'BEGIN{
      w="table"
      m=split(w,c,"")
      b=asort(c,chars)
    }
    length($0)==length(w){
      f=0
      n=split($0,t,"")
      e=asort(t,d)
      for(i=1;i<=e;i++) {
        if(d[i]!=chars[i]){
            f=1;break
        }
      }
      if(!f) print $0
    }' $dict
    

    output

    $ time ./shell.sh #looking for table
    ablet
    batel
    belat
    blate
    bleat
    tabel
    table
    
    real    0m1.416s
    user    0m1.343s
    sys     0m0.014s
    
    $ time ./shell.sh #looking for chairs
    chairs
    ischar
    rachis
    
    real    0m1.697s
    user    0m1.660s
    sys     0m0.014s
    
    $ time perl perl.pl #using beamrider's Perl script
    table
    tabel
    ablet
    batel
    blate
    bleat
    belat
    
    real    0m2.680s
    user    0m1.633s
    sys     0m0.881s
    
    $ time perl perl.pl # looking for chairs
    chairs
    ischar
    rachis
    
    real    0m14.044s
    user    0m8.328s
    sys     0m5.236s
    
  • You want to find words containing only a given set of characters. A regex for that would be:

    '^[letters_you_care_about]*$'
    

    So, you could do:

    grep "^[$W]*$" /usr/dict/words
    

    The '^' matches the beginning of the line; '$' matches the end of the line. Together they force the whole word to match, so a word that merely contains the letters (e.g. "table") is rejected.

    '[' and ']' define the set of characters allowed at one character position of the input. We use this to find words in /usr/dict/words that contain only the characters in $W.

    The '*' repeats the preceding element (the '[...]' set) zero or more times, so this finds words of any length in which every character is in $W. Note that it also accepts shorter words and repeated letters, such as "at" or "tat" for W="bat".
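Since a character class alone constrains which letters appear but not how often, one way to get exact matches is to use grep only as a cheap first stage and then compare sorted letters. The sketch below assumes the search word consists of plain ASCII letters (so it is safe inside a bracket expression); the helper names sorted_letters and anagrams are invented for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the letters of $1 in sorted order, e.g. tab -> abt.
sorted_letters() {
  printf '%s\n' "$1" | grep -o . | LC_ALL=C sort | tr -d '\n'
}

# Hypothetical helper: print the lines of word list $2 that use exactly the
# letters of $1. Assumes $1 contains only ASCII letters.
anagrams() {
  local word=$1 dict=$2 target line
  target=$(sorted_letters "$word")
  # Stage 1: keep lines of the right length built only from the letters of $word.
  grep -E "^[$word]{${#word}}\$" "$dict" |
  while IFS= read -r line; do
    # Stage 2: the sorted letters must agree, which rejects words such as
    # "tat" that reuse one letter and drop another.
    if [ "$(sorted_letters "$line")" = "$target" ]; then
      printf '%s\n' "$line"
    fi
  done
}
```

Called as anagrams bat /usr/dict/words, the first stage already drops "table" and "at", and the second stage drops candidates like "tat" that the bracket expression cannot tell apart from real anagrams.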

  • 2021-02-14 15:41

    This utility might interest you:

    an -w "tab" -m 3
    

    ...gives bat and tab only.

    The original author seems to not be around any more, but you can find information at http://packages.qa.debian.org/a/an.html (even if you don't want to use it itself, the source might be worth a look).

  • 2021-02-14 15:42

    So we have the following:

    n = length of input word
    L = lines in dictionary file

    If n tends to be small and L tends to be huge, might we be better off finding all permutations of the input word and looking for those, rather than doing something (like sorting) to all L lines of the dictionary file? (Actually, since finding all permutations of a word is O(n!), and we have to run through the entire dictionary file once for each permutation, maybe not, but I wrote the code anyway.)

    This is Perl - I know you wanted command-line operations but I don't have a way to do that in shell script that's not super-hacky:

    sub dedupe {
        my (@list) = @_;
        my (@new_list, %seen_entries, $entry);
    
        foreach $entry (@list) {
            if (!(defined($seen_entries{$entry}))) {
                push(@new_list, $entry);
                $seen_entries{$entry} = 1;
            }
        }
    
        return @new_list;
    }
    
    sub find_all_permutations {
        my ($word) = @_;
        my (@permutations, $subword, $letter, $rest_of_word, $i);
    
        if (length($word) == 1) {
            push(@permutations, $word);
        } else {   
            for ($i=0; $i<length($word); $i++) {
                $letter = substr($word, $i, 1);
                $rest_of_word = substr($word, 0, $i) . substr($word, $i + 1);
                foreach $subword (find_all_permutations($rest_of_word)) {
                    push(@permutations, $letter . $subword);
                }            
            }
        }
    
        return @permutations;
    }
    
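    $words_file = '/usr/share/dict/words';
    $word = 'table';
    
    @all_permutations = dedupe(find_all_permutations($word));
    foreach $permutation (@all_permutations) {
        # grep -c prints the number of matching lines; -m 1 stops at the first match.
        # The pattern is quoted so the shell passes ^...$ to grep unchanged.
        if (`grep -c -m 1 '^$permutation\$' $words_file` == 1) {
            print $permutation . "\n";
        }
    }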
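For what it's worth, the permutation idea can also be sketched in bash, and the one-grep-per-permutation cost can be avoided by handing grep the whole list of permutations as fixed-string patterns, so the dictionary is read only once. This is an illustration under assumptions, not a translation of the Perl above: the function names permute and anagrams_by_permutation are made up, and generating the list is still O(n!), so it only makes sense for short words.

```shell
#!/bin/bash
# Hypothetical helper: print every permutation of $2, one per line.
# $1 is the prefix built so far; start with an empty prefix.
permute() {
  local prefix=$1 rest=$2 i
  if [ -z "$rest" ]; then
    printf '%s\n' "$prefix"
    return
  fi
  for ((i = 0; i < ${#rest}; i++)); do
    # Move the i-th remaining letter onto the prefix and recurse on the others.
    permute "$prefix${rest:i:1}" "${rest:0:i}${rest:i+1}"
  done
}

# Hypothetical helper: print the lines of word list $2 that are permutations of $1.
anagrams_by_permutation() {
  local word=$1 dict=$2
  [ -n "$word" ] || return 0
  # sort -u removes the duplicates produced by repeated letters.
  # grep -F treats each pattern as a fixed string, -x requires a whole-line
  # match, and -f reads the patterns from the process substitution.
  grep -F -x -f <(permute "" "$word" | sort -u) "$dict"
}
```

For example, anagrams_by_permutation table /usr/share/dict/words generates 120 candidate strings and makes a single pass over the dictionary, printing matches in dictionary order.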
    