A possible algorithm for determining whether two strings are anagrams of one another? [closed]

微笑、不失礼 submitted on 2019-12-17 12:12:36

Question


I have this idea (using C language) for checking whether two strings formed from ASCII letters are anagrams of one another:

  1. Check if the strings are the same length.

  2. Check if the sum of the ASCII values of all chars is the same for both strings.

  3. Check if the product of the ASCII values of all chars is the same for both strings.

I believe that if all three are correct, then the strings must be anagrams of one another. However, I can't prove it. Can someone help me prove or disprove that this would work?
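For concreteness, the three checks can be sketched as follows. The sketch is in Python rather than C only so that the product cannot overflow; the function name `passes_three_checks` is made up for illustration.

```python
from math import prod

def passes_three_checks(s, t):
    """Return True if s and t pass the length, sum and product checks."""
    codes_s = [ord(ch) for ch in s]
    codes_t = [ord(ch) for ch in t]
    return (len(codes_s) == len(codes_t)          # check 1: same length
            and sum(codes_s) == sum(codes_t)      # check 2: same ASCII sum
            and prod(codes_s) == prod(codes_t))   # check 3: same ASCII product

print(passes_three_checks("listen", "silent"))  # True
print(passes_three_checks("listen", "silenz"))  # False (the sums differ)
```

Every true anagram pair passes these checks, since reordering changes neither the sum nor the product. The open question is whether a pair that is not an anagram can pass as well.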

Thanks!


Answer 1:


I wrote a quick program to brute-force search for conflicts and found that this approach does not always work. The strings ABFN and AAHM have the same ASCII sum and product, but are not anagrams of one another. Their ASCII sum is 279 and ASCII product is 23,423,400.
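The arithmetic for this pair is easy to confirm. A short Python check (the helper name `sum_and_product` is only illustrative):

```python
from math import prod

def sum_and_product(word):
    """Return (sum, product) of the character codes of word."""
    codes = [ord(ch) for ch in word]
    return sum(codes), prod(codes)

print(sum_and_product("ABFN"))            # (279, 23423400)
print(sum_and_product("AAHM"))            # (279, 23423400)
print(sorted("ABFN") == sorted("AAHM"))   # False: different letters
```

Here 65 + 66 + 70 + 78 = 65 + 65 + 72 + 77 = 279, and 65 * 66 * 70 * 78 = 65 * 65 * 72 * 77 = 23,423,400.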

There are a lot more conflicts than this. My program, searching over all length-four strings, found 11,737 conflicts.

For reference, here's the C++ source code:

#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

int main() {
  /* Sparse 2D table where used[sum][prod] is either nothing or is a string
   * whose characters sum to "sum" and whose product is "prod".
   */
  map<int, map<int, string> > used;

  /* List of all usable characters in the string. */
  vector<char> usable;
  for (char ch = 'A'; ch <= 'Z'; ch++) {
    usable.push_back(ch);
  }
  for (char ch = 'a'; ch <= 'z'; ch++) {
    usable.push_back(ch);
  }

  /* Brute-force search over all possible length-four strings.  To avoid
   * iterating over anagrams, the search only explores strings whose letters
   * are in increasing ASCII order.
   */
  for (int a = 0; a < usable.size(); a++) {
    for (int b = a; b < usable.size(); b++) {
      for (int c = b; c < usable.size(); c++) {
        for (int d = c; d < usable.size(); d++) {
          /* Compute the sum and product. */
          int sum  = usable[a] + usable[b] + usable[c] + usable[d];
          int prod = usable[a] * usable[b] * usable[c] * usable[d];

          /* See if we have already seen this. */
          if (used.count(sum) &&
              used[sum].count(prod)) {
            cout << "Conflict found: " << usable[a] << usable[b] << usable[c] << usable[d] << " conflicts with " << used[sum][prod] << endl;
          }

          /* Update the table. */
          used[sum][prod] = string() + usable[a] + usable[b] + usable[c] + usable[d];
        }
      }
    }
  }
}

Hope this helps!




Answer 2:


Your approach does not work. I can't explain exactly why, but there are distinct sets, at least for cardinality 3, that have the same sum and product: https://math.stackexchange.com/questions/38671/two-sets-of-3-positive-integers-with-equal-sum-and-product
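Small cases are easy to enumerate. For instance, {2, 8, 9} and {3, 4, 12} both have sum 19 and product 144. The Python sketch below groups multisets by (sum, product); the function name and the range 1..12 are illustrative choices, not taken from the linked post.

```python
from collections import defaultdict
from itertools import combinations_with_replacement
from math import prod

def sum_product_collisions(values, size):
    """Map (sum, product) to the multisets of `size` values that share it.

    Only keys shared by at least two different multisets are returned.
    """
    groups = defaultdict(list)
    for combo in combinations_with_replacement(sorted(values), size):
        groups[(sum(combo), prod(combo))].append(combo)
    return {key: combos for key, combos in groups.items() if len(combos) > 1}

collisions = sum_product_collisions(range(1, 13), 3)
print(collisions[(19, 144)])   # [(2, 8, 9), (3, 4, 12)]
```

The same grouping applied to ASCII letter codes is what a brute-force search for conflicting strings amounts to.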




Answer 3:


The letters a-z and A-Z are used to index an array of 26 distinct primes, and the product of these primes is used as a hash value for the word. Because prime factorisation is unique, equal product <--> same letters (ignoring case), as long as the product does not overflow.

(The order of the hash values in the primes26[] array in the fragment below is based on the letter frequencies in the Dutch language, in an attempt to minimise the expected product.)

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define COUNTOF(a) (sizeof (a)/ sizeof (a)[0])

typedef unsigned long long HashVal;
HashVal hashmem (char *str, size_t len);

unsigned char primes26[] =
{
5,71,79,19,2,83,31,43,11,53,37,23,41,3,13,73,101,17,29,7,59,47,61,97,89,67,
};

struct anahash {
        struct anahash *next;
        unsigned freq;
        HashVal hash;
        char word[1];
        };

struct anahash *hashtab[1024*1024] = {NULL,};
struct anahash *new_word(char *str, size_t len);
struct anahash **hash_find(struct anahash *wp);

/*********************************************/

HashVal hashmem (char *str, size_t len)
{
size_t idx;
HashVal val=1;

if (!len) return 0;
for (idx = 0; idx < len; idx++) {
        char ch = str[idx];
        if (ch >= 'A' && ch <= 'Z' ) val *= primes26[ ch - 'A'];
        else if (ch >= 'a' && ch <= 'z' ) val *= primes26[ ch - 'a'];
        else continue;
        }
return val;
}

struct anahash *new_word(char *str, size_t len)
{
struct anahash *wp;
if (!len) len = strlen(str);

wp = malloc(len + sizeof *wp );
if (!wp) return NULL; /* the caller in main() already checks for NULL */
wp->hash = hashmem(str, len);
wp->next = NULL;
wp->freq = 0;
memcpy (wp->word, str, len);
wp->word[len] = 0;
return wp;
}

struct anahash **hash_find(struct anahash *wp)
{
unsigned slot;
struct anahash **pp;

slot = wp->hash % COUNTOF(hashtab);

for (pp = &hashtab[slot]; *pp; pp= &(*pp)->next) {
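        /* Chains are ordered by hash first, then by word within equal hashes. */
        if ((*pp)->hash < wp->hash) continue;
        if ((*pp)->hash == wp->hash && strcmp( wp->word, (*pp)->word ) > 0) continue;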
        break;
        }
return pp;
}

char buff [16*4096];
int main (void)
{
size_t pos,end;
struct anahash *wp, **pp;
HashVal val = 0; /* initialised so the first comparison below is well defined */

memset(hashtab, 0, sizeof hashtab);

while (fgets(buff, sizeof buff, stdin)) {
        for (pos=0; pos < sizeof buff && buff[pos]; ) {
                for(end = pos; end < sizeof buff && buff[end]; end++ ) {
                        if (buff[end] < 'A' || buff[end] > 'z') break;
                        if (buff[end] > 'Z' && buff[end] < 'a') break;
                        }
                if (end > pos) {
                        wp = new_word(buff+pos, end-pos);
                        if (!wp) {pos=end; continue; }
                        pp = hash_find(wp);
                        if (!*pp) *pp = wp;
                        else if ((*pp)->hash == wp->hash
                         && !strcmp((*pp)->word , wp->word)) free(wp);
                        else { wp->next = *pp; *pp = wp; }
                        (*pp)->freq +=1;
                        }
                pos = end;
                for(end = pos; end < sizeof buff && buff[end]; end++ ) {
                        if (buff[end] >= 'A' && buff[end] <= 'Z') break;
                        if (buff[end] >= 'a' && buff[end] <= 'z') break;
                        }
                pos = end;
                }
        }
for (pos = 0;  pos < COUNTOF(hashtab); pos++) {
        if (! hashtab[pos] ) continue; /* skip empty slots */

        for (pp = &hashtab[pos]; wp = *pp; pp = &wp->next) {
                if (val != wp->hash) {
                        fprintf (stdout, "\nSlot:%zu:\n", pos );
                        val = wp->hash;
                        }
                fprintf (stdout, "\t%llx:%u:%s\n", wp->hash, wp->freq, wp->word);
                }
        }

return 0;
}
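The core idea of the program above, stripped of the hash table, can be sketched in a few lines of Python. The table is copied from primes26[]; the function name `prime_signature` is made up for illustration. Python integers do not overflow, so here equal signatures really do mean the same letters. In the C version the `unsigned long long` product can wrap around for long enough words, so equal hashes there are only strong evidence.

```python
from math import prod

# Same table as primes26[] above: one distinct prime per letter a..z.
PRIMES26 = [5, 71, 79, 19, 2, 83, 31, 43, 11, 53, 37, 23, 41,
            3, 13, 73, 101, 17, 29, 7, 59, 47, 61, 97, 89, 67]

def prime_signature(word):
    """Product of the primes assigned to the ASCII letters of word.

    Case-insensitive; other characters are skipped. Unlike hashmem(),
    an empty word gives 1 rather than 0.
    """
    return prod(PRIMES26[ord(ch) - ord('a')]
                for ch in word.lower() if 'a' <= ch <= 'z')

print(prime_signature("listen") == prime_signature("silent"))  # True
print(prime_signature("ABFN") == prime_signature("AAHM"))      # False
```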



Answer 4:


Thanks for such a great question! Instead of trying to disprove your proposition altogether, I spent some time trying to find ways to augment it so that it becomes true. I have the sense that if the standard deviations are also equal, then the two strings are anagrams. But instead of testing that far, I did a simpler test and have not found a counterexample yet. Here is what I have tested:

In addition to the conditions you mentioned before,

  • The square root of the sum of the squares of the ASCII values must be equal.

I used the following Python program. I have no complete proof, but maybe my response will help. Anyway, take a look.

from math import sqrt

class Nothing:

  def equalString( self, strA, strB ):
    # Accumulate the product, the sum, and the sum of squares of the codes.
    prodA, prodB = 1, 1
    sumA, sumB = 0, 0
    geoA, geoB = 0, 0

    for a in strA:
      i = ord( a )
      prodA *= i
      sumA += i
      geoA += ( i ** 2 )
    geoA = sqrt( geoA )

    for b in strB:
      i = ord( b )
      prodB *= i
      sumB += i
      geoB += ( i ** 2 )
    geoB = sqrt( geoB )

    return prodA == prodB and sumA == sumB and geoA == geoB


  def compareStrings( self ):
    # Note: the range 'A'..'z' also covers the six punctuation characters
    # between 'Z' and 'a'.
    first, last = ord( 'A' ), ord( 'z' )
    for a in range( first, last + 1 ):
      for b in range( a, last + 1 ):
        for c in range( b, last + 1 ):
          for d in range( c, last + 1 ):
            strA = chr( a ) + chr( b ) + chr( c ) + chr( d )
            # strB is strA reversed, so every pair tested here is a true
            # anagram. The loop therefore only checks that anagrams pass;
            # it does not look for non-anagrams that pass as well.
            strB = chr( d ) + chr( c ) + chr( b ) + chr( a )

            if not self.equalString( strA, strB ):
              print( "%s and %s should be equal.\n" % ( strA, strB ) )

    print( "Done" )


if __name__ == "__main__":
  Nothing().compareStrings()
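To look for an actual counterexample to the augmented test, the search has to compare strings that are not anagrams of each other. A minimal sketch, assuming only ASCII letters and grouping sorted letter combinations by (sum, product, sum of squares); the function name is hypothetical. Comparing the sums of squares directly is equivalent to comparing their square roots and avoids floating point.

```python
from collections import defaultdict
from itertools import combinations_with_replacement
from math import prod
import string

def augmented_collisions(alphabet, length):
    """Return groups of different letter multisets that share the sum,
    the product and the sum of squares of their character codes."""
    groups = defaultdict(list)
    # Each multiset is visited once, as a string with its letters sorted.
    for combo in combinations_with_replacement(sorted(alphabet), length):
        codes = [ord(ch) for ch in combo]
        key = (sum(codes), prod(codes), sum(c * c for c in codes))
        groups[key].append("".join(combo))
    return [words for words in groups.values() if len(words) > 1]

if __name__ == "__main__":
    # Every line printed is a set of non-anagrams that the test cannot tell apart.
    for words in augmented_collisions(string.ascii_letters, 4):
        print(" / ".join(words))
```

No output is claimed here. If the search over length-four strings prints nothing, that is evidence for that length only, not a proof for longer strings.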



Answer 5:


If you don't mind modifying the strings, sort each of them and compare the two signatures.
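A minimal Python sketch of this idea follows. In Python, `sorted` returns a new list, so the inputs are left untouched; in C, sorting in place (for example with `qsort`) is what modifies the strings. The second function is a different technique, counting characters instead of sorting, included for comparison; both function names are illustrative.

```python
from collections import Counter

def is_anagram_sorted(s, t):
    """Sort the characters of both strings and compare: O(n log n)."""
    return sorted(s) == sorted(t)

def is_anagram_counted(s, t):
    """Compare per-character counts instead of sorting: O(n)."""
    return Counter(s) == Counter(t)

print(is_anagram_sorted("listen", "silent"))   # True
print(is_anagram_sorted("ABFN", "AAHM"))       # False
print(is_anagram_counted("listen", "silent"))  # True
```

Both versions are exact: unlike the sum and product test, they cannot report a false positive.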



Source: https://stackoverflow.com/questions/14739186/a-possible-algorithm-for-determining-whether-two-strings-are-anagrams-of-one-ano
