Optimizing very often used anagram function

问题

I have written a function that determines whether two words are anagrams. Word A is an anagram of word B if you can build word B out of A just by rearranging the letters, e.g.:

lead is anagram of deal

This is my function:

bool is_anagram(std::string const & s1, std::string const & s2)
{
    auto check = [](std::string const & x)
    {
        std::map<char, unsigned> counter;
        for(auto const & c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}

This works fine, but as the number of words increases (and this function is used several million times within my application), it very soon became a major bottleneck of my application.

Does anyone have an idea of how to speed this function up?

回答1:

The map creation and your call to std::map::find within the iteration, is quite expensive.

In this case, you can use the fact that std::string behaves in many regards like a std::vector<char>, meaning that you can sort it using std::sort:

bool is_anagram(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

Instead of the two maps that you are creating, I am taking a copy of the strings (passing them by value instead of const reference) and sorting them, so

sort("lead") => "adel"
sort("deal") => "adel"

This change should already speed your algorithm up by quite a bit. One more thing that may greatly affect the performance if you tend to compare arbitrary words:

bool is_anagram(std::string s1, std::string s2)
{
    if(s1.length() != s2.length())
        return false;
    /* as above */
}

If the length of the two strings differs, it obviously cannot be an anagram. std::string::length() is a very fast operation (it needs to store the string size anyway), so we save us the hustle of O(N*log(N)) from sorting the two strings.

回答2:

You've got 26 characters, make one array of the size of 26, then iterate through both words and as you encouter characters in words, increment a corresponding array element for characters in the first word and decrement corresponding array element for charaacters in the second word. If the words are anagrams, all array elements will be 0 in the end. The complexity will be just O(N) where N is the length of words.

回答3:

Here's a selection of functions that perform anagram analysis:

#include <iostream>
#include <iomanip>
#include <map>
#include <cctype>
#include <string>
#include <algorithm>

using namespace std;

bool is_anagram(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}


bool is_anagram1(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto c : x)
        {
        ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}


bool is_anagram2(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}


bool is_anagram3(std::string s1, std::string s2)
{
    if (s1.length() != s2.length()) return false;
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

bool is_anagram4(const std::string &s1, const std::string &s2)
{
    int arr[256] = {};
    if (s1.length() != s2.length()) return false;
    for(std::string::size_type i = 0; i < s1.length(); i++)
    {
    arr[(unsigned)s1[i]]++;
    arr[(unsigned)s2[i]]--;
    }
    for(auto i : arr)
    {
    if (i) return false;
    } 
    return true;
}

bool is_anagram5(const std::string &s1, const std::string &s2)
{
    int arr[26] = {};
    if (s1.length() != s2.length()) return false;

    for(std::string::size_type i = 0; i < s1.length(); i++)
    {
    arr[(unsigned)tolower(s1[i])-'a']++;
    arr[(unsigned)tolower(s2[i])-'a']--;
    }
    for(auto i : arr)
    {
    if (i) return false;
    } 
    return true;
}


static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}


template<typename T>
void bm(const char *name, T func, 
    const std::string &s1, const std::string &s2)
{
    unsigned long long time = rdtsc();
    bool res = func(s1, s2);
    time = rdtsc()-time;
    cout << "Function" << left << setw(15) << name 
     << " time: " << right << setw(8) << time 
     << " Res: " << res << endl;
}


#define BM(x) bm(#x, x, s1, s2)

int main()
{
    const std::string short1 = "lead";
    const std::string short2 = "deal";
    const std::string long1 = "leaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddeal";
    const std::string long2 = "dealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddealleaddeallead";

    cout << "Testing with short strings:" << endl;
    std::string s1 = short1;
    std::string s2 = short2;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with long strings:" << endl;
    s1 = long1;
    s2 = long2;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with inequal short string:" << endl;
    s1 = short1;
    s2 = short2;
    s1[s1.length()-1]++;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);

    cout << "Testing with inequal long string:" << endl;
    s1 = long1;
    s2 = long2;
    s1[s1.length()-1]++;

    BM(is_anagram);
    BM(is_anagram1);
    BM(is_anagram2);
    BM(is_anagram3);
    BM(is_anagram4);
    BM(is_anagram5);
}

And the results:

Testing with short strings:
Functionis_anagram      time:    72938 Res: 1
Functionis_anagram1     time:    15694 Res: 1
Functionis_anagram2     time:    49780 Res: 1
Functionis_anagram3     time:    10424 Res: 1
Functionis_anagram4     time:     4272 Res: 1
Functionis_anagram5     time:    18653 Res: 1
Testing with long strings:
Functionis_anagram      time:    46067 Res: 1
Functionis_anagram1     time:    33597 Res: 1
Functionis_anagram2     time:    52721 Res: 1
Functionis_anagram3     time:    41570 Res: 1
Functionis_anagram4     time:     9027 Res: 1
Functionis_anagram5     time:    15062 Res: 1
Testing with inequal short string:
Functionis_anagram      time:    11096 Res: 0
Functionis_anagram1     time:    10115 Res: 0
Functionis_anagram2     time:    13022 Res: 0
Functionis_anagram3     time:     8066 Res: 0
Functionis_anagram4     time:     2963 Res: 0
Functionis_anagram5     time:     1916 Res: 0
Testing with inequal long string:
Functionis_anagram      time:    44246 Res: 0
Functionis_anagram1     time:    33961 Res: 0
Functionis_anagram2     time:    45004 Res: 0
Functionis_anagram3     time:    38896 Res: 0
Functionis_anagram4     time:     6455 Res: 0
Functionis_anagram5     time:    14603 Res: 0

It is quite clear that the straightforward checking with an array counting up/down based on the occurrence of each character is the winner. I took the original code and removed the find, and just used the fact that a map::operator[] will create a zero value if there is no entry there in is_anagram1, which shows some decent improvement too.

Results are from MY machine, these may or may not reflect the results on other machines, but I doubt that the "winner vs loser" is going to be significantly different.

Edit:

Thought about the "find" operation, and decided to use it in different way:

bool is_anagram0(const std::string & s1, const std::string & s2)
{
    auto check = [](const std::string& x)
    {
        std::map<char, unsigned> counter;
        for(auto const &c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                it->second++;
        }
        return counter;
    };

    return check(s1) == check(s2);
}

Gives a little improvement over the original function, but not better enough to compete with the array solution that gives the best result.

回答4:

In addition to all the other suggestions, if you are trying to find pairs of anagrams in a set, you will be calling is_anagram on the same arguments (although not the same pairs of arguments) repeatedly.

Most solutions consist of two steps:

map the string arguments to some other object (a sorted string, an array of length 26, a map as in your own solution)
compare these objects to each other

If you have a set of N strings, and you want to find all pairs which are anagrams of each other, you will be calling is_anagram O(N^2) times.

You can save a lot by first computing the step 1 above for each of the N strings and then comparing the results.

回答5:

I propose a solution which is sorting only one of the two strings:

 bool is_anagram(std::string const & s1, std::string s2)
 {
     if (s1.length() != s2.length()) return false;
     for (int i=0; i<s1.length(); ++i)
     {
         size_t pos = s2.find(s1[i], i);
         if (pos == std::string::npos)
         {
             return false;
         }
         std::swap(s2[i], s2[pos]);
     }
     return true;
 }

This solution could be advantageous in situations where you have one character in the one string which isn't in the other one - because if it doesn't find that character in the other string, it can short-cut the evaluation (in contrast to all other solutions shown here, where always both full strings are evaluated). Especially if this exclusive character on one side occurs early in a long string...

One advantage over the sort solution is also required storage space - the sort solution requires the two strings to be copied, while in my solution only one copy is created. For shorter string lengths (depending on the int type used for counting, but at least up to 256 characters), it also requires less memory than the "count up/down" solution.

For longer strings which are anagrams, on the other hand, it starts to fall behind the "count up/down" solution.

Analysis

See the code & results below. As can easily be seen, my solution (is_anagram3) is quite fast for short anagrams (only beaten by the actually not fully functionally equivalent 26 character count up/down method), and is also fastest for the long non-anagram case with non-matching characters; but tends to get slower than the count up/down methods for longer strings which are anagrams, or for long non-anagrams which just differ by character counts.

Summary

In the end, it would really depend on the expected input data as to what the ideal solution is:

For single calls, the "count up/down" solutions will usually give the best performance in many cases.
Depending on the circumstances, (e.g. with what probability the strings contain different characters, as mentioned above), my solution might be faster
Not tested yet, but seems sure: For the case when you have many anagram checks to perform, and when you implement a cache for the sorted strings, the sorting and comparing solution will become the fastest.

Code:

#include <string>
#include <map>
#include <algorithm>

#include <sys/time.h>
#include <iostream>
#include <iomanip>

bool is_anagram(std::string const & s1, std::string const & s2)
{
    auto check = [](std::string const & x)
    {
        std::map<char, unsigned> counter;
        for(auto const & c : x)
        {
            auto it = counter.find(c);
            if(it == counter.end())
                counter[c] = 1;
            else
                ++counter[c];
        }
        return counter;
    };

    return check(s1) == check(s2);
}

bool is_anagram2(std::string s1, std::string s2)
{
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());
    return s1 == s2;
}

bool is_anagram3(std::string const & s1, std::string s2)
{
    if (s1.length() != s2.length()) return false;

    for (int i=0; i<s1.length(); ++i)
    {
        size_t pos = s2.find(s1[i], i);
        if (pos == std::string::npos)
        {
            return false;
        }
        std::swap(s2[i], s2[pos]);
    }
    return true;
}

bool is_anagram4(std::string const & s1, std::string const & s2)
{
    if (s1.length() != s2.length()) return false;

    int count[26] = {0};
    for (int i=0; i<s1.length(); ++i)
    {
        count[s1[i]-'a']++;
        count[s2[i]-'a']--;
    }
    for (int i=0; i<26; ++i)
    {
        if (count[i] != 0) return false;
    }
    return true;
}

bool is_anagram5(std::string const & s1, std::string const & s2)
{
    if (s1.length() != s2.length()) return false;

    int count[256] = {0};
    for (int i=0; i<s1.length(); ++i)
    {
        count[s1[i]]++;
        count[s2[i]]--;
    }
    for (int i=0; i<256; ++i)
    {
        if (count[i] != 0) return false;
    }
    return true;
}

template<typename T>
bool test(const char* name, T func)
{
    if (!func("deal", "lead") || !func("lead", "deal") || !func("dealbreaker", "breakdealer") ||
        !func("dealbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealer") ||
        func("dealbreakerdealbreakerdealbreakerdealbreakera", "breakdealerbreakdealerbreakdealerbreakdealerb") ||
        func("dealxbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealerb") ||
        func("abcdefg", "tuvwxyz") ||
        func("lot", "bloat") || func("lead", "deala") ||
        func("lot", "bloat") || func("deala", "lead") ||
        func("abc", "abcd"))
    {
        std::cout << name << ": impl doesn't work" << std::endl;
        return false;
    }
    return true;
}

template<typename T>
void measure(const char* name, T func,
    std::string const & s1, std::string const & s2)
{
    timeval start,end;
    unsigned long secDiff;
    long usecDiff;

    gettimeofday(&start, 0);
    for (int i=0; i<100000; ++i)
    {
        func(s1, s2);
    }
    gettimeofday(&end, 0);
    secDiff = end.tv_sec - start.tv_sec;
    usecDiff = end.tv_usec - start.tv_usec;
    if (usecDiff < 0)
    {
        secDiff--;
        usecDiff += 1000000;
    }
 std::cout << name << ": " << secDiff << "."<< std::setw(3) << std::setfill('0') << (usecDiff/1000) << " s" << std::endl;
}

template <typename T>
void run(const char * funcName, T func)
{
    std::cout << funcName << std::endl;
    if (!test(funcName, func)) {
        return;
    }
    measure("short", func, "dealbreaker", "breakdealer");
    measure("short no anagram", func, "abcdefg", "tuvwxyz");
    measure("long", func, "dealbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealer");
    measure("long no anagram", func, "dealbreakerdealbreakerdealbreakerdealbreakera", "breakdealerbreakdealerbreakdealerbreakdealerb");
    measure("long no anagram nonmatching char", func, "dealxbreakerdealbreakerdealbreakerdealbreaker", "breakdealerbreakdealerbreakdealerbreakdealerb");
}

int main(int argv, char**argc)
{
    run("is_anagram", is_anagram);
    run("is_anagram2", is_anagram2);
    run("is_anagram3", is_anagram3);
    run("is_anagram4", is_anagram4);
    run("is_anagram5", is_anagram5);
}

Output

Measured on my slow Atom machine, results may vary a bit on different systems, but should in general give a good picture of the relative performances:

is_anagram
short: 2.960 s
short no anagram: 2.154 s
long: 8.063 s
long no anagram: 8.092 s
long no anagram nonmatching char: 8.267 s
is_anagram2
short: 0.685 s
short no anagram: 0.333 s
long: 3.332 s
long no anagram: 3.557 s
long no anagram nonmatching char: 3.740 s
is_anagram3
short: 0.280 s
short no anagram: 0.022 s
long: 0.984 s
long no anagram: 0.994 s
long no anagram nonmatching char: 0.140 s
is_anagram4
short: 0.123 s
short no anagram: 0.071 s
long: 0.399 s
long no anagram: 0.389 s
long no anagram nonmatching char: 0.390 s
is_anagram5
short: 0.283 s
short no anagram: 0.145 s
long: 0.551 s
long no anagram: 0.454 s
long no anagram nonmatching char: 0.455 s

回答6:

Assuming that most word pairs are not anagrams, the case you most need to optimise is the failure case.

Obviously if the lengths are different then the strings cannot be anagrams, and the length is a property that is computed once when the string is created, making it a very efficient thing to compare as a fast-out.

If you change your string class to include more of these properties you can improve the accuracy of the fast-out case. Rather than beginning the test function with:

if (s1.length() != s2.length()) return false;

You could use:

if (s1.hash != s2.hash) return false;

and when you create the string (or after you modify it), you would compute a value for hash which is unique (or nearly unique) to all strings with that set of letters in arbitrary order.

You only compute this hash when a string changes; not for every comparison that you do.

A very simple way to compute the hash is:

long sum = 0;
for (int i = 0; i < s.length(); i++)
    sum += s[i];
s.hash = sum;

this is quick to compute, but is prone to collisions.

A more advanced method:

static const unsigned int primetable[256] = { 1, 3, 5, 7, /* ... */ };

unsigned long prod = 1;
for (int i = 0; i < s.length(); i++)
    prod *= primetable[s[i]];
s.hash = prod;

Note that two is omitted from the list of primes. This ensures that prod is always co-prime with the possible range of unsigned long. This maximises the distribution of results in the face of numerical overflow.

If the table is arranged to put small primes at frequent letter positions, then the cases where overflow occurs (which can lead to hash collisions) can be minimised. If there's no overflow then you have a perfect hash (for these purposes) and you would have absolute certainty both ways (return either true or false immediately) by just comparing hash.

Consequently, computing the hash like this works out much better:

static const unsigned int primetable[256] = {
    1, 1429, 1433, 1439, 1447, 1451, 1453, 1459, 1471, 1481, 1483, 1487, 1489, 1493, 1499, 1511, 1523, 1531, 1543, 1549, 1553, 1559, 1567, 1571, 1579, 1583, 1597, 1601, 1607, 1609, 1613, 1619,
    821, 823, 827, 829, 839, 853, 857, 107, 859, 863, 877, 881, 883, 109, 887, 907, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 911, 919, 929, 937, 941, 947,
    601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787, 797, 809, 811,
    359, 11, 61, 31, 37, 3, 71, 43, 59, 7, 101, 79, 29, 53, 13, 23, 47, 103, 17, 5, 19, 41, 73, 83, 89, 67, 97, 367, 373, 379, 383, 389,
    1181, 1187, 1193, 1201, 1213, 1217, 1223, 1229, 1231, 1237, 1249, 1259, 1277, 1279, 1283, 1289, 1291, 1297, 1301, 1303, 1307, 1319, 1321, 1327, 1361, 1367, 1373, 1381, 1399, 1409, 1423, 1427,
    953, 967, 971, 977, 983, 991, 997, 1009, 1013, 1019, 1021, 1031, 1033, 1039, 1049, 1051, 1061, 1063, 1069, 1087, 1091, 1093, 1097, 1103, 1109, 1117, 1123, 1129, 1151, 1153, 1163, 1171,
    397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599,
    173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353
};

unsigned long prod = 1;
boolean overflow = false;
for (int i = 0; i < s.length(); i++)
{
    overflow |= (prod > ULONG_MAX / primetable[s[i]]);
    prod *= primetable[s[i]];
}
if (overflow)
    prod ^= 1;
s.hash = prod;

with fast-outs:

if (s1.hash != s2.hash) return false;
if ((s1.hash & 1) != 0) return true;
if (s1.length() != s2.length()) return false;

The middle line is only safe to use if the character encoding is not multi-byte. If you are working with a multi-byte coding scheme then the hash will still eliminate most non-anagrams, but it will have a lot of false positives as some byte orderings cannot be ignored.

Hacking Mats Petersson's test code to read from a dictionary, and trying this and the other algorithms on real English dictionary input we see roughly a factor of four improvement over the next best algorithm (it was a factor of ten, but I tweaked the other code):

Functionis_anagram      time:    218.9s hits: 93
Functionis_anagram1     time:      200s hits: 93
Functionis_anagram2     time:       40s hits: 93
Functionis_anagram3     time:     7.74s hits: 93
Functionis_anagram4     time:     2.65s hits: 93
Functionis_anagram5     time:      2.3s hits: 166
Functionis_anagram6     time:     0.56s hits: 166  <- This method

(the number of hits is different because the last two algorithms are both case-insensitive and my dictionary probably included acronyms colliding with natural words)

update: Although it's not what was asked, it was negligent of me to not point this out. I don't know if I didn't spot it or if I just got sick of typing or if I didn't want to make assumptions about the calling code...

Once you've hashed all the words, a trivial way to minimise the number of tests is to sort the list by that hash. Then you can trivially limit comparisons to parts of the list that are likely matches (according to the hash). This can also make branch prediction more efficient, too.

I just tried changing the N^2 iteration I tested with originally (I'm sure that was deliberately inefficient) to iterate over neighbours in a sorted list. The sort() call dominated the timing, but was 200x faster than the fastest N^2 test, and the choice of comparison algorithm no longer made any meaningful difference to performance.

Or you could just bin words by hash as you receive them.

回答7:

What about this solution? It's in C# if you don't mind.

public bool isAnagram(string first, string second) {
    if(first == null || second == null)
        return false;
    if(first.Length != second.Length)
        return false;
    string characters = "abcd...zABCD...Z";
    int firstResult = 0;
    int secondResult = 0;
    foreach(char x in second) {
        if(first.IndexOf(x) == -1)
            return false;
        if(char == " ")
            secondResult += 0;
        else
            secondResult += characters.IndexOf(x);
    }
    foreach(char x in first) {
        if(char == " ")
            firstResult += 0;
        else
            firstResult += characters.IndexOf(x);
    }
    return firstResult == secondResult;

}

来源：https://stackoverflow.com/questions/18123959/optimizing-very-often-used-anagram-function

标签

c++

string

algorithm

optimization

anagram