Need help expanding on an anagram regex

Question

I am attempting to expand on this regex for listing all possible anagrams for a given set of letters:

^(?!.*([aer]).*\1)(?!(.*d){4})([aerd]*|[a-z])$

So far, with this regex, I get a match on any word or sub-word made up of the letters of 'dadder', such as 'adder', 'add', 'ad', 'red', etc. The regex is this complex, rather than a simple [dadder]*, because a plain character class lets each letter match an unlimited number of times, which is bad. I want each letter to be usable only as many times as it is provided: if two d's are provided, d can match at most twice. If somebody can streamline the regex so that each letter matches no more than a specified number of times, please feel free to provide it :)
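To make the behaviour described above concrete, here is a minimal Python sketch that runs the regex from the question against a few test words. The helper name `matches` is only illustrative and is not part of the original question.

```python
import re

# The regex from the question: a, e and r may each appear at most once,
# and d may appear at most three times (the letters of 'dadder').
PATTERN = re.compile(r"^(?!.*([aer]).*\1)(?!(.*d){4})([aerd]*|[a-z])$")


def matches(word: str) -> bool:
    """Return True if the whole word is accepted by the question's regex."""
    return PATTERN.match(word) is not None


for word in ["adder", "add", "ad", "red", "area", "daddder"]:
    print(word, matches(word))
```

The first four words are accepted. 'area' is rejected because a repeats, and 'daddder' is rejected because it contains four d's.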

However, my main question is this: I would now like to incorporate a full stop character ".". If a full stop appears in the list of characters, it acts as a wildcard and can match any character a-z. So dadd.r could match daddzr, daddor, daddpr, rpdadd, etc.

Could anybody help me with this?

Answer 1:

This is not a problem that should be solved with a regex, as nhahtdh's amusing answer should convince you.

Regexes are good at matching patterns. They are not a tool for solving set-based problems, which is what you are trying to use them for.

You really need an algorithmic approach, because that is the nature of the problem. This question covers just such a topic.

Answer 2:

The first part of the question is a duplicate of this question: Check if string is subset of a bunch of characters? (RegEx)?

This answer is dedicated to tackling the actual problem you are facing (the second part of the question).

A very simple solution is to use two maps: one holds the frequency of each character in the original set, including the number of . characters; the other tracks how many of each character the input string has consumed so far.

Pseudocode:

// I assume the maps return 0 for non existent entries
// Depending on the input, the map can simply be an array, or a tree/hash map

function checkAnagramExtended(originalString, inputString):
    if (inputString.length > originalString.length):
        return false

    // The frequency mapping for original string (ref stands for reference)
    // Ideally, refMap should be filled up once instead of every call
    // to this function
    var refMap = countFrequency(originalString)
    // The frequency mapping for input string
    var inpMap = empty map

    foreach (character c in inputString):

        if (inpMap[c] >= refMap[c]):
            // You may want to check that c is a character allowed
            // to be substituted by dot .
            // if (!canBeSubstitutedByDot(c)):
            //     return false

            if (inpMap['.'] >= refMap['.']):
                return false
            else:
                inpMap['.'] += 1

        else:
            inpMap[c] += 1

    return true
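The pseudocode above can be sketched in Python roughly as follows. This is one possible translation, not the answerer's own code: the names `check_anagram_extended`, `ref` and `used` are illustrative, and the optional "can this character be substituted by a dot" check from the pseudocode is left out.

```python
from collections import Counter


def check_anagram_extended(original: str, candidate: str) -> bool:
    """Check whether candidate can be built from the characters of original,
    where each '.' in original may stand in for any one character."""
    if len(candidate) > len(original):
        return False

    # Frequency of each character (including '.') in the reference set.
    # Ideally this would be computed once, not on every call.
    ref = Counter(original)
    # Frequency of each character consumed so far; Counter returns 0
    # for keys that have not been seen.
    used = Counter()

    for ch in candidate:
        if used[ch] < ref[ch]:
            # A copy of this exact character is still available.
            used[ch] += 1
        elif used["."] < ref["."]:
            # No copy left, so spend one wildcard instead.
            used["."] += 1
        else:
            return False
    return True


print(check_anagram_extended("dadd.r", "daddzr"))
print(check_anagram_extended("dadd.r", "daddzz"))
```

With the set dadd.r, 'daddzr' is accepted because the single wildcard covers the z, while 'daddzz' is rejected because it would need two wildcards.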

Appendix: Extending regex solution?

Your dot . extension, which allows any character from a-z to be matched, makes the regex solution even more impractical.

In my solution to the other problem, I relied heavily on negative look-ahead to assert that the count of a particular character does not exceed the number of times it appears in the multiset of characters.
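The linked solution is not reproduced here, so the following Python sketch only illustrates the general look-ahead idea under stated assumptions: the function name `build_subset_regex` is hypothetical, and the input is assumed to be a non-empty string with no wildcard. For each distinct character that occurs n times, a negative look-ahead forbids n + 1 occurrences, and a character class restricts the word to the allowed letters.

```python
import re
from collections import Counter


def build_subset_regex(letters: str) -> re.Pattern:
    """Build a regex accepting any word drawn from the multiset of letters.

    Assumes letters is non-empty and contains no wildcard.
    """
    counts = Counter(letters)
    # One negative look-ahead per distinct character: fail if the
    # character occurs more than its allowed count.
    lookaheads = "".join(
        f"(?!(?:.*{re.escape(ch)}){{{n + 1}}})" for ch, n in counts.items()
    )
    # Only characters from the multiset may appear at all.
    char_class = "[" + "".join(re.escape(ch) for ch in counts) + "]"
    return re.compile(f"^{lookaheads}{char_class}*$")


pattern = build_subset_regex("dadder")
print(pattern.pattern)
print(bool(pattern.match("adder")), bool(pattern.match("dddd")))
```

This also addresses the request in the question for a version where the per-letter limits are derived from the letters rather than written by hand.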

The dot . extension can raise the maximum count allowed for any of the characters, which breaks my solution above. If you force a regex to do the job, it is possible to generate one when there is only one ., but things explode when you increase it to two.

Answer 3:

OK, after much toil attempting to get this going as a regex, I gave up due to incomplete wildcard support and slow processing times.

I've now converted my requirements to a C# function, and I'm a LOT more comfortable and happier now because it's also about 400% quicker, which is great.

This checks whether a given word is an anagram or sub-anagram of a set of letters, with wildcard support via the full stop (.).

Here, letters is the string of letters to test against for anagrams.

dictionaryData is a List<string> of words to test.

// Requires: using System.Collections.Generic; using System.Linq;
// Count how many times each character (including '.') occurs in letters.
var letterCounts = letters
  .GroupBy(x => x)
  .ToDictionary(x => x.Key, x => x.Count());

var containsWildcards = letters.IndexOf('.') >= 0;
foreach (var dictWord in dictionaryData)
{
    var matches = 0;
    var dictWordLength = dictWord.Length;
    // A word longer than the letter set can never match.
    if (dictWordLength > letters.Length)
        continue;
    // Characters consumed so far, plus one '.' per wildcard used.
    var addedChars = new List<char>();
    foreach (var dictLetter in dictWord)
    {
        var foundLetter = false;
        if (letterCounts.ContainsKey(dictLetter) &&
            addedChars.Count(x => x == dictLetter) < letterCounts[dictLetter])
        {
            // A copy of this exact letter is still available.
            foundLetter = true;
        }
        else if (containsWildcards &&
            addedChars.Count(x => x == '.') < letterCounts['.'])
        {
            // No copy left, so spend a wildcard instead.
            addedChars.Add('.');
            foundLetter = true;
        }
        if (!foundLetter)
            break; // This letter cannot be covered, so the word cannot match.
        addedChars.Add(dictLetter);
        matches++;
    }

    if (dictWordLength == matches)
    {
        // We have a match!
    }
}

Hope it can help someone else too.

Source: https://stackoverflow.com/questions/15156431/need-help-expanding-on-an-anagram-regex

Tags

regex

anagram