Fuzzy regular expressions

Question

I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but a recommendation for any way to do this would be helpful.

As an example, I want to match a string on the words "New York" preceded by a 2-digit number. The difficulty comes because the text is from OCR of a PDF, so I want to do a fuzzy match. I'd like to match:

12 New York
24 Hew York
33 New Yobk

and other "close" matches (in the sense of the Levenshtein distance), but not:

aa New York
11 Detroit

Obviously, I will need to specify the allowable distance ("fuzziness") for the match.

As I understand it, I cannot use the String::Approx Perl module to do this, because I need to include a regular expression in my match (to match the preceding digits).

Also, I should note that this is a very simplified example of what I'm really trying to match, so I'm not looking for a brute-force approach.

Edited to add:

Okay, my first example was too simple. I didn't mean for people to get hung up on the preceding digits -- sorry about the bad example. Here's a better example. Consider this string:

ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.

What this actually says is:

ASSIGNOR, BY MESNE ASSIGNMENTS, TO ALLIS-CHALMERS MANUFACTURING COMPANY, A CORPORATION OF DELAWARE

What I need to do is extract the phrase "ALUSCHALME&S MANOTAC/rURINGCOMPANY" and "DELAY/ABE". (I realize this might seem like madness. But I'm an optimist.) In general, the pattern will look something like this:

/Assignor(, by mesne assignments,)? to (company name), a corporation of (state)/i

where the matching is fuzzy.

Answer 1:

If you have one pattern and want to find its best match in a collection of texts, you can try q-gram distance. It is quite easy to implement and to adapt to special needs.

Your second description actually was helpful here, because the pattern and texts should be fairly long. q-gram distance does not work well with words like "York", but if your typical pattern is a whole address, that should be fine.

Try it like this:

Transform your texts and patterns into a reduced character set: for example, uppercase everything, strip leading and trailing whitespace, collapse runs of whitespace to a single separator between words, and replace all symbols with "#" or similar.
Choose a q-gram length to work with. Try 3 or 2; assume q=3 here.
Then build a q-gram profile of each text:
split each text into q-grams, i.e. NEW_YORK becomes [NEW, EW_, W_Y, _YO, YOR, ORK], and store the profile with the text.
When you search for a pattern, do the same with the pattern, then
loop through your text q-gram database and
- count, for each pattern/text pair, how many q-grams they share;
- each hit raises the score by 1.
The texts with the highest scores are your best hits.
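
The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not a reference implementation: the function names (normalize, qgrams, score, best_matches) are made up for this sketch, "_" is assumed as the word separator, and "shared q-grams" is interpreted as a multiset intersection, which is one of several reasonable readings.

```python
import re
from collections import Counter

def normalize(text):
    """Uppercase, join words with '_', replace every other symbol with '#'."""
    joined = "_".join(text.upper().split())
    return re.sub(r"[^A-Z0-9_]", "#", joined)

def qgrams(text, q=3):
    """Return the q-gram profile of a text as a multiset (Counter)."""
    s = normalize(text)
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def score(pattern_profile, text_profile):
    """Each q-gram the two profiles share raises the score by 1."""
    return sum((pattern_profile & text_profile).values())

def best_matches(pattern, texts, q=3):
    """Score every text against the pattern, best score first."""
    p = qgrams(pattern, q)
    scored = [(score(p, qgrams(t, q)), t) for t in texts]
    return sorted(scored, key=lambda pair: -pair[0])

if __name__ == "__main__":
    for s, t in best_matches("New York", ["Hew York", "Detroit", "New Yobk"]):
        print(s, t)
```

In a real text collection you would compute and store each text's profile once instead of rebuilding it on every search.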

Once that works, you can tweak the algorithm:

Pad all your texts (and the pattern, before searching) with q-1 special characters at each end, so even short words get a decent profile. For example, with q=3, New York becomes ^^NEW YORK$$.
You can even play around with replacing all consonants with "x" and vowels with "o" and so on. Try a couple of character classes this way, or even create super symbols by replacing groups of characters with a single one, e.g. CK becomes K, or SCH becomes $.
When raising the score for a q-gram hit, you can replace the increment of 1 with a value adjusted by other factors, such as the length difference between text and pattern.
Store both 2-grams and 3-grams, and weight them differently when counting.
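
Two of these tweaks (padding and weighted 2-/3-grams) might look like this in Python. It is a sketch under assumptions: "^" and "$" are used as padding characters as in the example above, and the weights 1.0 for 2-grams and 2.0 for 3-grams are arbitrary illustrative values, not tuned ones.

```python
from collections import Counter

def padded_qgrams(text, q):
    """q-grams of the text padded with q-1 '^' in front and q-1 '$' behind."""
    s = "^" * (q - 1) + text.upper() + "$" * (q - 1)
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def weighted_score(pattern, text, weights=None):
    """Sum shared q-grams over several q values, each with its own weight."""
    # Assumed weights: a shared 3-gram counts twice as much as a shared 2-gram.
    weights = weights or {2: 1.0, 3: 2.0}
    total = 0.0
    for q, weight in weights.items():
        shared = padded_qgrams(pattern, q) & padded_qgrams(text, q)
        total += weight * sum(shared.values())
    return total

if __name__ == "__main__":
    # Without padding "York" has only 2 trigrams; with padding it has 6.
    print(sorted(padded_qgrams("York", 3)))
    print(weighted_score("York", "Yobk"))
```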

Note that in the basic form described here, this algorithm does not have a good running time during search: O(|T|*|P|), with |T| and |P| the total lengths of your texts and your pattern. This is because you loop over all your texts, and for each text over your pattern. It is therefore only practical for a medium-sized text collection. With some thought, you can build an index structure over the q-grams (perhaps using hash tables), which might make this practical for huge text collections as well.
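
One possible shape for such an index is an inverted index that maps each q-gram to the texts containing it, so a search only touches texts that share at least one q-gram with the pattern. The class below is a hypothetical sketch (QGramIndex and its methods are invented names), built on plain Python dictionaries, and it skips the normalization and padding steps for brevity.

```python
from collections import Counter, defaultdict

def qgrams(text, q=3):
    """Multiset of q-grams of the uppercased text."""
    s = text.upper()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

class QGramIndex:
    """Inverted index: q-gram -> list of (text id, count in that text)."""

    def __init__(self, q=3):
        self.q = q
        self.texts = []
        self.postings = defaultdict(list)

    def add(self, text):
        text_id = len(self.texts)
        self.texts.append(text)
        for gram, count in qgrams(text, self.q).items():
            self.postings[gram].append((text_id, count))
        return text_id

    def search(self, pattern, top=3):
        """Return up to `top` (text, score) pairs, best score first."""
        scores = Counter()
        for gram, p_count in qgrams(pattern, self.q).items():
            # Only texts that actually contain this q-gram are visited.
            for text_id, t_count in self.postings.get(gram, ()):
                scores[text_id] += min(p_count, t_count)
        return [(self.texts[i], s) for i, s in scores.most_common(top)]

if __name__ == "__main__":
    index = QGramIndex()
    for name in ["ALLIS-CHALMERS MANUFACTURING COMPANY",
                 "GENERAL ELECTRIC COMPANY",
                 "WESTINGHOUSE ELECTRIC"]:
        index.add(name)
    print(index.search("ALUSCHALME&S MANOTAC/rURINGCOMPANY"))
```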

Answer 2:

Regexes have specific rules; they aren't built for doing what you want. It's going to be much easier to make two passes at it. Use a regex to strip off the numbers and then use a module to find a close match in what remains.

Something like this (assuming your input is lines from a file):

while ( my $line = <$fh> ) {
    chomp $line;

    # do we have leading digits?
    if ( $line =~ /^\d+/ ) {
        # remove digits and whitespace from the beginning of the line
        $line =~ s/^[\d\s]+//;

        # module_match() is a placeholder for whatever fuzzy-matching
        # routine you choose, applied to the remaining text
        if ( module_match($line) ) {
            # do something
        }
        else {
            # no match
        }
    }
    else {
        # no match
    }
}

Answer 3:

Separate the problem into two parts:

Match the double-digit number.
Fuzzily match the residue against 'New York'.

In the example, you know that 'New York' consists of 2 words; you might be able to leverage that to eliminate alternatives like 'Detroit' (but not necessarily 'San Francisco') more easily.

You might even be able to use 'String::Approx' after all, though its documentation also mentions:

... the Text::Levenshtein and Text::LevenshteinXS modules in CPAN. See also Text::WagnerFischer and Text::PhraseDistance.

(My Perl was unable to find Text::PhraseDistance via CPAN - the others are available and install OK.)

Answer 4:

You could try using something like Web 1T 5-gram Version 1 and a conditional likelihood maximization approach.

If I recall correctly, Chapter 14 of Beautiful Data is devoted to this data set and how to use it to spot spelling errors etc.

Answer 5:

Have you considered a two-stage test: use a regex such as [0-9]{2} (.*) to enforce the two-digit requirement, then capture the remaining text and do a fuzzy match on it? Try thinking of the problem as an intersection of a regular expression and a fuzzy string.
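
A minimal Python sketch of this two-stage idea, using the examples from the question. The helper names (levenshtein, fuzzy_line_match) and the default limit of 2 edits are assumptions made for illustration. The edit distance is the textbook dynamic-programming version, not the API of any particular CPAN module.

```python
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]

# Stage 1: the strict part (exactly two digits, then the rest of the line).
LINE = re.compile(r"^(\d{2})\s+(.*)$")

def fuzzy_line_match(line, target="New York", max_distance=2):
    """Return the two-digit number if the rest of the line is close to target."""
    m = LINE.match(line.strip())
    if not m:
        return None
    number, rest = m.groups()
    # Stage 2: the fuzzy part, compared case-insensitively.
    if levenshtein(rest.lower(), target.lower()) <= max_distance:
        return number
    return None

if __name__ == "__main__":
    for line in ["12 New York", "24 Hew York", "33 New Yobk",
                 "aa New York", "11 Detroit"]:
        print(repr(line), "->", fuzzy_line_match(line))
```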

Answer 6:

Well, you can narrow down your candidates by using Text::Levenshtein to get the edit distance and then grepping for those whose distance is within your limit.

But another idea is that you can take the correct form and create a hash keyed from near-misses pointing to the proper form so that those might become candidates as well.
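
The near-miss hash could be built along these lines. This Python sketch is only one way to do it and rests on stated assumptions: it generates variants at edit distance 1 only, over lowercase letters and the space character, so OCR errors that introduce digits or symbols would need a larger alphabet. The names near_misses and build_lookup are invented for the example.

```python
import string

# Assumed alphabet for generated errors: lowercase letters plus space.
ALPHABET = string.ascii_lowercase + " "

def near_misses(word):
    """All strings within one edit (delete, substitute, insert) of `word`."""
    word = word.lower()
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | substitutes | inserts

def build_lookup(proper_forms):
    """Map each near-miss (and the exact form) to its proper form."""
    lookup = {}
    for form in proper_forms:
        for variant in near_misses(form):
            lookup.setdefault(variant, form)
        lookup[form.lower()] = form  # an exact form always maps to itself
    return lookup

if __name__ == "__main__":
    lookup = build_lookup(["New York", "Detroit"])
    for candidate in ["Hew York", "New Yobk", "Detroit", "Chicago"]:
        print(candidate, "->", lookup.get(candidate.lower()))
```

The table grows quickly with word length and allowed distance, so this suits a small list of known proper forms.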

For regexes, you possibly would have to use the experimental code sections, perhaps something like this:

m/ (?i: [new] | \p{Alpha} (?{ $misses++ }) ){2,4}
   \s+
  (?i: [york] | \p{Alpha} (?{ $misses++ }) ){3,5}
 /x

Although in this case, you'd probably have to have a regex per proper value. You probably want some flag indicating when you missed your target.

Answer 7:

Rule of thumb: When you have to go to Stack Overflow and ask "How can I do X in a single regex?" you should consider doing X with more than just a single regex.

Based on your edits, I would do something like this:

while(<>) {
  chomp;
  if(/assignor, by (\w+) (\w+), to (\w+), a (\w+) of (\w+)/i) {
    # now use String::Approx to check that $1, $2, $3, $4, and $5 match
  } else {
    warn "Errors!\n";
  }
}

I'm not giving you everything here. I didn't make the ", by (\w+) (\w+)" bit optional to simplify the regex so you could get the gist of it. To do that you'll probably need to resort to named captures and the (?:) non-capturing group. I didn't feel like delving into all that, just wanted to help you understand how I would approach this.
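
For what it's worth, here is a sketch of the optional-clause version with named captures and a non-capturing group, written in Python rather than Perl. It makes some assumptions beyond the regex above: \S+ and .+? replace \w+ because the OCR output contains non-word characters and multi-word company names, the group names are invented, and the literal anchor words ("assignor", "by", "to", "a", "of") are assumed to have survived OCR intact. The captured pieces would still need a separate fuzzy check.

```python
import re

PATTERN = re.compile(
    r"assignor"
    r"(?:, by (?P<by1>\S+) (?P<by2>\S+),)?"   # optional clause in a non-capturing group
    r" to (?P<company>.+?)"                   # company name, as short as possible
    r", a (?P<kind>\S+) of (?P<state>[^.]+)",  # state runs up to the final period
    re.IGNORECASE,
)

def extract(line):
    """Return the named captures as a dict, or None if the line does not fit."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None

if __name__ == "__main__":
    ocr = ("ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S "
           "MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.")
    print(extract(ocr))
    print(extract("Assignor to Acme Corp, a corporation of Ohio"))
```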

Remember: If you have to ask "How do I do it all in a single regex?" you should stop trying to do it all in a single regex.

Answer 8:

Did you look into using Jarkko’s String::Approx module on CPAN? It has the agrep algorithm in it, but is much slower than Udi’s.

Answer 9:

Although you specified Perl, there is a useful function built into R that does approximate matching based on Levenshtein edit distance:

agrep()

This command also allows the use of any regular expression or pattern to match. I would recommend you look at it. http://stat.ethz.ch/R-manual/R-devel/library/base/html/agrep.html

Answer 10:

The Python regex module provides a way to do fuzzy matching within regexes:

https://pypi.org/project/regex/ (look for Approximate “fuzzy” matching)

The fuzziness of a regex item is specified between “{” and “}” after the item.

Examples:

foo             match “foo” exactly
(?:foo){i}      match “foo”, permitting insertions
(?:foo){d}      match “foo”, permitting deletions
(?:foo){s}      match “foo”, permitting substitutions
(?:foo){i,s}    match “foo”, permitting insertions and substitutions
(?:foo){e}      match “foo”, permitting errors

If a certain type of error is specified, then any type not specified will not be permitted.

In the following examples I’ll omit the item and write only the fuzziness:

{d<=3} permit at most 3 deletions, but no other types
{i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
{1<=e<=3} permit at least 1 and at most 3 errors
{i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions

So you could write, eg:

import pprint
import regex

pattern = regex.compile(
    r'(?:Assignor(, by mesne assignments,)? to (company name), a corporation of (state)){e}',
    regex.IGNORECASE,
)
m = pattern.match('ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.')

pprint.pprint(m)
pprint.pprint(m.groups())

This does not work right away; the result would be:

<regex.Match object; span=(0, 71), match='ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY,', fuzzy_counts=(45, 0, 0)>
(', BY MESHS ASSIGN1IBNTS', ' ALUSCHALME&', 'PANY,')

But with some more tweaking (e.g. you could specify a maximum number of errors for each capture group) you should be able to reach your goal.

Source: https://stackoverflow.com/questions/4155840/fuzzy-regular-expressions

Tags

regex

perl

fuzzy-comparison