Fuzzy matching multiple words in string

▼魔方 西西 提交于 2019-12-08 07:33:17

问题


I'm trying to employ the help of the Levenshtein Distance to find fuzzy keywords(static text) on an OCR page.
To do this, I want to give a percentage of errors that are allowed (say, 15%).

string Keyword = "past due electric service";

Since the keyword is 25 characters long, I want to allow for 4 errors (25 * .15 rounded up)
I need to be able to compare it to...

string Entire_OCR_Page = "previous bill amount payment received on 12/26/13 thank 
                          you! current electric service total balances unpaid 7 
                          days after the total due date are subject to a late 
                          charge of 7.5% of the amount due or $2.00, whichever/5 
                          greater. "

This is how I am doing it now...

int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202   
int NumberOfErrorsAllowed = 4;   
int Allowance = (Entire_OCR_Page.Length() - Keyword.Length()) + NumberOfErrorsAllowed; // = 205

Clearly, Keyword is not found in OCR_Text (which it shouldn't be). But, using Levenshtein's Distance, the number of errors is less than the 15% leeway (therefore my logic says it's found).

Does anyone know of a better way to do this?


回答1:


Answered My Question with the use of sub-strings. Posting in case others run into the same type of problem. A little unorthodox, but it works great for me.

int TextLengthBuffer = (int)StaticTextLength - 1; //start looking for correct result with one less character than it should have.
int LowestLevenshteinNumber = 999999; //initialize insanely high maximum
decimal PossibleStringLength = (PossibleString.Length); //Length of string to search
decimal StaticTextLength = (StaticText.Length); //Length of text to search for
decimal NumberOfErrorsAllowed = Math.Round((StaticTextLength * (ErrorAllowance / 100)), MidpointRounding.AwayFromZero); //Find number of errors allowed with given ErrorAllowance percentage

    //Look for best match with 1 less character than it should have, then the correct amount of characters.
    //And last, with 1 more character. (This is because one letter can be recognized as 
    //two (W -> VV) and visa versa) 

for (int i = 0; i < 3; i++) 
{
    for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
    {
        string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
        int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
        int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

        if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
        {
            PossibleResult = (new StaticTextResult { text = possibleResult, errors = lNumber });
            LowestLevenshteinNumber = lNumber;
        }
    }
    TextLengthBuffer++;
}




public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }

    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}



回答2:


I think it's not working because a large chunk of your string is matched. So what I'd do, is try splitting your Keyword into separate words.

Then find all places where those words are matched in your OCR_TEXT.

Then look at all those places where they matched and see if any 4 of those places are consecutive and match the original phrase.

Am unsure if my explanation is clear?



来源:https://stackoverflow.com/questions/21615391/fuzzy-matching-multiple-words-in-string

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!