How to String.Contains() the Fuzzy way in C#?

Question

I have a list of people that I want to filter by a search string. The filter is applied each time the user changes the search string.

There are two challenges to consider:

The user may enter only part of a name
The user may mistype a name

The first one is easily solved by searching for substrings, e.g. with String.Contains(). The second one could be solved with a fuzzy matching implementation (e.g. https://fuzzystring.codeplex.com).

But I don't know how to handle both challenges at the same time.

For example: I want to find the person "Dr. Martin Fowler" when entering one of:

"Martin"
"Fawler"
"Marten Fauler"

I guess I need to write some "FuzzyContains()" logic that handles my needs and also has acceptable performance. Any advice on how to start?

Answer 1:

This seems to be a job for the Levenshtein distance algorithm (there are dozens of C# implementations).

You give this algorithm two strings (the one the user entered and one from your list). It then calculates how many characters must be replaced, added or removed to get from the first string to the second one. You can then take all elements from your list whose distance is less than or equal to three (for example) to catch simple typos.

Once you have such a method, you could use it like this:

var userInput = textInput.Text.ToLower();
var matchingEmployees = EmployeeList.Where(x => x.Name.ToLower().Contains(userInput)
                                                || LevenshteinDistance.Compute(x.Name.ToLower(), userInput) <= 3)
                                    .ToList();
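
The snippet above calls LevenshteinDistance.Compute without showing it. As a rough sketch only, such a helper could look like the following; the class and method names are simply taken from the snippet and are not a particular library's API.

```csharp
using System;

// Hypothetical helper; the name mirrors the call in the snippet above.
public static class LevenshteinDistance
{
    // Returns the minimum number of single-character insertions,
    // deletions and substitutions needed to turn source into target.
    public static int Compute(string source, string target)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (target == null) throw new ArgumentNullException(nameof(target));

        // Only two rows of the usual distance table are kept in memory.
        var previous = new int[target.Length + 1];
        var current = new int[target.Length + 1];
        for (int j = 0; j <= target.Length; j++) previous[j] = j;

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // The row just filled becomes the "previous" row of the next pass.
            var swap = previous;
            previous = current;
            current = swap;
        }
        return previous[target.Length];
    }
}
```

With this sketch, Compute("fowler", "fawler") is 1, while Compute("dr. martin fowler", "martin") is 11, so a threshold of 3 on whole names only helps when the user types something close to the full name.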

Answer 2:

I modified the answer from Oliver, who suggested the Levenshtein distance algorithm. That is not the best choice here, since the calculated distance is too big when only part of a name is entered. So I ended up using the Longest Common Subsequence algorithm as implemented by the awesome FuzzyString library.

const int TOLERANCE = 1;
string userInput = textInput.Text.ToLower();
var matchingPeople = people.Where(p =>
{
     // Lower-case the name once so that both checks are case-insensitive
     string name = p.Name.ToLower();

     // Check Contains
     if (name.Contains(userInput)) return true;

     // Check LongestCommonSubsequence: allow up to TOLERANCE unmatched input characters
     return name.LongestCommonSubsequence(userInput).Length >= userInput.Length - TOLERANCE;
}).ToList();
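
For readers without the library at hand, here is a minimal sketch of the same tolerance check using a hand-written longest-common-subsequence length. SubsequenceMatch, LcsLength and IsTolerated are hypothetical names chosen for illustration; this is not the FuzzyString implementation.

```csharp
using System;

// Hypothetical stand-in for the library call; it returns only the LCS length.
public static class SubsequenceMatch
{
    // Classic dynamic-programming table: table[i, j] is the LCS length
    // of the first i characters of a and the first j characters of b.
    public static int LcsLength(string a, string b)
    {
        var table = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                table[i, j] = a[i - 1] == b[j - 1]
                    ? table[i - 1, j - 1] + 1
                    : Math.Max(table[i - 1, j], table[i, j - 1]);
            }
        }
        return table[a.Length, b.Length];
    }

    // True when at most `tolerance` characters of the input have no
    // counterpart (in order) in the name; comparison ignores case.
    public static bool IsTolerated(string name, string input, int tolerance)
    {
        string n = name.ToLowerInvariant();
        string q = input.ToLowerInvariant();
        return LcsLength(n, q) >= q.Length - tolerance;
    }
}
```

Under these assumptions, "Fawler" is accepted for "Dr. Martin Fowler" with a tolerance of 1, whereas "Marten Fauler" has three unmatched characters and needs a tolerance of 3. Note that a subsequence may be scattered across the whole name, so a larger tolerance also admits more false positives.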

Answer 3:

I've done this myself before and started with some of the methods listed in the Wikipedia article on approximate string matching. When I was done, I tuned my algorithm in ways that were not as general purpose but gave me better matches in my domain.

If your whole dictionary is in memory and not too large, you can simply apply your matching algorithm to every entry in the dictionary. If your dictionary is large, this will likely use too many resources and you will need a better algorithm. You might also want to consider using the full-text search feature of your database.

In my case, I iterated through each string in my dictionary and scored "matching runs", i.e. 2 points for a 2-character match, 3 for a 3-character match, and so on up to an 8-character match. I ran through all possible pairs, triples, etc., scoring each dictionary entry and selecting the highest-scoring match. This tolerates typos, word order, etc., but it is computationally expensive. My dictionary was at most a few thousand phrases, so this worked very well for me. This is a modified version of Dice's coefficient.
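
The exact scoring of matching runs is not spelled out above, so the following is only the unmodified textbook Dice's coefficient over character bigrams, shown as a starting point. The class name DiceCoefficient and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Textbook Dice's coefficient on character bigrams (not the modified
// run-based scoring described in the answer).
public static class DiceCoefficient
{
    // Counts how often each two-character sequence occurs, ignoring case.
    private static Dictionary<string, int> Bigrams(string text)
    {
        var counts = new Dictionary<string, int>();
        string s = text.ToLowerInvariant();
        for (int i = 0; i + 1 < s.Length; i++)
        {
            string gram = s.Substring(i, 2);
            counts.TryGetValue(gram, out int seen);
            counts[gram] = seen + 1;
        }
        return counts;
    }

    // Returns 2 * shared / (bigrams in a + bigrams in b), a value in [0, 1].
    public static double Compute(string a, string b)
    {
        var left = Bigrams(a);
        var right = Bigrams(b);
        int totalLeft = 0, totalRight = 0, shared = 0;

        foreach (var pair in left)
        {
            totalLeft += pair.Value;
            if (right.TryGetValue(pair.Key, out int other))
                shared += Math.Min(pair.Value, other);
        }
        foreach (int count in right.Values) totalRight += count;

        if (totalLeft + totalRight == 0) return 0.0; // no bigrams at all
        return 2.0 * shared / (totalLeft + totalRight);
    }
}
```

Because only the set of bigrams matters, "Martin Fowler" and "Fowler Martin" still score above 0.8 here, which illustrates the tolerance to word order mentioned above. Scoring every dictionary entry this way and keeping the highest score is the brute-force selection step.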

Answer 4:

Did you try brute force? The simplest way would be to match the search string against substrings of each target string, starting from the beginning, and then take the closest of all the matches.

But this might not be acceptable from a performance point of view.
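
One way to read this brute-force idea, as a sketch under stated assumptions: slide a window of the query's length over the target, measure each window with an edit distance, and keep the smallest value. The answer does not say which closeness measure to use; Levenshtein distance is an assumption here, and BruteForceMatcher and BestWindowDistance are hypothetical names.

```csharp
using System;

public static class BruteForceMatcher
{
    // Plain two-row Levenshtein distance between a and b.
    private static int EditDistance(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            var swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Smallest edit distance between the query and any window of the
    // target that has the query's length; comparison ignores case.
    public static int BestWindowDistance(string target, string query)
    {
        string t = target.ToLowerInvariant();
        string q = query.ToLowerInvariant();

        // A query at least as long as the target is compared as a whole.
        if (q.Length >= t.Length) return EditDistance(t, q);

        int best = int.MaxValue;
        for (int start = 0; start + q.Length <= t.Length; start++)
        {
            int distance = EditDistance(t.Substring(start, q.Length), q);
            if (distance < best) best = distance;
            if (best == 0) break; // exact substring found, cannot improve
        }
        return best;
    }
}
```

For "Dr. Martin Fowler" this gives 0 for "Martin", 1 for "Fawler" and 3 for "Marten Fauler", so a small threshold covers all three examples from the question. The cost is one distance computation per window, which is the performance concern raised above.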

Answer 5:

Maybe you could use this Soundex implementation: CodeProject. What it does is compare two strings and calculate their "pronunciation similarity" as a percentage. Some time ago I built a search with the help of this function (PHP has it built in).
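
The linked implementation is not reproduced here. As a rough illustration of the underlying idea, the sketch below encodes a single ASCII word with the classic American Soundex rules; two words that sound alike tend to get the same four-character code, so comparing codes is one simple phonetic test. The class name Soundex and the method Encode are hypothetical, and no percentage is computed.

```csharp
using System;
using System.Text;

// Classic Soundex sketch; assumes the input is a single ASCII word.
public static class Soundex
{
    // Maps a lower-case letter to its Soundex digit, or '0' if it has none.
    private static char CodeOf(char c)
    {
        switch (c)
        {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default: return '0'; // vowels, h, w, y and anything else
        }
    }

    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        string s = word.ToLowerInvariant();
        var result = new StringBuilder();
        result.Append(char.ToUpperInvariant(s[0])); // first letter is kept as-is
        char last = CodeOf(s[0]);

        for (int i = 1; i < s.Length && result.Length < 4; i++)
        {
            char c = s[i];
            if (c == 'h' || c == 'w') continue; // h and w do not separate equal codes
            char code = CodeOf(c);
            if (code != '0' && code != last) result.Append(code);
            last = code; // a vowel resets this, so a repeated code is emitted again
        }
        return result.ToString().PadRight(4, '0'); // pad short codes with zeros
    }
}
```

With this sketch, "Fowler", "Fawler" and "Fauler" all encode to F460, and "Martin" and "Marten" both encode to M635. A multi-word name such as "Dr. Martin Fowler" would have to be split into words first and each word compared separately.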

Answer 6:

Other languages such as Python have nice tools for text processing, including distance computations. There are algorithms such as Levenshtein that compute a fuzzy distance between two strings. I have seen some implementations in C#, and Python also has the difflib module. For a distance measure such as Levenshtein, the output is a number; the closer to 0, the better the match.

Answer 7:

I had a school project some time ago with a textbox in which students could search for any employee or student who had something to do with the school. We were talking about a couple of hundred people. The simple LINQ query we used was blazingly fast on a Core i3 processor. The query was run every time the user typed something in the textbox. In the TextChanged event we ran a query that looked like this:

var resultData = EmployeeList.Where(x=>x.Name.ToLower().Contains(textInput.Text.ToLower())).ToList();

Of course, this logic applies only if "Dr. Martin Fowler" is stored in a single property or member.

Source: https://stackoverflow.com/questions/25659859/how-to-string-contains-the-fuzzy-way-in-c

Tags

fuzzy-search