I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this?
Please note: this site only has English content.
Some approaches I've thought of:
- Look up the word in some kind of thesaurus file to determine alternate forms of a given word.
- Some examples:
- Searches for "car", also add "cars" to the query.
- Searches for "carry", also add "carries" and "carried" to the query.
- Searches for "small", also add "smaller" and "smallest" to the query.
- Searches for "can", also add "can't", "cannot", "cans", and "canned" to the query.
- And it should work in reverse (i.e. search for "carries" should add "carry" and "carried").
- Drawbacks:
- Doesn't work for many new technical words unless the dictionary/thesaurus is updated frequently.
- I'm not sure about the performance of searching the thesaurus file.
- Some examples:
- Generate the alternate forms algorithmically, based on some heuristics.
- Some examples:
- If the word ends in "s" or "es" or "ed" or "er" or "est", drop the suffix
- If the word ends in "ies" or "ied" or "ier" or "iest", convert to "y"
- If the word ends in "y", convert to "ies", "ied", "ier", and "iest"
- Try adding "s", "es", "er" and "est" to the word.
- Drawbacks:
- Generates lots of non-words for most inputs.
- Feels like a hack.
- Looks like something you'd find on TheDailyWTF.com. :)
- Some examples:
- Something much more sophisticated?
I'm thinking of doing some kind of combination of the first two approaches, but I'm not sure where to find a thesaurus file (or what it's called, as "thesaurus" isn't quite right, but neither is "dictionary").
Consider including the PorterStemFilter
in your analysis pipeline. Be sure to perform the same analysis on queries that is used when building the index.
I've also used the Lancaster stemming algorithm with good results. Using the PorterStemFilter
as a guide, it is easy to integrate with Lucene.
Word stemming works OK for English, however for languages where word stemming is nearly impossible (like mine) option #1 is viable. I know of at least one such implementation for my language (Icelandic) for Lucene that seems to work very well.
Some of those look like pretty neat ideas. Personally, I would just add some tags to the query (query transformation) to make it fuzzy, or you can use the builtin FuzzyQuery, which uses Levenshtein edit distances, which would help for mispellings.
Using fuzzy search 'query tags', Levenshtein is also used. Consider a search for 'car'. If you change the query to 'car~', it will find 'car' and 'cars' and so on. There are other transformations to the query that should handle almost everything you need.
If you're working in a specialised field (I did this with horticulture) or with a language that does't play nicely with normal stemming methods you could use the query logging to create a manual stemming table.
Just create a word -> stem mapping for all the mismatches you can think of / people are searching for, then when indexing or searching replace any word that occurs in the table with the appropriate stem. Thanks to query caching this is a pretty cheap solution.
Stemming is a pretty standard way to address this issue. I've found that the Porter stemmer is way to aggressive for standard keyword search. It ends up conflating words together that have different meanings. Try the KStemmer algorithm.
来源:https://stackoverflow.com/questions/893436/best-practices-for-seaching-for-alternate-forms-of-a-word-with-lucene