Fuzzy sentence search algorithms

Submitted by 牧云@^-^@ on 2019-12-21 20:15:14

Question


Suppose I have a set of about 10,000 phrases, each 7 to 20 words long on average, in which I want to find a given phrase. The phrase I am looking for may contain errors: it might be missing one or two words, have some words misplaced, or contain some random extra words. For example, my database contains "As I was riding my red bike, I saw Christine", and I want it to match "As I was riding my blue bike, saw Christine" or "I was riding my bike, I saw Christine and Marion". What would be a good approach to this problem? I know about Levenshtein distance, and I also suspect that this problem may have no easy, good solution.


Answer 1:


A good text search engine will provide capabilities such as you describe, fsh. A typical approach is to create a query that matches if any of the words occurs, then order the results by a weight based on the number of query terms that occur in proximity to each other, with each term weighted inversely to its probability of occurring, since uncommon words are less likely to co-occur by chance. There's a whole theory of this sort of thing called information retrieval, but maybe you know about that. You'd also want to account for word-level fuzziness by normalizing case, punctuation and the like, applying some basic linguistic transformations (stemming), and in some cases introducing a dictionary of synonyms, especially when there is domain knowledge available to condition it.
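To make the ranking idea above concrete, here is a minimal Python sketch, not what any particular search engine does. It normalizes case and punctuation, matches on any shared word, and weights each word by an inverse document frequency so that rare words count for more. The function names (`tokenize`, `build_idf`, `score`, `search`), the smoothed IDF formula, and the weighted-overlap score are all assumptions chosen for illustration. Proximity, stemming, and synonyms are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude normalization: lowercase, then keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(phrases):
    """Smoothed inverse document frequency: rarer words get larger weights."""
    n = len(phrases)
    doc_freq = Counter()
    for phrase in phrases:
        doc_freq.update(set(tokenize(phrase)))  # count each word once per phrase
    return {w: math.log((n + 1) / (c + 1)) + 1.0 for w, c in doc_freq.items()}


def score(query, phrase, idf, default_idf):
    """IDF-weighted overlap: weight of shared words over weight of all words."""
    q_words = set(tokenize(query))
    p_words = set(tokenize(phrase))
    shared = sum(idf.get(w, default_idf) for w in q_words & p_words)
    total = sum(idf.get(w, default_idf) for w in q_words | p_words)
    return shared / total if total else 0.0


def search(query, phrases, top_k=3):
    """Return the top_k phrases ranked by weighted overlap with the query."""
    idf = build_idf(phrases)
    # Words never seen in the collection are treated as maximally rare.
    default_idf = math.log(len(phrases) + 1) + 1.0
    ranked = sorted(
        phrases,
        key=lambda p: score(query, p, idf, default_idf),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    database = [
        "As I was riding my red bike, I saw Christine",
        "The weather is nice today",
        "Christine bought a new car",
    ]
    for q in (
        "As I was riding my blue bike, saw Christine",
        "I was riding my bike, I saw Christine and Marion",
    ):
        print(q, "->", search(q, database, top_k=1))
```

For 10,000 short phrases, a linear scan like this is probably fast enough. A real engine would instead build an inverted index and use a more refined scoring function.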

If you're interested in messing around with this stuff, try an open-source search engine. This article by Vik gives a reasonable survey from the perspective of 2009, and this one by Middleton and Baeza-Yates gives a good detailed introduction to the topic.



Source: https://stackoverflow.com/questions/7113008/fuzzy-sentence-search-algorithms
