What is the best algorithm for matching two string containing less than 10 words in latin script

前端 未结 5 1822
刺人心
刺人心 2021-02-04 09:35

I\'m comparing song titles, using Latin script (although not always), my aim is an algorithm that gives a high score if the two song titles seem to be the same same title and a

5条回答
  •  借酒劲吻你
    2021-02-04 10:30

    They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need.

    I'm using the JaccardSimilarity to match names. I chose the JaccardSimilarity because it was reasonably fast and for short strings excelled in matching names with common typo's while quickly degrading the score for anything else. Gives extra weight to spaces. It is also insensitive to word order. I needed this behavior because the impact of a false positive was much much higher then that off a false negative, spaces could be typos but not often and word order was not that important.

    Note that this was done in combination with a simplifier that removes non-diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizes that standardizes all word separator symbols to a single space. Finally the names are parsed to pick out initials, pre- inner- and suffixes. This because names have a structure and format to them that is rather resistant to just string comparison.

    To make your choice you need to make a list of what criteria you want and then look for an algorithm that satisfied those criteria. You can also make a reasonably large test set and run all algorithms on that test set too see what the trade offs are with respect to time, number of positives, false positives, false negatives and negatives, the classes of errors your system should handle, ect, ect.

    If you are still unsure of your choice, you can also setup your system to switch the exact comparison algorithms at run time. This allows you to do an A-B test and see which algorithm works best in practice.

    TLDR; which algorithm you want depends on what you need, if you don't know what you need make sure you can change it later on and run tests on the fly.

提交回复
热议问题