Best machine learning technique for matching product strings

说谎 2020-12-23 12:50

Here's a puzzle...

I have two databases of the same 50000+ electronic products and I want to match products in one database to those in the other. However, the prod

3 answers
  •  礼貌的吻别
    2020-12-23 13:52

    I don't know that much about machine learning, but I do know Levenshtein distance is not the best approach for this type of problem.

    I am currently working on a very similar problem, and have found much more accurate matches using Longest Consecutive Subsequence (https://www.geeksforgeeks.org/longest-consecutive-subsequence).

    You may also find Longest Common Substring helpful (https://www.geeksforgeeks.org/longest-common-substring-dp-29/).

    ... Or maybe even a combination of both!

    Levenshtein is not great because it allows substitutions, which can easily penalize similar strings when one of them has extra characters. For example, take the three strings "Hello AAAAAA", "Hello", and "BBBBB".

    "Hello" and "BBBBB" are closer by Levenshtein distance (5 substitutions) than "Hello" and "Hello AAAAAA" (7 insertions), even though you would probably like "Hello" to match with "Hello AAAAAA".

    Neither of the two methods above allows substitutions, so with either one "Hello" would match with "Hello AAAAAA".
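
To make the comparison concrete, here is a minimal Python sketch of the example above. It is only an illustration, not a tuned matcher: the function names `levenshtein` and `longest_common_substring` are made up for this post, and both are plain textbook dynamic-programming versions rather than calls into any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance allowing insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def longest_common_substring(a: str, b: str) -> str:
    """Longest run of characters that appears contiguously in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


query = "Hello"
candidates = ["Hello AAAAAA", "BBBBB"]

# Levenshtein prefers the unrelated string (smaller distance wins).
by_edit_distance = min(candidates, key=lambda c: levenshtein(query, c))

# Longest common substring prefers the string that contains the query.
by_common_substring = max(
    candidates, key=lambda c: len(longest_common_substring(query, c))
)

print(by_edit_distance)     # BBBBB
print(by_common_substring)  # Hello AAAAAA
```

For real product names you would probably want to normalize case and whitespace first, and divide the substring length by the length of the shorter string so that long names are not favored automatically. Those choices are assumptions on my part, not something the example above depends on.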
