Here's a puzzle...
I have two databases of the same 50000+ electronic products and I want to match products in one database to those in the other. However, the prod
I don't know that much about machine learning, but I do know Levenshtein distance is not the best approach for this type of problem.
I am currently working on a very similar problem and have found much more accurate matches using Longest Consecutive Subsequence (https://www.geeksforgeeks.org/longest-consecutive-subsequence).
You may also find Longest Common Substring helpful (https://www.geeksforgeeks.org/longest-common-substring-dp-29/).
... Or maybe even a combination of both!
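As a rough illustration of combining the two, here is a minimal Python sketch. It rests on several assumptions of mine: the subsequence measure is implemented as longest common subsequence (what I take to be the intended measure for strings), the scores are normalized by the length of the shorter string, and `combined_score` with its 50/50 weighting is a hypothetical helper. The product names are made up.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Extend the run that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def longest_common_subsequence(a: str, b: str) -> int:
    """Length of the longest shared subsequence (gaps allowed, order kept)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def combined_score(a: str, b: str, weight: float = 0.5) -> float:
    """Hypothetical blend of both measures, scaled to the range 0..1."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    substring_part = longest_common_substring(a, b) / shorter
    subsequence_part = longest_common_subsequence(a, b) / shorter
    return weight * substring_part + (1 - weight) * subsequence_part


# Made-up product names, for illustration only.
name_a = "Acme X100 Router"
name_b = "Acme X100 Wireless Router"
print(longest_common_substring(name_a, name_b))
print(longest_common_subsequence(name_a, name_b))
print(combined_score(name_a, name_b))
```

Both functions take O(len(a) * len(b)) time, so for 50000+ products you would probably want to narrow down the candidate pairs (by brand or category, say) before scoring them.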
Levenshtein is not great because a substitution costs no more than an insertion, so a similar string that merely has extra characters can score worse than an unrelated string of the same length. For example, take "Hello AAAAAA", "Hello", and "BBBBB".
"Hello" and "BBBBB" are closer by Levenshtein distance (5 substitutions versus 7 insertions), even though you would probably like "Hello" to match with "Hello AAAAAA".
Neither the subsequence approach nor the substring approach allows substitutions, so with either method "Hello" would match with "Hello AAAAAA".
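If you want to check the "Hello" example yourself, here is a small Python sketch. The `levenshtein` function is a textbook dynamic-programming version written for this post, not taken from any particular library, and `common_substring_len` is a hypothetical helper that leans on the standard library's `difflib.SequenceMatcher` to get the longest common substring length.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous block shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


query = "Hello"
for candidate in ["Hello AAAAAA", "BBBBB"]:
    print(candidate,
          levenshtein(query, candidate),
          common_substring_len(query, candidate))
```

Levenshtein gives 7 for "Hello AAAAAA" and 5 for "BBBBB", so it ranks the wrong string first (lower is better). The common substring length is 5 for "Hello AAAAAA" and 0 for "BBBBB" (higher is better), which is the ranking you would want.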