Fuzzy matching of product names

长发绾君心 2020-12-12 16:28

I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database.

For example \"

11 answers
  •  时光说笑
    2020-12-12 17:22

    I worked on exactly the same problem in the past. I used an NLP method, a TF-IDF vectorizer, to assign a weight to each word. For example, in your case:

    Canon PowerShot a20IS

    • Canon --> weight = 0.05 (not a very distinguishing word)
    • PowerShot --> weight = 0.37 (can be distinguishing)
    • a20IS --> weight = 0.96 (very distinguishing)

    This tells your model which words to care about and which to ignore. I got quite good matches thanks to TF-IDF. One caveat: a20IS will not be recognized as a20 IS, so you may want to use some kind of regex to normalize such cases.
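
One possible way to handle the a20IS / a20 IS caveat is sketched below. It is only an illustration, not the code used in this answer: the function name `normalize_tokens` is hypothetical, and splitting on every letter/digit boundary is an assumed normalization rule.

```python
import re


def normalize_tokens(name):
    """Lowercase a product name and split it into runs of letters or digits.

    Assumption: breaking at every letter/digit boundary is acceptable,
    so "a20IS" and "a20 IS" produce the same token list.
    """
    return re.findall(r"[a-z]+|[0-9]+", name.lower())


print(normalize_tokens("Canon PowerShot a20IS"))
# ['canon', 'powershot', 'a', '20', 'is']
print(normalize_tokens("Canon PowerShot a20 IS"))
# ['canon', 'powershot', 'a', '20', 'is']
```

The trade-off is that a distinctive token such as a20IS is broken into shorter, more common pieces, so you may prefer a narrower regex that targets only the model-number patterns in your own data.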

    After that, you can compare the weighted vectors with a numeric measure such as cosine similarity.
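
To make the TF-IDF plus cosine similarity idea concrete, here is a minimal standard-library sketch. It is not the code used in this answer: the function names, the smoothed IDF formula, and the small catalogue are all assumptions chosen for illustration, so the weights it produces will not equal the example weights quoted above.

```python
import math
import re
from collections import Counter


def tokenize(name):
    # Assumed tokenizer: lowercase, then runs of letters or digits.
    return re.findall(r"[a-z]+|[0-9]+", name.lower())


def fit_idf(names):
    """Compute a smoothed inverse document frequency for every token."""
    n = len(names)
    df = Counter()
    for name in names:
        df.update(set(tokenize(name)))  # count each token once per name
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def vectorize(name, idf):
    """Return an L2-normalized sparse TF-IDF vector (dict token -> weight)."""
    tf = Counter(tokenize(name))
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def best_match(query, canonical_names, idf):
    """Return (score, canonical_name) for the most similar canonical name."""
    q = vectorize(query, idf)
    return max((cosine(q, vectorize(c, idf)), c) for c in canonical_names)


# Hypothetical catalogue, for illustration only.
catalogue = [
    "Canon PowerShot A20 IS",
    "Canon PowerShot S95",
    "Canon EOS 5D",
    "Nikon Coolpix S3100",
]
idf = fit_idf(catalogue)
score, name = best_match("canon powershot a20IS digital camera", catalogue, idf)
print(name)
```

Tokens that never occur in the catalogue (here "digital" and "camera") are dropped from the query vector, which is one simple way to ignore source-specific noise. With scikit-learn installed, `TfidfVectorizer` and `cosine_similarity` cover the same steps with far less code.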
