Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database.

For example \"

11 answers

逝去的感伤

2020-12-12 16:58

I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.

I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
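
As a rough illustration of that analysis step, the sketch below counts how many names each token appears in, which is roughly what a word-cloud generator does before drawing. The sample names and the `word_frequencies` helper are invented for illustration and are not taken from any real database.

```python
import re
from collections import Counter

def word_frequencies(names):
    """Count, for each lowercase token, how many product names contain it."""
    counts = Counter()
    for name in names:
        # A set ensures a token repeated inside one name is counted once.
        tokens = set(re.findall(r"[a-z0-9]+", name.lower()))
        counts.update(tokens)
    return counts

# Hypothetical sample names, made up for this sketch.
names = [
    "New Lenovo ThinkPad T400",
    "Lenovo T400 laptop",
    "New Canon PowerShot camera",
]
freq = word_frequencies(names)
for token, count in freq.most_common(5):
    print(token, count)
```

Note that frequency alone cannot separate "lenovo" from "new" here, since both appear twice; that is why the hand-editing pass is still needed.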

Then I would hand-edit the list to remove anything that is obviously chaff; New, for example, may be common but is not a key word.

Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
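
A minimal sketch of the "percentage of shared keywords" comparison might look like this. The `KEYWORDS` set stands in for the hand-edited list and is purely hypothetical, and dividing by the union of both keyword sets (a Jaccard ratio) is one assumed way to define the percentage.

```python
import re

# Hypothetical hand-edited keyword list; a real one would come from
# analysing the database of names.
KEYWORDS = {"lenovo", "thinkpad", "t400", "canon", "powershot"}

def keywords_of(raw_name, keywords=KEYWORDS):
    """Return the set of known keywords that occur in a raw name."""
    tokens = re.findall(r"[a-z0-9]+", raw_name.lower())
    return {t for t in tokens if t in keywords}

def shared_keyword_ratio(a, b, keywords=KEYWORDS):
    """Shared keywords divided by all keywords found in either name."""
    ka, kb = keywords_of(a, keywords), keywords_of(b, keywords)
    if not ka or not kb:
        return 0.0  # nothing to compare on
    return len(ka & kb) / len(ka | kb)

print(shared_keyword_ratio("New Lenovo ThinkPad T400", "Lenovo T400 laptop"))
print(shared_keyword_ratio("New Lenovo ThinkPad T400", "New Canon PowerShot"))
```

The word New is shared by the second pair but contributes nothing, because it is not in the keyword list.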

Not a perfect solution by any stretch, but I don't think you are expecting one?

遇见更好的自我

2020-12-12 17:04

Spell checking algorithms come to mind.

Although I could not find a good sample implementation, I believe you can modify a basic spell checking algorithm to come up with satisfactory results, i.e. by working with whole words as the unit instead of single characters.

The bits and pieces left in my memory:

Strip out all common words (a, an, the, new). What is "common" depends on context.

Take the first letter of each word and its length and make that a word key.

When a suspect word comes up, look for words with the same or a similar word key.
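
The steps above can be sketched roughly as follows. The stop-word set, the sample vocabulary, and the reading of "similar" as "same first letter, length within one character" are all assumptions made for this illustration.

```python
from collections import defaultdict

# Assumed stop words; what counts as "common" depends on context.
STOP_WORDS = {"a", "an", "the", "new"}

def word_key(word):
    """Key a non-empty word by its first letter and its length."""
    word = word.lower()
    return (word[0], len(word))

def build_index(vocabulary):
    """Group known words by word key, skipping stop words."""
    index = defaultdict(set)
    for word in vocabulary:
        w = word.lower()
        if w and w not in STOP_WORDS:
            index[word_key(w)].add(w)
    return index

def candidates(word, index, length_slack=1):
    """Known words with the same first letter and a similar length."""
    first, length = word_key(word)
    found = set()
    for n in range(length - length_slack, length + length_slack + 1):
        found |= index.get((first, n), set())
    return found

# Hypothetical vocabulary, made up for this sketch.
index = build_index(["Lenovo", "Canon", "ThinkPad", "the", "Latitude"])
print(candidates("Lenvo", index))
```

The candidates returned this way are only a shortlist; a finer comparison such as an edit distance would still be needed to pick among them.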

It might not solve your problems directly... but you say you were looking for ideas, right?

:-)

春和景丽

2020-12-12 17:07

edg's answer is in the right direction, I think - you need to distinguish key words from fluff.

Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.

If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?

You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
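
One way to sketch such an equivalence class is a normalization rule that glues a short letter prefix to the number that follows it. The rule below (one to three letters, then spaces or hyphens, then digits) is an assumed heuristic for illustration and would need tuning against real data.

```python
import re

# A short letter prefix, then spaces or hyphens, then a run of digits.
MODEL_PATTERN = re.compile(r"\b([a-z]{1,3})[\s-]+(\d+)\b")

def normalize(name):
    """Lowercase a name and collapse model tokens such as 'T-400' to 't400'."""
    return MODEL_PATTERN.sub(r"\1\2", name.lower())

for variant in ("Lenovo T-400", "Lenovo T400", "Lenovo T 400"):
    print(normalize(variant))
```

All three variants normalize to the same string, while a phrase like "Core 2 Duo" is left alone because "core" is longer than the assumed prefix limit. A name such as "the 400" would be glued together as well, which is the kind of bad pattern a rule-driven design should let you correct later.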

Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854

Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.

夕颜

2020-12-12 17:07

You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
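
To give a feel for how trigram matching tolerates misspellings, here is a minimal sketch that compares sets of three-character substrings without building any index. The padding scheme, the Jaccard-style ratio, and the sample names are assumptions for illustration, not a description of any particular product's implementation.

```python
def trigrams(text):
    """Return the set of 3-character substrings of a padded, lowercased string."""
    padded = "  " + text.lower() + " "  # padding lets word edges form trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Shared trigrams divided by all trigrams found in either string."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def best_match(query, candidates):
    """Pick the candidate with the highest trigram similarity to the query."""
    return max(candidates, key=lambda c: trigram_similarity(query, c))

# Hypothetical canonical names, made up for this sketch.
canonical = ["Canon PowerShot A590 IS", "Lenovo ThinkPad T400", "Nikon Coolpix S210"]
print(best_match("Cannon Powershot A590", canonical))
```

A real deployment would store the trigrams in an index so that only names sharing some trigrams with the query are scored.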

眼角桃花

2020-12-12 17:09

That is exactly the problem I'm working on in my spare time. What I came up with is to narrow down the scope of the search based on keywords.

In this case you could have a hierarchy:

type --> company --> model

so that you'd match "Digital Camera" for the type and "Canon" for the company, and you'd be left with a much narrower scope to search.

You could work this down even further by introducing product lines etc. But the main point is, this probably has to be done iteratively.
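
A rough sketch of that narrowing, assuming the canonical names are already organised as type, then company, then model: the `CATALOG` contents are invented, and plain substring matching is a deliberately crude stand-in for real keyword detection.

```python
# Hypothetical catalog: type -> company -> list of models.
CATALOG = {
    "digital camera": {
        "canon": ["PowerShot A590", "EOS 450D"],
        "nikon": ["Coolpix S210"],
    },
    "laptop": {
        "lenovo": ["ThinkPad T400"],
    },
}

def narrow(raw_name, catalog=CATALOG):
    """Filter (type, company, model) rows level by level using the raw name."""
    text = raw_name.lower()
    rows = [
        (t, c, m)
        for t, companies in catalog.items()
        for c, models in companies.items()
        for m in models
    ]
    for level in (0, 1):  # 0 = type, 1 = company
        hits = [row for row in rows if row[level] in text]
        if hits:  # only narrow when this level is actually mentioned
            rows = hits
    return rows

print(narrow("Canon Digital Camera A590"))
print(narrow("Lenovo T400"))
```

Whatever rows remain form the much smaller set on which a fuzzy comparison of the model part can then be run.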

佛祖请我去吃肉

2020-12-12 17:13

This is a problem of record linkage. The dedupe Python library provides a complete implementation, but even if you don't use Python, the documentation has a good overview of how to approach this problem.

Briefly, within the standard paradigm, this task is broken into three stages:

Compare the fields, in this case just the name. You can use one or more comparators for this, for example an edit distance like the Levenshtein distance, or something like the cosine distance, which compares the number of common words.

Turn an array of distance scores into a probability that a pair of records are truly about the same thing.

Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
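
The three stages can be sketched in plain Python as below. This does not use the dedupe library's API. The weights in `match_probability` are hand-picked and untrained (a real system would learn them from labelled pairs), and the clustering is a simple connected-components grouping assumed here for brevity.

```python
import math
import re
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Stage 1a: classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cosine_similarity(a, b):
    """Stage 1b: cosine similarity between word-count vectors."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def match_probability(a, b):
    """Stage 2: squash the scores into (0, 1) with assumed, untrained weights."""
    edit_sim = 1 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)
    score = 4.0 * edit_sim + 4.0 * cosine_similarity(a, b) - 4.0
    return 1 / (1 + math.exp(-score))

def cluster(names, threshold=0.5):
    """Stage 3: union-find over pairs whose probability reaches the threshold."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if match_probability(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

# Hypothetical names, made up for this sketch.
names = ["Lenovo ThinkPad T400", "Lenovo Thinkpad T400", "Canon PowerShot A590"]
print(cluster(names))
```

Comparing every pair is quadratic in the number of names, so this only suits small lists.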
