What is a good strategy to group similar words?

孤城傲影 2020-12-29 11:35

Say I have a list of movie names with misspellings and small variations like this:

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates o


        
5 answers
  • 2020-12-29 11:48

    Have a look at "fuzzy matching". The thread linked below lists some great tools that calculate the similarity between strings.

    I'm especially fond of the difflib module:

    >>> from difflib import get_close_matches
    >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
    ['apple', 'ape']
    >>> import keyword
    >>> get_close_matches('wheel', keyword.kwlist)
    ['while']
    >>> get_close_matches('apple', keyword.kwlist)
    []
    >>> get_close_matches('accept', keyword.kwlist)
    ['except']
    

    https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

  • 2020-12-29 11:53

    I believe there are in fact two distinct problems.

    The first is spelling correction. There is a Python implementation here:

    http://norvig.com/spell-correct.html

    The second is more functional. Here is what I'd do after the spelling correction: I would write a relation function.

    related(sentence1, sentence2) holds if and only if sentence1 and sentence2 share rare words. By rare, I mean words other than very common ones (the, what, is, etc.). You can take a look at TF-IDF weighting to determine whether two documents are related based on their words. After a bit of googling I found this:

    https://code.google.com/p/tfidf/
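
To make the relation function a bit more concrete, here is a minimal pure-Python sketch. It does not use the library linked above; the names `tokenize`, `idf_weights`, `relatedness` and `related`, and the 0.1 threshold, are assumptions chosen for illustration. Words that occur in every title get an IDF weight of zero, so only rarer shared words count towards the score.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def idf_weights(documents):
    """Inverse document frequency: words found in every document get weight 0."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {word: math.log(n / df) for word, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Map each known word in `text` to (term count * IDF weight)."""
    counts = Counter(tokenize(text))
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def relatedness(s1, s2, idf):
    """Cosine similarity of the two TF-IDF vectors, between 0 and 1."""
    u, v = tfidf_vector(s1, idf), tfidf_vector(s2, idf)
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def related(s1, s2, idf, threshold=0.1):
    # The threshold is an arbitrary assumption; tune it on your own data.
    return relatedness(s1, s2, idf) >= threshold

titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the Caribbean: Dead Man's Chest",
    "The Lord of the Rings: The Return of the King",
    "The Lord of the Rings: The Two Towers",
]
idf = idf_weights(titles)
print(related(titles[0], titles[1], idf))  # share "pirates" and "caribbean"
print(related(titles[0], titles[2], idf))  # share only "the" and "of"
```

With a list this small the weights are crude; on a real list of titles the common words would be filtered out far more reliably.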

  • 2020-12-29 12:01

    You might notice that similar strings have a large common subsequence, for example:

    "Bla bla bLa" and "Bla bla bRa" => the common subsequence is "Bla bla ba" (notice the third word)

    To find the common subsequence you can use a dynamic programming algorithm. A closely related variation is the Levenshtein distance (the distance between very similar strings is small, and it grows as the strings differ more) - http://en.wikipedia.org/wiki/Levenshtein_distance.
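
For reference, the dynamic programming idea behind the Levenshtein distance can be sketched in a few lines of Python. This is a plain textbook version that keeps only two rows of the table; the function name `levenshtein` is my own choice, and a dedicated library would be faster on long lists.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]

print(levenshtein("Bla bla bLa", "Bla bla bRa"))  # 1
print(levenshtein("kitten", "sitting"))           # 3
```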

    Also, for speed you may try to adapt the Soundex algorithm - http://en.wikipedia.org/wiki/Soundex.
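
A rough sketch of American Soundex for a single word, in case it helps. The function name `soundex` and the helper table `_CODES` are assumptions for illustration. Since Soundex works on single words, one way to adapt it to titles would be to encode each word and compare the resulting lists of codes.

```python
# Consonant groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")  # vowels (and Y) have no code
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]

print(soundex("Caribbean"), soundex("Carribean"))  # same code for both spellings
```

Keep in mind that Soundex was designed for English surnames, so it only catches misspellings that sound alike.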

    So after calculating the distance between all your strings, you have to cluster them. The simplest way is k-means (strictly speaking a variant such as k-medoids, since you only have pairwise distances), but it requires you to define the number of clusters up front. If you don't know the number of clusters, you have to use hierarchical clustering. Note that the number of clusters in your situation is the number of different movie titles + 1 (for hopelessly misspelled strings).
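
As a sketch of the clustering step when the number of clusters is unknown: single-linkage hierarchical clustering cut at a fixed distance is the same as merging every pair of strings that is closer than the cut. The code below does that with a small union-find. To stay within the standard library it swaps in `difflib.SequenceMatcher` as the distance instead of Levenshtein; the names `distance` and `cluster` and the 0.3 cut-off are assumptions, not recommended values.

```python
from difflib import SequenceMatcher

def distance(a, b):
    """Dissimilarity in [0, 1]; 0 means the strings are identical (ignoring case)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster(strings, max_distance=0.3):
    """Group strings so that any two within max_distance end up together."""
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once: O(n^2) distance computations.
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if distance(strings[i], strings[j]) <= max_distance:
                parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())

names = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the Carribean: The Curse of the Black Pearl",
    "Harry Potter and the Sorcerer's Stone",
    "Harry Potter and the Sorcerers Stone",
    "Finding Nemo",
]
for group in cluster(names):
    print(group)
```

Single linkage can chain unrelated titles together through intermediate strings, so on a large list it is worth inspecting the biggest groups by hand.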

  • 2020-12-29 12:02

    To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this:

    import re

    # BASE_DIR, remove_tags() and dofind() are defined elsewhere in the linked source.
    def dosearch(terms, searchtype, case, affffdir, files=None):
        found = []
        if files is not None:
            titlesrch = re.compile('<title>.*</title>')
            for file in files:
                title = ""
                # Only HTML files are searched.
                if not file.lower().endswith(("html", "htm")):
                    continue
                with open(BASE_DIR + affffdir + file, 'r') as f:
                    filecontents = f.read()
                titletmp = titlesrch.search(filecontents)
                if titletmp is not None:
                    # Drop the 7-character "<title>" and the 8-character "</title>".
                    title = filecontents[titletmp.start() + 7:titletmp.end() - 8]
                filecontents = remove_tags(filecontents).strip()
                if dofind(filecontents, case, searchtype, terms) > 0:
                    found.append(title)
                    found.append(file)
        return found
    

    Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/

    Regards,

    Max

  • 2020-12-29 12:02

    One approach would be to pre-process all the strings before you compare them: convert everything to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.
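
The pre-processing described above could look roughly like this. The function name `normalize` and its `strip_punctuation` flag are assumptions for illustration, and `string.punctuation` only covers ASCII punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(title, strip_punctuation=True):
    """Lowercase a title, optionally drop punctuation, and collapse whitespace."""
    title = title.lower()
    if strip_punctuation:
        title = title.translate(_PUNCT_TABLE)
    # split() with no arguments splits on any run of whitespace and trims the ends
    return " ".join(title.split())

print(normalize("  Pirates of the  Caribbean:\tThe Curse of the Black Pearl "))
```

Running every title through the same function before computing distances means differences in case or spacing no longer count as edits.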

    Levenshtein distance is commonly used to measure how similar two strings are. It should help you group strings that differ only by small spelling errors.
