Question
I'm developing a simple NLP project where we are given a set of keywords and have to find similar/phonetically similar words in a text. I've found a lot of algorithms, but no sample application.
It should also give a similarity score comparing the keyword with each word that is found.
Can anyone help me out?
def word2vec(word):
    from collections import Counter
    from math import sqrt

    # character-frequency vector, its character set, and its Euclidean norm
    cw = Counter(word)
    sw = set(cw)
    lw = sqrt(sum(c * c for c in cw.values()))
    return cw, sw, lw

def cosdis(v1, v2):
    # cosine similarity over the characters the two vectors share
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

list_A = ['e-commerce', 'ecomme', 'e-commercy', 'ecomacy', 'E-Commerce']
list_B = ['E-Commerce']

for word in list_A:
    for key in list_B:
        res = cosdis(word2vec(word), word2vec(key))
        print(res)
This code only does word-to-word comparison.
Can anyone help me out?
Answer 1:
I think you are looking for something like an API that first converts each word into IPA symbols (a form of phonetic notation); you then compare the IPA symbols.
from collections import Counter
from math import sqrt

import eng_to_ipa as ipa

def word2vec(word):
    # character-frequency vector, its character set, and its Euclidean norm
    cw = Counter(word)
    sw = set(cw)
    lw = sqrt(sum(c * c for c in cw.values()))
    return cw, sw, lw

def cosdis(v1, v2):
    # cosine similarity over the characters the two vectors share
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

list_A = ['e-commerce', 'ecomme', 'e-commercy', 'ecomacy', 'E-Commerce']
list_B = ['E-Commerce']

IPA_list_a = []
IPA_list_b = []

# convert every word to its IPA transcription before comparing
for each in list_A:
    IPA_list_a.append(ipa.convert(each))
for each in list_B:
    IPA_list_b.append(ipa.convert(each))

for word in IPA_list_a:
    for key in IPA_list_b:
        res = cosdis(word2vec(word), word2vec(key))
        print(res)
Check this out: https://github.com/mphilli/English-to-IPA
>>> import eng_to_ipa as ipa
>>> ipa.convert("The quick brown fox jumped over the lazy dog.")
'ðə kwɪk braʊn fɑks ʤəmpt ˈoʊvər ðə ˈleɪzi dɔg.'
The example is taken from the GitHub link above.
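If you also need to know which word produced each score (the loop above only prints the scores), a minimal self-contained sketch along these lines should work. It reuses the same word2vec/cosdis helpers and the eng_to_ipa package; the keyword/candidates names and the output formatting are just for illustration.

from collections import Counter
from math import sqrt

import eng_to_ipa as ipa

def word2vec(word):
    # character-frequency vector, its character set, and its Euclidean norm
    cw = Counter(word)
    return cw, set(cw), sqrt(sum(c * c for c in cw.values()))

def cosdis(v1, v2):
    # cosine similarity over the characters the two vectors share
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

keyword = 'E-Commerce'
candidates = ['e-commerce', 'ecomme', 'e-commercy', 'ecomacy', 'E-Commerce']

# compare phonetic (IPA) transcriptions instead of raw spellings
key_vec = word2vec(ipa.convert(keyword))
for word in candidates:
    score = cosdis(word2vec(ipa.convert(word)), key_vec)
    print(f'{word!r} vs {keyword!r}: {score:.3f}')

Note that eng_to_ipa falls back to the original spelling for words it cannot find in its dictionary, so for made-up tokens the comparison degrades to plain character overlap.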
Answer 2:
Cosine similarity is mostly used for comparing longer strings or documents rather than single words. I would recommend using something like Levenshtein Distance (also called Edit Distance) instead.
Edit Distance (a.k.a. Levenshtein Distance) is a measure of similarity between two strings, referred to as the source string and the target string. The distance between the strings is described as the minimum number of edit operations (deletions, insertions, or substitutions) required to transform the source into the target. So the lower the distance, the more likely it is that the two strings are very similar.
You can use it via Python's nltk library like this:
import nltk
w1 = 'mapping'
w2 = 'mappings'
nltk.edit_distance(w1, w2)
In this case the output is 1, as there is a single-letter difference between w1 and w2.
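Since the question asks for a similarity score over a text rather than a raw distance, here is one possible sketch built on nltk.edit_distance. The similarity helper, the normalization by the longer word's length, and the 0.7 threshold are illustrative choices, not something prescribed by nltk.

import nltk

def similarity(w1, w2):
    # one common convention: 1 - (edit distance / length of the longer string)
    dist = nltk.edit_distance(w1.lower(), w2.lower())
    return 1 - dist / max(len(w1), len(w2), 1)

keyword = 'e-commerce'
text = 'Our ecommerce platform also supports an e-commercy style checkout.'

for token in text.split():
    word = token.strip('.,!?')        # crude punctuation stripping
    score = similarity(keyword, word)
    if score >= 0.7:                  # arbitrary "similar enough" threshold
        print(word, round(score, 2))

This would report 'ecommerce' and 'e-commercy' as close matches to the keyword, each with a score of 0.9.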
Source: https://stackoverflow.com/questions/60427951/nlp-find-similar-phonetic-word-and-calculate-score-in-a-paragraph