Merge nearly similar rows with the help of spaCy

Question

I want to merge some rows if they are nearly similar.
Similarity can be checked by using spaCy.

df:

string                     
yellow color       
yellow color looks like 
yellow color bright
red color okay
red color blood

output:

string
yellow color looks like bright
red color okay blood

Solution:
The brute-force approach is: for every item in string, check its similarity with the other n-1 items, and merge them if it is greater than some threshold value.
Is there any other approach?
I don't have many people to ask, so I don't know how this is usually done.
One idea that comes to mind: can we pass some function to merge, so that rows are merged if it returns true and left alone otherwise?

Any other popular approaches are welcome.

Answer 1:

If you measure similarity by the occurrence of common words, you don't even need spaCy: just vectorize your texts using word counts and feed them to any clustering algorithm. AgglomerativeClustering is one of them: it is not very time-efficient for large datasets, but it is highly controllable. The only parameter you need to tune for your dataset is distance_threshold: the smaller it is, the more clusters there will be.

After clustering the texts, you can just concatenate all the unique words in each cluster (or do something smarter, depending on the ultimate problem you are trying to solve). The whole code could look like:

texts = '''yellow color       
yellow color looks like 
yellow color bright
red color okay
red color blood'''.split('\n')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer, FunctionTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.pipeline import make_pipeline
model = make_pipeline(
    CountVectorizer(), 
    Normalizer(), 
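    # toarray() returns a plain ndarray; the np.matrix from todense() is rejected by recent scikit-learn
    FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),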
    AgglomerativeClustering(distance_threshold=1.0, n_clusters=None),
)
clusters = model.fit_predict(texts)
print(clusters)  # [0 0 0 1 1]

from collections import defaultdict
cluster2words = defaultdict(list)
for text, cluster in zip(texts, clusters):
    for word in text.split():
        if word not in cluster2words[cluster]:
            cluster2words[cluster].append(word)
result = [' '.join(wordlist) for wordlist in cluster2words.values()]
print(result)  # ['yellow color looks like bright', 'red color okay blood']
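
The Normalizer step in the pipeline above matters because, for unit-length vectors, squared Euclidean distance equals 2 minus twice the cosine similarity, so a Euclidean threshold acts roughly like a cosine threshold. The following is a minimal standard-library sketch of that relationship, not part of the original answer. The helper names count_vector, l2_normalize and euclidean are made up for illustration, and the whitespace tokenizer is a simplification of what CountVectorizer does. Note also that AgglomerativeClustering applies distance_threshold to the linkage distance between clusters (Ward linkage by default), so the pairwise distances printed here only indicate the rough scale of a sensible threshold.

```python
import math
from collections import Counter

texts = ["yellow color", "yellow color bright", "red color okay"]

# Shared vocabulary, in order of first appearance.
vocab = list(dict.fromkeys(word for text in texts for word in text.split()))

def count_vector(text):
    """Word-count vector over the shared vocabulary (whitespace tokens)."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

unit = [l2_normalize(count_vector(t)) for t in texts]

for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        cosine = sum(a * b for a, b in zip(unit[i], unit[j]))
        dist = euclidean(unit[i], unit[j])
        # For unit vectors: dist**2 == 2 - 2 * cosine (up to rounding).
        print(f"{texts[i]!r} vs {texts[j]!r}: cosine={cosine:.3f}, distance={dist:.3f}")
```

Rows that share most of their words end up closer together than rows that only share "color", which is what lets a single distance threshold separate the yellow rows from the red ones.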

You need spaCy or any other framework with pre-trained models only if common words are not enough and you want to capture semantic similarity. The pipeline changes only a little:

# !python -m spacy download en_core_web_lg
import spacy
import numpy as np
nlp = spacy.load("en_core_web_lg")

model = make_pipeline(
    FunctionTransformer(lambda x: np.stack([nlp(t).vector for t in x])),
    Normalizer(), 
    AgglomerativeClustering(distance_threshold=0.5, n_clusters=None),
)
clusters = model.fit_predict(texts)
print(clusters)  # [2 0 2 0 1]

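As you can see, the clustering is clearly incorrect here, so spaCy's word vectors do not seem appropriate for this particular problem. (By default, spaCy's Doc.vector is simply the average of the token vectors, not a dedicated sentence embedding.)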

If you want to use a pretrained model to capture semantic similarity between texts, I would suggest LASER instead. It is built explicitly for sentence embeddings, and it is highly multilingual:

# !pip install laserembeddings
# !python -m laserembeddings download-models
from laserembeddings import Laser
laser = Laser()

model = make_pipeline(
    FunctionTransformer(lambda x: laser.embed_sentences(x, lang='en')),
    Normalizer(), 
    AgglomerativeClustering(distance_threshold=0.8, n_clusters=None),
)
clusters = model.fit_predict(texts)
print(clusters)  # [1 1 1 0 0]

Answer 2:

I think you have not yet thought of the possibility of having, for example:

yellow color bright
yellow color I like
yellow color looks like

In these cases, you need to decide what to do: only merge 2 of them at random? All three?

After giving this some thought, you may find that what you really want is to cluster the embeddings, that is, to separate them into non-overlapping groups of similar elements (a group can have size 1).

Luckily, there are a lot of existing solutions for this, each with its pros and cons. DBSCAN, for example, runs in O(n log n) on average when a spatial index can be used (O(n²) in the worst case).
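
One way to make the grouping decision concrete, and to plug in an arbitrary similarity function as the question suggests, is to link every pair of rows whose similarity reaches a threshold and treat each connected component as one group. The sketch below is a simpler stand-in for a clustering library, not DBSCAN itself. The names jaccard, group_similar and merge_group, and the 0.3 threshold, are assumptions chosen for illustration, and the pairwise loop is still O(n²).

```python
def jaccard(a, b):
    """Word-overlap similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def group_similar(texts, similar, threshold):
    """Group texts into connected components of the 'similar enough' graph."""
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if similar(texts[i], texts[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, text in enumerate(texts):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())

def merge_group(group):
    """Concatenate the unique words of a group in order of first appearance."""
    return " ".join(dict.fromkeys(word for text in group for word in text.split()))

texts = [
    "yellow color",
    "yellow color looks like",
    "yellow color bright",
    "red color okay",
    "red color blood",
]
merged = [merge_group(g) for g in group_similar(texts, jaccard, threshold=0.3)]
print(merged)  # ['yellow color looks like bright', 'red color okay blood']
```

Because connected components are transitive, three mutually similar rows end up in one group, which is one possible answer to the "all three?" question above. Swapping jaccard for an embedding-based similarity function would leave the grouping code unchanged.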

Source: https://stackoverflow.com/questions/61748673/merge-nearly-similar-rows-with-help-of-spacy

Tags

python

merge

nlp

data-science

spacy