How to train a model that will result in the similarity score between two news titles?

问题

I am trying to build a Fake news classifier and I am quite new in this field. I have a column "title_1_en" which has the title for fake news and another column called "title_2_en". There are 3 target labels; "agreed", "disagreed", and "unrelated" if the title of the news in column "title_2_en" agrees, disagrees or is unrelated to that in the first column.

I have tried calculating basic cosine similarity between the two titles after converting the words of the sentences into vectors. This has resulted in the the cosine similarity score but this needs a lot of improvement as synonyms and semantic relationship has not been considered at all.

def L2(vector):
    norm_value = np.linalg.norm(vector)
    return norm_value

def Cosine(fr1, fr2):
    cos = np.dot(fr1, fr2)/(L2(fr1)*L2(fr2))
    return cos

回答1:

The most important thing here is how you convert the two sentences into vectors. There are multiple ways to do that and the most naive way is:

Convert each and every word into a vector - this can be done using standard pre-trained vectors such as word2vec or GloVe.
Now every sentence is just a bag of word vectors. This needs to be converted into a single vector, ie., mapping a full sentence text to a vector. There are many ways to do this too. For a start, just take the average of the bag of vectors in the sentence.
Compute cosine similarity between the two sentence vectors.

Spacy's similarity is a good place to start which does the averaging technique. From the docs:

By default, spaCy uses an average-of-vectors algorithm, using pre-trained vectors if available (e.g. the en_core_web_lg model). If not, the doc.tensor attribute is used, which is produced by the tagger, parser and entity recognizer. This is how the en_core_web_sm model provides similarities. Usually the .tensor-based similarities will be more structural, while the word vector similarities will be more topical. You can also customize the .similarity() method, to provide your own similarity function, which can be trained using supervised techniques.

来源：https://stackoverflow.com/questions/55755962/how-to-train-a-model-that-will-result-in-the-similarity-score-between-two-news-t

标签

nlp

classification

gensim

cosine-similarity

sentence-similarity