Question
I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches:
A vector embedding (word2vec, GloVe, or fastText): average the word vectors in a document and use cosine similarity.
Bag-of-words: tf-idf or one of its variants, such as BM25.
Will one of these yield a significantly better result? Has anyone done a quantitative comparison of tf-idf versus averaged word2vec vectors for document similarity?
Is there another approach that allows a document's vector to be refined dynamically as more text is added?
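For concreteness, a rough sketch of the two candidate pipelines (assuming gensim and scikit-learn; the pre-trained vector file and the toy docs list are placeholders, not part of the original question):

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "a dog chased the cat", "stock prices fell sharply"]

# Approach 1: average pre-trained word vectors per document, compare with cosine.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def avg_vector(text):
    words = [w for w in text.split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0) if words else np.zeros(wv.vector_size)

emb = np.vstack([avg_vector(d) for d in docs])
print(cosine_similarity(emb))    # dense, "semantic" similarity

# Approach 2: tf-idf bag-of-words, also compared with cosine.
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf))  # sparse, lexical-overlap similarity
```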
Answer 1:
- doc2vec or word2vec?
According to the article [Learning Semantic Similarity for Very Short Texts, 2015, IEEE], doc2vec (paragraph2vec) performs poorly on short documents.
- Short documents?
If you want to compare the similarity of short documents, you might want to vectorize each document via word2vec instead.
- How to construct the document vector?
For example, you can construct the document vector as a tf-idf-weighted average of its word vectors (see the sketch after these notes).
- Similarity measure
In addition, I recommend using TS-SS rather than cosine or Euclidean distance as the similarity measure.
Please refer to the following article, or to the summary on GitHub below: "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"
https://github.com/taki0112/Vector_Similarity
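A rough sketch of both ideas, tf-idf-weighted averaging plus TS-SS (assuming gensim word vectors and scikit-learn; the model file is a placeholder, and the TS-SS formulas follow the paper/repository above):

```python
import math
import numpy as np
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
docs = ["the cat sat on the mat", "a dog chased the cat"]

# Weight each word vector by the word's tf-idf score in the document.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
vocab = tfidf.get_feature_names_out()

def doc_vector(doc_idx):
    row = X[doc_idx].toarray().ravel()
    vecs, weights = [], []
    for term_idx in row.nonzero()[0]:
        term = vocab[term_idx]
        if term in wv:
            vecs.append(wv[term])
            weights.append(row[term_idx])
    if not vecs:
        return np.zeros(wv.vector_size)
    return np.average(vecs, axis=0, weights=weights)

def ts_ss(a, b):
    """TS-SS: triangle area * sector area; smaller means more similar."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = math.acos(np.clip(cos, -1.0, 1.0)) + math.radians(10)  # +10 degrees avoids a zero area
    triangle = np.linalg.norm(a) * np.linalg.norm(b) * math.sin(theta) / 2
    ed = np.linalg.norm(a - b)                                     # Euclidean distance
    md = abs(np.linalg.norm(a) - np.linalg.norm(b))                # magnitude difference
    sector = math.pi * (ed + md) ** 2 * math.degrees(theta) / 360
    return triangle * sector

print(ts_ss(doc_vector(0), doc_vector(1)))
```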
thank you
Answer 2:
You have to try it: the answer may vary based on your corpus and application-specific perception of 'similarity'. Effectiveness may especially vary based on typical document lengths, so if "rapidly growing with time" also means "growing arbitrarily long", that could greatly affect what works over time (requiring adaptations for longer docs).
Also note that 'Paragraph Vectors' – where a vector is co-trained like a word vector to represent a range-of-text – may outperform a simple average-of-word-vectors, as an input to similarity/classification tasks. (Many references to 'Doc2Vec' specifically mean 'Paragraph Vectors', though the term 'Doc2Vec' is sometimes also used for any other way of turning a document into a single vector, like a simple average of word-vectors.)
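A minimal Paragraph Vectors sketch using gensim's Doc2Vec (gensim 4.x names; the corpus and hyperparameters are illustrative). Re-running infer_vector() on a document's current full text is also one way to refresh its vector as the document grows:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["the cat sat on the mat", "a dog chased the cat", "stock prices fell sharply"]
corpus = [TaggedDocument(words=d.split(), tags=[str(i)]) for i, d in enumerate(docs)]

model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=40)

# Nearest training documents to doc "0", by cosine similarity of paragraph vectors.
print(model.dv.most_similar("0"))

# When a document grows, infer a fresh vector from its current full text.
grown = "the cat sat on the mat and then slept all afternoon".split()
print(model.dv.most_similar([model.infer_vector(grown)]))
```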
You may also want to look at "Word Mover's Distance" (WMD), a measure of similarity between two texts that uses word-vectors, though not via any simple average. (However, it can be expensive to calculate, especially for longer documents.) For classification, there's a recent refinement called "Supervised Word Mover's Distance" which reweights/transforms word vectors to make them more sensitive to known categories. With enough evaluation/tuning data about which of your documents should be closer than others, an analogous technique could probably be applied to generic similarity tasks.
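A Word Mover's Distance sketch with gensim's wmdistance (it needs an optimal-transport backend such as POT/pyemd installed; the vector file is a placeholder):

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

doc1 = "the president greets the press in chicago".split()
doc2 = "obama speaks to the media in illinois".split()

# Lower WMD means more similar; cost grows quickly with document length.
print(wv.wmdistance(doc1, doc2))
```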
Answer 3:
You also might consider trying Jaccard similarity, which uses basic set algebra to determine the verbal overlap between two documents (although it is somewhat similar to a BOW approach). A nice intro to it can be found here.
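As a toy illustration of that set-algebra idea (plain word sets, no weighting):

```python
def jaccard(doc_a: str, doc_b: str) -> float:
    # |intersection| / |union| of the two documents' word sets.
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard("the cat sat on the mat", "a dog chased the cat"))  # shared words: {"the", "cat"}
```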
Source: https://stackoverflow.com/questions/42643074/document-similarity-vector-embedding-versus-tf-idf-performance