How to cluster similar sentences using BERT

难免孤独 2021-02-05 19:18

For ELMo, FastText, and Word2Vec, I'm averaging the word embeddings within a sentence and using HDBSCAN/KMeans clustering to group similar sentences.
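
A minimal sketch of that averaging approach, assuming pretrained gensim vectors; the model name and the whitespace tokenization are illustrative choices, not part of the question:

import numpy as np
import gensim.downloader

# Any KeyedVectors model works here; this one matches the Word2Vec case
w2v = gensim.downloader.load("word2vec-google-news-300")

def sentence_vector(sentence):
    # Average the vectors of the in-vocabulary tokens
    tokens = [t for t in sentence.lower().split() if t in w2v]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[t] for t in tokens], axis=0)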

A good example of t…

4 Answers
  • 2021-02-05 20:02

    BERT adds a special [CLS] token at the beginning of each sample/sentence. After fine-tuning on a downstream task, the embedding of this [CLS] token (the pooled_output, as it is called in the Hugging Face implementation) represents the sentence embedding.

    But I think you don't have labels, so you won't be able to fine-tune; therefore you cannot use the pooled_output as a sentence embedding. Instead, you should use the word embeddings in encoded_layers, which is a tensor with dimensions (12, seq_len, 768). This tensor holds the embeddings (dimension 768) from each of the 12 layers in BERT. To get the word embeddings, you can use the output of the last layer, concatenate or sum the outputs of the last four layers, and so on.

    Here is the script for extracting the features: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/extract_features.py
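
    The linked script targets the legacy pytorch-pretrained-BERT package. Below is a minimal sketch of the same idea using the current transformers library (my substitution, not the script above): request all hidden states, sum the last four layers to get per-token embeddings, then average over tokens for a crude sentence vector.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    model.eval()

    inputs = tokenizer("BERT produces contextual word embeddings.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # hidden_states is a tuple of 13 tensors (embedding layer + 12 encoder
    # layers), each of shape (batch, seq_len, 768)
    hidden_states = outputs.hidden_states
    word_embeddings = torch.stack(hidden_states[-4:]).sum(dim=0)  # (1, seq_len, 768)
    sentence_embedding = word_embeddings.mean(dim=1)              # (1, 768)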

  • 2021-02-05 20:06

    You can use Sentence Transformers to generate the sentence embeddings. These embeddings are much more meaningful than the ones obtained from bert-as-service, as they have been fine-tuned so that semantically similar sentences have a higher similarity score. If the number of sentences to be clustered is in the millions or more, you can use FAISS-based clustering, since vanilla K-means-like algorithms become prohibitively slow at that scale.
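
    A minimal sketch, assuming the sentence-transformers and scikit-learn packages are installed; the model name is one published pretrained checkpoint, chosen here only for illustration:

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    sentences = [
        "How do I cluster similar sentences?",
        "What is the best way to group sentences by meaning?",
        "The weather is nice today.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)  # (len(sentences), embedding_dim)

    # Plain K-means; swap in FAISS-based clustering for million-scale corpora
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
    print(kmeans.labels_)  # cluster id per sentence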

  • 2021-02-05 20:17

    Not sure if you still need it, but a recent paper describes how to use document embeddings to cluster documents and extract words from each cluster to represent a topic. Here are the links: https://arxiv.org/pdf/2008.09470.pdf, https://github.com/ddangelov/Top2Vec

    Inspired by the above paper, another topic-modelling algorithm that uses BERT to generate sentence embeddings is described here: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6, https://github.com/MaartenGr/BERTopic

    The above two libraries provide an end-to-end solution for extracting topics from a corpus. But if you're interested only in generating sentence embeddings, look at Gensim's doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html) or at sentence-transformers (https://github.com/UKPLab/sentence-transformers), as mentioned in the other answers; a short Doc2Vec sketch follows below. If you go with sentence-transformers, it is suggested that you train a model on your domain-specific corpus to get good results.
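
    A minimal sketch of Gensim's Doc2Vec on a toy corpus; the hyperparameters here are illustrative, not tuned:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        "how to cluster similar sentences",
        "group sentences by meaning",
        "the weather is nice today",
    ]
    documents = [TaggedDocument(words=text.split(), tags=[i])
                 for i, text in enumerate(corpus)]

    model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)
    # Infer a vector for an unseen sentence (tokenized the same way)
    vector = model.infer_vector("cluster sentences with embeddings".split())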

  • 2021-02-05 20:20

    You will need to generate BERT embeddings for the sentences first. bert-as-service provides a very easy way to generate embeddings for sentences.

    This is how you can generate BERT vectors for the list of sentences you need to cluster. It is explained very well in the bert-as-service repository: https://github.com/hanxiao/bert-as-service

    Installations:

    pip install bert-serving-server  # server
    pip install bert-serving-client  # client, independent of `bert-serving-server`
    

    Download one of the pre-trained models available at https://github.com/google-research/bert.

    Start the service:

    bert-serving-start -model_dir /your_model_directory/ -num_worker=4 
    

    Generate the vectors for the list of sentences:

    from bert_serving.client import BertClient
    bc = BertClient()
    vectors = bc.encode(your_list_of_sentences)
    

    This will give you a list of vectors. You could write them to a CSV and use any clustering algorithm, since the sentences are now reduced to numbers.
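
    For example, here is a minimal sketch of clustering those vectors with HDBSCAN, one of the algorithms mentioned in the question; it assumes the hdbscan and numpy packages are installed:

    import numpy as np
    import hdbscan

    vectors = np.asarray(vectors)  # (n_sentences, 768) from bc.encode(...)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric="euclidean")
    labels = clusterer.fit_predict(vectors)  # label -1 marks noise points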
