发表新帖

发表新帖

How to cluster similar sentences using BERT

前端未结

关注

 4  2062

难免孤独 2021-02-05 19:18

For ElMo, FastText and Word2Vec, I\'m averaging the word embeddings within a sentence and using HDBSCAN/KMeans clustering to group similar sentences.

A good example of t

4条回答

情深已故 (楼主)

2021-02-05 20:02

Bert adds a special [CLS] token at the beginning of each sample/sentence. After fine-tuning on a downstream task, the embedding of this [CLS] token or pooled_output as they call it in the hugging face implementation represents the sentence embedding.

But I think that you don't have labels so you won't be able to fine-tune, therefore you cannot use the pooled_output as a sentence embedding. Instead you should use the word embeddings in encoded_layers which is a tensor with dimensions (12,seq_len, 768). In this tensor you have the embeddings(dimension 768) from each of the 12 layers in Bert. To get the word embeddings you can use the output of the last layer, you can concatenate or sum the output of the last 4 layers and so on.

Here is the script for extracting the features https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/extract_features.py

0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...

热议问题