How to cluster similar sentences using BERT

难免孤独 2021-02-05 19:18

For ELMo, FastText and Word2Vec, I'm averaging the word embeddings within a sentence and using HDBSCAN/KMeans clustering to group similar sentences.

A good example of t

4 Answers
  •  梦如初夏
    2021-02-05 20:20

    You will need to generate BERT embeddings for the sentences first. bert-as-service provides a very easy way to generate embeddings for sentences.

    This is how you can generate BERT vectors for the list of sentences you need to cluster. The process is explained very well in the bert-as-service repository: https://github.com/hanxiao/bert-as-service

    Installation:

    pip install bert-serving-server  # server
    pip install bert-serving-client  # client, independent of `bert-serving-server`
    

    Download one of the pre-trained models available at https://github.com/google-research/bert

    Start the service:

    bert-serving-start -model_dir /your_model_directory/ -num_worker=4 
    

    Generate the vectors for the list of sentences:

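    from bert_serving.client import BertClient
    bc = BertClient()  # connects to the server started above (localhost by default)
    vectors = bc.encode(your_list_of_sentences)  # one fixed-length vector per sentence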
    

    This gives you one vector per sentence (by default, encode returns them as a NumPy array). Since each sentence is now just a row of numbers, you can write the vectors to a CSV file and run any clustering algorithm on them.
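
As a rough sketch of that last step, the snippet below clusters sentence vectors with scikit-learn's KMeans. It is only an illustration under a few assumptions: the helper name cluster_sentences is hypothetical, the vectors are synthetic 768-dimensional stand-ins for what bc.encode would return (so the snippet runs without a BERT server), and the number of clusters is picked by hand. The L2 normalisation is an optional choice, not part of bert-as-service.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_sentences(sentences, vectors, n_clusters):
    """Hypothetical helper: group sentences by the KMeans cluster of their vector."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # L2-normalise each row so that Euclidean distance between rows
    # follows cosine similarity (for unit vectors, d^2 = 2 - 2*cos).
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors = vectors / np.clip(norms, 1e-12, None)
    # Fixed random_state so repeated runs give the same assignment.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    groups = {}
    for sentence, label in zip(sentences, labels):
        groups.setdefault(int(label), []).append(sentence)
    return labels, groups


# Stand-in data (assumption): in practice `vectors` would be bc.encode(sentences).
sentences = [
    "The cat sat on the mat.",
    "A kitten slept on the rug.",
    "My cat likes the sofa.",
    "Stocks fell sharply today.",
    "The market closed lower.",
    "Shares dropped after the report.",
]
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 768))
# First three rows lie near centers[0], last three near centers[1].
vectors = np.vstack([centers[i // 3] + 0.01 * rng.normal(size=768) for i in range(6)])

labels, groups = cluster_sentences(sentences, vectors, n_clusters=2)
for label, members in sorted(groups.items()):
    print(label, members)
```

If the number of clusters is not known in advance, a density-based algorithm such as HDBSCAN (mentioned in the question) can be applied to the same array instead of KMeans.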
