Using word2vec to classify words into categories

Asked 2020-12-28 09:08

BACKGROUND

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john', 'jay', 'dan', ...]

2 Answers
  • 2020-12-28 09:25

    If you're looking for the simplest/fastest solution, then I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.

    Here's my solution:

    import numpy as np
    
    # Category -> words
    data = {
      'Names': ['john','jay','dan','nathan','bob'],
      'Colors': ['yellow', 'red','green'],
      'Places': ['tokyo','beijing','washington','mumbai'],
    }
    # Words -> category
    categories = {word: key for key, words in data.items() for word in words}
    
    # Load the whole embedding matrix
    embeddings_index = {}
    with open('glove.6B.100d.txt', encoding='utf-8') as f:
      for line in f:
        values = line.split()
        word = values[0]
        embed = np.array(values[1:], dtype=np.float32)
        embeddings_index[word] = embed
    print('Loaded %s word vectors.' % len(embeddings_index))
    # Embeddings for available words
    data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories}
    
    # Processing the query
    def process(query):
      query_embed = embeddings_index[query]
      scores = {}
      for word, embed in data_embeddings.items():
        category = categories[word]
        dist = query_embed.dot(embed)
        dist /= len(data[category])
        scores[category] = scores.get(category, 0) + dist
      return scores
    
    # Testing
    print(process('pink'))
    print(process('frank'))
    print(process('moscow'))
    

    In order to run it, you'll have to download and unpack the pre-trained GloVe data from https://nlp.stanford.edu/projects/glove/ (careful, ~800 MB!).
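    If it helps, here's a minimal fetch-and-unpack sketch; the direct archive URL is an assumption taken from the GloVe project page, so check there if it has moved:

    import urllib.request, zipfile

    # Assumed direct URL for the glove.6B archive (~800 MB download)
    url = 'https://nlp.stanford.edu/data/glove.6B.zip'
    urllib.request.urlretrieve(url, 'glove.6B.zip')
    # Pull out just the 100-dimensional vectors used above
    with zipfile.ZipFile('glove.6B.zip') as zf:
      zf.extract('glove.6B.100d.txt')

    Upon running, it should produce something like this: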

    {'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
    {'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
    {'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}
    

    ... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf scores; a rough sketch of that filtering step follows below. Remember that the model size depends only on the data you have and the words you might want to be able to query.
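    For illustration, here's one way that filtering might look, assuming you compute tf-idf over your own corpus with scikit-learn (the corpus, the 0.1 threshold, and the output file name are made-up placeholders; get_feature_names_out needs scikit-learn >= 1.0):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ['john went to tokyo', 'the wall is yellow']  # your own documents
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)
    # Keep each word's highest tf-idf score across all documents
    scores = dict(zip(vectorizer.get_feature_names_out(),
                      tfidf.max(axis=0).toarray().ravel()))
    keep = {word for word, score in scores.items() if score >= 0.1}

    # Copy only the kept words into a smaller embedding file
    with open('glove.6B.100d.txt', encoding='utf-8') as src, \
         open('glove.small.txt', 'w', encoding='utf-8') as dst:
      for line in src:
        if line.split(' ', 1)[0] in keep:
          dst.write(line)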

  • 2020-12-28 09:25

    Also, for what it's worth, PyTorch has a good and fast implementation of GloVe these days (via torchtext); see the sketch below.
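    Here's a minimal sketch of the same per-category scoring on top of torchtext's packaged vectors (the torchtext.vocab API has shifted between releases, so treat this as an approximation rather than exact call signatures for your version):

    import torch
    from torchtext.vocab import GloVe

    # Downloads and caches the glove.6B vectors on first use (~800 MB)
    glove = GloVe(name='6B', dim=100)

    data = {
      'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
      'Colors': ['yellow', 'red', 'green'],
      'Places': ['tokyo', 'beijing', 'washington', 'mumbai'],
    }

    def process(query):
      query_vec = glove[query]  # unknown words map to a zero vector by default
      scores = {}
      for category, words in data.items():
        vecs = torch.stack([glove[w] for w in words])
        # Average dot product against the category's words, as in the answer above
        scores[category] = (vecs @ query_vec).mean().item()
      return scores

    print(process('pink'))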
