How to use vector representations of words (as obtained from Word2Vec, etc.) as features for a classifier?

独厮守ぢ asked on 2021-02-13 04:29

I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus, which becomes the size of our feature vector. For …
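For context, a minimal sketch of the BOW setup described above, assuming scikit-learn's CountVectorizer (an assumption; the vocabulary could equally be built by hand):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["the cat sat", "the dog sat on the cat"]  # toy corpus for illustration

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    # each document becomes a count vector whose length is the vocabulary size
    print(len(vectorizer.vocabulary_))  # vocabulary size == feature-vector size
    print(X.shape)                      # (num_documents, vocabulary_size)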

2 Answers
  •  一整个雨季 answered on 2021-02-13 04:44

    To get a fixed-length feature vector for each sentence, even though the number of words per sentence varies, do as follows:

    1. tokenize each sentence into its constituent words
    2. look up the word vector for each word (if a word is not in the vocabulary, ignore it)
    3. average all the word vectors you obtained
    4. this always yields a d-dimensional vector (where d is the word-vector dimension)

    Below is the code snippet:

    import numpy as np

    def getWordVecs(words, w2v_dict):
        # collect the vector for every in-vocabulary word
        vecs = []
        for word in words:
            word = word.replace('\n', '')
            try:
                vecs.append(w2v_dict[word])
            except KeyError:
                # word not in the vocabulary: skip it
                continue
        if not vecs:
            # every token was out-of-vocabulary; fall back to a zero vector
            d = len(next(iter(w2v_dict.values())))
            return np.zeros(d)
        vecs = np.array(vecs, dtype='float')
        # average (rather than sum) so sentence length does not change the scale
        return np.mean(vecs, axis=0)
    

    Here, words is the list of tokens obtained by tokenizing a sentence, and w2v_dict maps each word to its vector.
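    As an end-to-end sketch, the averaged vectors can be fed straight into any standard classifier. The toy w2v_dict, sentences, and labels below are made up for illustration; in practice the vectors would come from a trained Word2Vec model (e.g. gensim's KeyedVectors):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy 300-d lookup standing in for a real Word2Vec model (assumption);
    # in practice w2v_dict would be loaded from a trained model.
    rng = np.random.default_rng(0)
    vocab = "this movie was great boring terrible acting".split()
    w2v_dict = {w: rng.normal(size=300) for w in vocab}

    sentences = ["this movie was great", "terrible boring acting"]
    labels = [1, 0]  # made-up sentiment labels

    # one fixed-length 300-d feature vector per sentence
    X = np.vstack([getWordVecs(s.split(), w2v_dict) for s in sentences])

    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))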
