This post summarizes experiments comparing the performance of fastText and Word2Vec for word embeddings. It is based on this article.
Dataset
Two corpora are used to train the embeddings: the text8 corpus and the brown corpus that ships with nltk. The ground truth is the questions-words text file, which can be downloaded here.
text8 Corpus Download
```bash
wget http://mattmahoney.net/dc/text8.zip
```
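The archive contains a single plain-text file named text8, which is what both fastText and gensim's Text8Corpus read, so unzip it first:

```bash
unzip text8.zip
```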
brown Corpus Download
```python
import nltk

# Pick the brown corpus in the downloader window
nltk.download()

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
```
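To skip the interactive downloader, NLTK also accepts a corpus id directly:

```python
nltk.download('brown')  # fetch only the Brown corpus, non-interactively
```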
Model Training
fastText and Word2Vec are each trained on the two corpora above to produce word embeddings.
fastText Training
Download and build the fastText source code, then train on both corpora:
```bash
./fasttext skipgram -input brown_corp.txt -output brown_ft
./fasttext skipgram -input text8 -output text8_ft
```
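These commands rely on fastText's defaults. To pin the run to the hyperparameters listed at the end of this post, the standard flags can be passed explicitly (a sketch; in recent fastText versions these happen to be the default values anyway):

```bash
./fasttext skipgram -input text8 -output text8_ft -dim 100 -ws 5 -epoch 5
```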
Word2Vec Training
Word2Vec training is done with gensim, with logging enabled to show progress:
```python
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

# Train on the Brown corpus and save in word2vec text format
brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

# Train on text8 and save in word2vec text format
text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
```
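Note that save_word2vec_format and load_word2vec_format were moved off the model class in later gensim releases; on gensim ≥ 1.0 the equivalent calls live on the KeyedVectors object (a sketch, adjust to your version):

```python
from gensim.models import KeyedVectors

# gensim >= 1.0: the trained vectors live on model.wv
brown_gs.wv.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')
loaded = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
```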
Comparison
Using the analogy questions in questions-words.txt as ground truth, the two embedding methods are compared on both semantic and syntactic questions.
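For reference, questions-words.txt consists of four-word analogy lines (A is to B as C is to D) grouped under `:` section headers, for example:

```
: capital-common-countries
Athens Greece Baghdad Iraq
: gram3-comparative
bad worse big bigger
```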
Based on Brown Corpus
```python
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    # Per-section accuracy
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1  # avoid division by zero
        accuracy = 100 * float(correct) / total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    # The first five sections are semantic; the rest (minus the final 'total') are syntactic
    sem_correct = sum(len(acc[i]['correct']) for i in range(5))
    sem_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100 * float(sem_correct) / sem_total))
    syn_correct = sum(len(acc[i]['correct']) for i in range(5, len(acc) - 1))
    syn_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5, len(acc) - 1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100 * float(syn_correct) / syn_total))

MODELS_DIR = 'models/'
word_analogies_file = 'questions-words.txt'

print('\nLoading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('\nLoading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
```
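In later gensim releases model.accuracy was replaced; under gensim ≥ 3.x the same evaluation looks roughly like this (a sketch; the return value is a (score, sections) pair rather than a list):

```python
from gensim.models import KeyedVectors

# gensim >= 3.x: analogy evaluation lives on KeyedVectors
kv = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
score, sections = kv.evaluate_word_analogies(word_analogies_file)
print('Overall accuracy: {:.2f}%'.format(100 * score))
```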
The results:
```
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
36/182, 19.78%, Section: family
498/702, 70.94%, Section: gram1-adjective-to-adverb
110/132, 83.33%, Section: gram2-opposite
675/1056, 63.92%, Section: gram3-comparative
140/210, 66.67%, Section: gram4-superlative
426/650, 65.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
153/1260, 12.14%, Section: gram7-past-tense
318/552, 57.61%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2601/5086, 51.14%, Section: total

Semantic: 36/182, Accuracy: 19.78%
Syntactic: 2565/4904, Accuracy: 52.30%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
54/182, 29.67%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
72/1056, 6.82%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
14/650, 2.15%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
28/1260, 2.22%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
188/5086, 3.70%, Section: total

Semantic: 54/182, Accuracy: 29.67%
Syntactic: 134/4904, Accuracy: 2.73%
```
The results show that fastText's semantic accuracy is slightly worse than Word2Vec's, but its syntactic accuracy is clearly better. As [1] explains, this is because fastText represents each word embedding as the sum of the embeddings of its character n-grams, so morphologically similar words end up with similar embeddings. For example:
$$
\text{embedding}(\text{amazing}) - \text{embedding}(\text{amazingly}) \approx \text{embedding}(\text{calm}) - \text{embedding}(\text{calmly})
$$
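A quick sanity check of this relation on the trained vectors (a sketch reusing the ft_model loaded above; the actual neighbors depend on the training run):

```python
# amazing - amazingly + calmly should land near calm
print(ft_model.most_similar(positive=['amazing', 'calmly'],
                            negative=['amazingly'], topn=3))
```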
Based on text8 Corpus
```python
print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
```
The results:
```
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

322/506, 63.64%, Section: capital-common-countries
609/1452, 41.94%, Section: capital-world
36/268, 13.43%, Section: currency
286/1520, 18.82%, Section: city-in-state
134/306, 43.79%, Section: family
556/756, 73.54%, Section: gram1-adjective-to-adverb
186/306, 60.78%, Section: gram2-opposite
838/1260, 66.51%, Section: gram3-comparative
270/506, 53.36%, Section: gram4-superlative
556/992, 56.05%, Section: gram5-present-participle
1293/1371, 94.31%, Section: gram6-nationality-adjective
490/1332, 36.79%, Section: gram7-past-tense
888/992, 89.52%, Section: gram8-plural
365/650, 56.15%, Section: gram9-plural-verbs
6829/12217, 55.90%, Section: total

Semantic: 1387/4052, Accuracy: 34.23%
Syntactic: 5442/8165, Accuracy: 66.65%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

153/506, 30.24%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
27/268, 10.07%, Section: currency
172/1571, 10.95%, Section: city-in-state
218/306, 71.24%, Section: family
88/756, 11.64%, Section: gram1-adjective-to-adverb
45/306, 14.71%, Section: gram2-opposite
716/1260, 56.83%, Section: gram3-comparative
179/506, 35.38%, Section: gram4-superlative
325/992, 32.76%, Section: gram5-present-participle
702/1371, 51.20%, Section: gram6-nationality-adjective
343/1332, 25.75%, Section: gram7-past-tense
401/992, 40.42%, Section: gram8-plural
219/650, 33.69%, Section: gram9-plural-verbs
3836/12268, 31.27%, Section: total

Semantic: 818/4103, Accuracy: 19.94%
Syntactic: 3018/8165, Accuracy: 36.96%
```
On the larger corpus, fastText's advantage is even more pronounced, although word2vec's syntactic accuracy also improves noticeably. Overall, fastText produces better word embeddings than word2vec, especially where syntactic information matters.
Hyperparameters
Gensim word2vec and fastText were run with similar hyperparameters: dim_size = 100, window_size = 5, num_epochs = 5. Despite the many surface similarities, though, the two models are quite different.
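For reference, this is how those settings map onto the pre-1.0 gensim constructor used above (size, window and iter are the old parameter names; gensim 4.x renamed them to vector_size, window and epochs):

```python
# Explicit hyperparameters for the gensim runs (old-style parameter names)
brown_gs = Word2Vec(brown.sents(), size=100, window=5, iter=5)
```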
Reference
[1] Bojanowski, Grave, Joulin, Mikolov. Enriching Word Vectors with Subword Information. 2016.
[2] Mikolov, Chen, Corrado, Dean. Efficient Estimation of Word Representations in Vector Space. 2013.