This post summarizes experiments comparing the performance of fastText and Word2Vec for word embeddings. It is based on this article.
Dataset
Two corpora are used to train the embeddings: the text8 corpus and the brown corpus that ships with nltk. The ground truth is the questions-words text file, which can be downloaded here.
text8 Corpus Download
```bash
wget http://mattmahoney.net/dc/text8.zip
```
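The archive contains a single plain-text file named text8, which is what both fastText and gensim's Text8Corpus read, so unzip it first:

```bash
unzip text8.zip
```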
brown Corpus Download
```python
import nltk

# Pick the brown corpus in the downloader window
nltk.download()

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
```
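To skip the interactive downloader, NLTK also accepts a corpus id directly:

```python
nltk.download('brown')  # fetch only the Brown corpus, non-interactively
```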
Model Training
fastText and Word2Vec are each trained on the two corpora above to produce word embeddings.
fastText Training
Download and build the fastText source code, then train on both corpora:
```bash
./fasttext skipgram -input brown_corp.txt -output brown_ft
./fasttext skipgram -input text8 -output text8_ft
```
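These commands rely on fastText's defaults. To pin the run to the hyperparameters listed at the end of this post, the standard flags can be passed explicitly (a sketch; in recent fastText versions these happen to be the default values anyway):

```bash
./fasttext skipgram -input text8 -output text8_ft -dim 100 -ws 5 -epoch 5
```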
Word2Vec Training
Word2Vec training is done with gensim, with logging enabled to show progress:
```python
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

# Train on the Brown corpus and save in word2vec text format
brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

# Train on text8 and save in word2vec text format
text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
```
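Note that save_word2vec_format and load_word2vec_format were moved off the model class in later gensim releases; on gensim ≥ 1.0 the equivalent calls live on the KeyedVectors object (a sketch, adjust to your version):

```python
from gensim.models import KeyedVectors

# gensim >= 1.0: the trained vectors live on model.wv
brown_gs.wv.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')
loaded = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
```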
Comparison
Using the analogy questions in questions-words.txt as ground truth, the two embedding methods are compared on both semantic and syntactic questions.
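For reference, questions-words.txt consists of four-word analogy lines (A is to B as C is to D) grouped under `:` section headers, for example:

```
: capital-common-countries
Athens Greece Baghdad Iraq
: gram3-comparative
bad worse big bigger
```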
Based on Brown Corpus
```python
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    # Per-section accuracy
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1  # avoid division by zero
        accuracy = 100 * float(correct) / total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    # The first five sections are semantic; the rest (minus the final 'total') are syntactic
    sem_correct = sum(len(acc[i]['correct']) for i in range(5))
    sem_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100 * float(sem_correct) / sem_total))
    syn_correct = sum(len(acc[i]['correct']) for i in range(5, len(acc) - 1))
    syn_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5, len(acc) - 1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100 * float(syn_correct) / syn_total))

MODELS_DIR = 'models/'
word_analogies_file = 'questions-words.txt'

print('\nLoading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('\nLoading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
```
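In later gensim releases model.accuracy was replaced; under gensim ≥ 3.x the same evaluation looks roughly like this (a sketch; the return value is a (score, sections) pair rather than a list):

```python
from gensim.models import KeyedVectors

# gensim >= 3.x: analogy evaluation lives on KeyedVectors
kv = KeyedVectors.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
score, sections = kv.evaluate_word_analogies(word_analogies_file)
print('Overall accuracy: {:.2f}%'.format(100 * score))
```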
The results:
```
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
36/182, 19.78%, Section: family
498/702, 70.94%, Section: gram1-adjective-to-adverb
110/132, 83.33%, Section: gram2-opposite
675/1056, 63.92%, Section: gram3-comparative
140/210, 66.67%, Section: gram4-superlative
426/650, 65.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
153/1260, 12.14%, Section: gram7-past-tense
318/552, 57.61%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2601/5086, 51.14%, Section: total

Semantic: 36/182, Accuracy: 19.78%
Syntactic: 2565/4904, Accuracy: 52.30%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
54/182, 29.67%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
72/1056, 6.82%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
14/650, 2.15%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
28/1260, 2.22%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
188/5086, 3.70%, Section: total

Semantic: 54/182, Accuracy: 29.67%
Syntactic: 134/4904, Accuracy: 2.73%
```
The results show that fastText's semantic accuracy is slightly worse than Word2Vec's, but its syntactic accuracy is clearly better. As [1] explains, this is because fastText represents each word embedding as the sum of the embeddings of its character n-grams, so morphologically similar words end up with similar embeddings. For example:
$$
\text{embedding}(\text{amazing}) - \text{embedding}(\text{amazingly}) \approx \text{embedding}(\text{calm}) - \text{embedding}(\text{calmly})
$$
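A quick sanity check of this relation on the trained vectors (a sketch reusing the ft_model loaded above; the actual neighbors depend on the training run):

```python
# amazing - amazingly + calmly should land near calm
print(ft_model.most_similar(positive=['amazing', 'calmly'],
                            negative=['amazingly'], topn=3))
```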
Based on text8 Corpus
```python
print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
```
The results:
```
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

322/506, 63.64%, Section: capital-common-countries
609/1452, 41.94%, Section: capital-world
36/268, 13.43%, Section: currency
286/1520, 18.82%, Section: city-in-state
134/306, 43.79%, Section: family
556/756, 73.54%, Section: gram1-adjective-to-adverb
186/306, 60.78%, Section: gram2-opposite
838/1260, 66.51%, Section: gram3-comparative
270/506, 53.36%, Section: gram4-superlative
556/992, 56.05%, Section: gram5-present-participle
1293/1371, 94.31%, Section: gram6-nationality-adjective
490/1332, 36.79%, Section: gram7-past-tense
888/992, 89.52%, Section: gram8-plural
365/650, 56.15%, Section: gram9-plural-verbs
6829/12217, 55.90%, Section: total

Semantic: 1387/4052, Accuracy: 34.23%
Syntactic: 5442/8165, Accuracy: 66.65%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

153/506, 30.24%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
27/268, 10.07%, Section: currency
172/1571, 10.95%, Section: city-in-state
218/306, 71.24%, Section: family
88/756, 11.64%, Section: gram1-adjective-to-adverb
45/306, 14.71%, Section: gram2-opposite
716/1260, 56.83%, Section: gram3-comparative
179/506, 35.38%, Section: gram4-superlative
325/992, 32.76%, Section: gram5-present-participle
702/1371, 51.20%, Section: gram6-nationality-adjective
343/1332, 25.75%, Section: gram7-past-tense
401/992, 40.42%, Section: gram8-plural
219/650, 33.69%, Section: gram9-plural-verbs
3836/12268, 31.27%, Section: total

Semantic: 818/4103, Accuracy: 19.94%
Syntactic: 3018/8165, Accuracy: 36.96%
```
On the larger corpus, fastText's advantage is even more pronounced, although word2vec's syntactic accuracy also improves noticeably. Overall, fastText produces better word embeddings than word2vec, especially where syntactic information matters.
Hyperparameters
Gensim word2vec and fastText were run with similar hyperparameters: dim_size = 100, window_size = 5, num_epochs = 5. Despite the many surface similarities, though, the two models are quite different.
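For reference, this is how those settings map onto the pre-1.0 gensim constructor used above (size, window and iter are the old parameter names; gensim 4.x renamed them to vector_size, window and epochs):

```python
# Explicit hyperparameters for the gensim runs (old-style parameter names)
brown_gs = Word2Vec(brown.sents(), size=100, window=5, iter=5)
```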
Reference
[1] Bojanowski, Grave, Joulin, Mikolov. Enriching Word Vectors with Subword Information. 2016.
[2] Mikolov, Chen, Corrado, Dean. Efficient Estimation of Word Representations in Vector Space. 2013.