Question
I am currently using uni-grams in my word2vec model as follows.
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words
    #
    # NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
However, this way I miss important bigrams and trigrams in my dataset.
E.g.,
"team work" -> I currently get "team", "work"
"New York" -> I currently get "New", "York"
Hence, I want to capture the important bigrams, trigrams, etc. in my dataset and feed them into my word2vec model.
I am new to word2vec and struggling with how to do this. Please help me.
Answer 1:
First of all, you should use gensim's Phrases class to get bigrams. It works as shown in the documentation:
>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']
To get trigrams and beyond, apply Phrases again to the output of the bigram model you already have, and so on. Example:
trigram_model = Phrases(bigram_sentences)
There is also a good notebook and a video that explain how to use this .... the notebook, the video
The most important part is how to apply it to real-life sentences, which goes as follows:
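from gensim.models import Phrases

# create the bigram model (unigram_sentences is your list of token lists)
bigram_model = Phrases(unigram_sentences)
# apply the trained model to each sentence and keep the results as token lists
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(bigram_model[unigram_sentence])
# get a trigram model out of the bigram sentences
trigram_model = Phrases(bigram_sentences)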
Hope this helps, but next time please give us more information about what you are using, etc.
P.S.: Now that you have edited the question: your code does nothing to produce bigrams, it only splits the text. You have to use Phrases to get words like "New York" as bigrams.
Answer 2:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes",
             "new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)
# delimiter=b' ' is the gensim 3.x form; newer releases (4.x) expect a str, i.e. delimiter=' '
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
print(bigram_phraser)
for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)
Answer 3:
Phrases and Phraser are what you are looking for.
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=10) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
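To see why a higher threshold gives fewer phrases, it may help to look at the score itself. Below is a pure-Python sketch of what I understand to be the default scoring formula (the one from the original word2vec phrase-detection work): (bigram_count - min_count) / (count_a * count_b) * vocab_size. The helper names phrase_scores and accepted_phrases are hypothetical, and counting both unigrams and bigrams in vocab_size is my assumption about how gensim measures it, so treat the exact numbers as illustrative.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1):
    """Score each adjacent word pair; a hypothetical helper, not a gensim API."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    # Assumption: the vocabulary size counts unigram and bigram entries together.
    vocab_size = len(unigrams) + len(bigrams)
    return {
        (a, b): (n - min_count) / unigrams[a] / unigrams[b] * vocab_size
        for (a, b), n in bigrams.items()
    }

def accepted_phrases(scores, threshold):
    """Keep only the pairs whose score is strictly above the threshold."""
    return {pair for pair, score in scores.items() if score > threshold}

documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes",
             "new york mayor was present"]
sentences = [doc.split(" ") for doc in documents]

scores = phrase_scores(sentences, min_count=1)
print(scores[("new", "york")])          # pair seen twice
print(scores[("the", "mayor")])         # pair seen once
print(accepted_phrases(scores, 2))      # low threshold
print(accepted_phrases(scores, 10))     # high threshold
```

A pair that occurs only min_count times scores zero, so min_count acts as a hard floor on frequency, while threshold decides how strongly the two words must prefer each other's company.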
Once you are done adding vocabulary, use Phraser for faster access and more efficient memory usage. It is not mandatory, but useful.
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
Thanks,
Source: https://stackoverflow.com/questions/46129335/get-bigrams-and-trigrams-in-word2vec-gensim