Error in extracting phrases using Gensim

问题

I am trying to get the bigrams in the sentences using Phrases in Gensim as follows.

from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = ["the mayor of new york was there", "machine learning can be useful sometimes","new york mayor was present"]

sentence_stream = [doc.split(" ") for doc in documents]
#print(sentence_stream)
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)

Even though it catches "new", "york" as "new york", it does not catch "machine", learning as "machine learning"

However, in the example shown in Gensim Website they were able to catch the words "machine", "learning" as "machine learning".

Please let me know how to get "machine learning" as a bigram in the above example

回答1:

The technique used by gensim Phrases is purely based on statistics of co-occurrences: how often words appear together, versus alone, in a formula also affected by min_count and compared against the threshold value.

It is only because your training set has 'new' and 'york' occur alongside each other twice, while other words (like 'machine' and 'learning') only occur alongside each other once, that 'new_york' becomes a bigram, and other pairings do not. What's more, even if you did find a combination of min_count and threshold that would promote 'machine_learning' to a bigram, it would also pair together every other bigram-that-appears-once – which is probably not what you want.

Really, to get good results from these statistical techniques, you need lots of varied, realistic data. (Toy-sized examples may superficially succeed, or fail, for superficial toy-sized reasons.)

Even then, they will tend to miss combinations a person would consider reasonable, and make combinations a person wouldn't. Why? Because our minds have much more sophisticated ways (including grammar and real-world knowledge) for deciding when clumps of words represent a single concept.

So even with more better data, be prepared for nonsensical n-grams. Tune or judge the model on whether it is overall improving on your goal, not any single point or ad-hoc check of matching your own sensibility.

(Regarding the referenced gensim documentation comment, I'm pretty sure that if you try Phrases on just the two sentences listed there, it won't find any of the desired phrases – not 'new_york' or 'machine_learning'. As a figurative example, the ellipses ... imply the training set is larger, and the results indicate that the extra unshown texts are important. It's just because of the 3rd sentence you've added to your code that 'new_york' is detected. If you added similar examples to make 'machine_learning' look more like a statistically-outlying pairing, your code could promote 'machine_learning', too.)