I am trying to train a word2vec model on very short phrases (5 grams). Since each sentence or example is very short, I believe the window size I can use can atmost be 2. I a
Very low scores on the analogy-questions are more likely due to limitations in the amount or quality of your training data, rather than mistuned parameters. (If your training phrases are really only 5 words each, they may not capture the same rich relations as can be discovered from datasets with full sentences.)
You could use a window of 5 on your phrases – the training code trims the window to what's available on either side – but then every word of each phrase affects all of the other words. That might be OK: one of the Google word2vec papers ("Distributed Representations of Words and Phrases and their Compositionality") mentions that to get the best accuracy on one of their phrase tasks, they used "the entire sentence for the context". (On the other hand, on one English corpus of short messages, I found a window size of just 2 created the vectors that scored best on the analogies-evaluation, so larger isn't necessarily better.)
A paper by Levy & Goldberg, "Dependency-Based Word Embeddings", speaks a bit about the qualitative effect of window-size:
https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf
They find:
Larger windows tend to capture more topic/domain information: what other words (of any type) are used in related discussions? Smaller windows tend to capture more about word itself: what other words are functionally similar? (Their own extension, the dependency-based embeddings, seems best at finding most-similar words, synonyms or obvious-alternatives that could drop-in as replacements of the origin word.)
To your question: "I am trying to understand what the implications of such a small window size are on the quality of the learned model".
For example "stackoverflow great website for programmers" with 5 words (suppose we save the stop words great and for here) if the window size is 2 then the vector of word "stackoverflow" is directly affected by the word "great" and "website", if the window size is 5 "stackoverflow" can be directly affected by two more words "for" and "programmers". The 'affected' here means it will pull the vector of two words closer.
So it depends on the material you are using for training, if the window size of 2 can capture the context of a word, but 5 is chosen, it will decrease the quality of the learnt model, and vise versa.