> To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than data preparation I chose to u[…]
You are getting a low perplexity because you are using a pentagram (5-gram) model. If you used a bigram model instead, your results would fall in a more typical range of about 50-1000 (or about 5 to 10 bits).
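To see why those two ways of quoting the number line up, recall that perplexity is just 2 raised to the per-word cross-entropy in bits. A minimal sketch of that conversion (plain Python, using only the figures quoted above):

```python
import math

def perplexity_from_bits(cross_entropy_bits):
    """Perplexity is 2 raised to the per-word cross-entropy in bits."""
    return 2 ** cross_entropy_bits

def bits_from_perplexity(perplexity):
    """Inverse: per-word cross-entropy in bits for a given perplexity."""
    return math.log2(perplexity)

print(perplexity_from_bits(5))     # 32
print(perplexity_from_bits(10))    # 1024
print(bits_from_perplexity(50))    # ~5.6 bits
print(bits_from_perplexity(1000))  # ~10 bits
```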
Given your comments, are you using NLTK-3.0alpha? You shouldn't, at least not for language modeling:
https://github.com/nltk/nltk/issues?labels=model
As a matter of fact, the whole `model` module has been dropped from the NLTK-3.0a4 pre-release until those issues are fixed.
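If you still want to experiment while that module is in flux, later NLTK releases ship a reworked `nltk.lm` package. A minimal sketch along those lines, where the training sentences, the bigram order and the test n-grams are purely illustrative:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy training data: pre-tokenized sentences (illustrative only).
train_sents = [["the", "cat", "sat"], ["the", "dog", "barked"]]
n = 2  # bigram model, as suggested above

# Pad sentences, build n-grams up to order n, and collect the vocabulary.
train_ngrams, vocab = padded_everygram_pipeline(n, train_sents)

# Fit a plain maximum-likelihood n-gram model.
lm = MLE(n)
lm.fit(train_ngrams, vocab)

# Perplexity of a handful of held-out bigrams.
test_bigrams = [("the", "cat"), ("cat", "sat")]
print(lm.perplexity(test_bigrams))
```

Note that an unsmoothed `MLE` model gives infinite perplexity on any unseen n-gram, so for a real comparison you would swap in one of the smoothed models from the same package.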