I have an RDD in which each record consists of a 5-word n-gram, its count, the number of pages, and the number of documents, in the form
(ngram)\t(count)\t(page_count)\t(book
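To make the record layout concrete, here is a minimal sketch of how one such line could be split into fields. It assumes that each record is a single tab-separated string, that the n-gram's words are separated by single spaces, and that the fourth field is an integer document count. The names `NgramRecord`, `parse_record`, and `doc_count` are hypothetical and only for illustration.

```python
from typing import NamedTuple, Tuple


class NgramRecord(NamedTuple):
    """One parsed record (field names are illustrative assumptions)."""
    ngram: Tuple[str, ...]  # the words of the n-gram
    count: int              # total occurrences
    page_count: int         # number of pages
    doc_count: int          # number of documents (assumed fourth field)


def parse_record(line: str) -> NgramRecord:
    """Split one tab-separated line into typed fields."""
    # Drop a trailing newline, then split on the tab delimiter.
    ngram, count, page_count, doc_count = line.rstrip("\n").split("\t")
    # Assumes the words of the n-gram are separated by single spaces.
    words = tuple(ngram.split(" "))
    return NgramRecord(words, int(count), int(page_count), int(doc_count))


# With an RDD of raw lines (here called `lines`, a hypothetical name),
# the same function could be applied per record: lines.map(parse_record)
```

Parsing each line into a named tuple up front keeps later steps readable, since fields are accessed by name rather than by position.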