What are the most feasible options for processing the Google Books n-gram dataset with modest resources?
I need to calculate word co-occurrence statistics for about 10,000 target words and a few hundred context words per target word, from the Google Books n-gram corpus. Here is the link to the full dataset: Google Ngram Viewer.

As is evident, the dataset is approximately 2.2 TB and contains a few hundred billion rows. To compute the word co-occurrence statistics I need to process the whole data for every possible pair of target and context word. I am currently considering using Hadoop with Hive for this processing.
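To make the pair-counting step concrete, here is a minimal sketch of a Hadoop Streaming mapper in Python. It assumes the commonly documented tab-separated n-gram layout (`ngram<TAB>year<TAB>match_count<TAB>volume_count`); the file names `targets.txt` and `contexts.txt` are hypothetical word lists (one word per line) that would be shipped to the workers, e.g. via `-files`. This is not a definitive implementation, just an illustration that a single pass over the corpus suffices rather than one pass per word pair.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop Streaming mapper: emits (target, context) -> match_count
pairs from Google Books n-gram lines.

Assumptions (not from the original question):
  - input rows look like: ngram<TAB>year<TAB>match_count<TAB>volume_count
  - targets.txt / contexts.txt are hypothetical word lists shipped with the job
"""
import sys


def load_words(path):
    """Read one word per line into a set for O(1) membership tests."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


TARGETS = load_words("targets.txt")    # ~10,000 target words (assumption)
CONTEXTS = load_words("contexts.txt")  # a few hundred context words (assumption)

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        continue                        # skip malformed rows
    tokens = parts[0].split()           # the n-gram itself, space-separated
    try:
        count = int(parts[2])           # match_count column (assumed position)
    except ValueError:
        continue
    # Within one n-gram, emit every (target, context) pair it contains,
    # weighted by the n-gram's match count.
    for t in tokens:
        if t in TARGETS:
            for c in tokens:
                if c in CONTEXTS and c != t:
                    # Key/value pair for the reducer, which simply sums counts.
                    print(f"{t}\t{c}\t{count}")
```

The corresponding reducer (or a Hive `GROUP BY` over the mapper output) would only need to sum the counts per `(target, context)` key, so the 2.2 TB corpus is scanned once in total, not once per pair.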