Question
I need to calculate word co-occurrence statistics for some 10,000 target words and a few hundred context words per target word, using the Google Books n-gram corpus.
Below is the link to the full dataset:
Google Ngram Viewer
As is evident, the dataset is approximately 2.2 TB and contains a few hundred billion rows. To compute the co-occurrence statistics I need to process the whole dataset for every possible pair of target and context words. I am currently considering Hadoop with Hive for batch processing. What other viable options are there, given that this is an academic project with the time constraint of one semester and limited computational resources?
Note that real-time querying of the data is not required.
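To make the computation concrete, here is a minimal sketch of the kind of batch query I have in mind, assuming the Google Books 2-grams are loaded into a Hive table `bigrams(ngram STRING, year INT, match_count BIGINT, volume_count BIGINT)` and the word lists into `targets(word STRING)` and `contexts(word STRING)`; all table names and the exact column layout are my own assumptions, not something fixed by the dataset:

```sql
-- Sketch only: count how often each (target, context) pair occurs as an
-- adjacent bigram, summed over all years.
CREATE TABLE cooccurrence AS
SELECT
    w.target_word,
    w.context_word,
    SUM(w.match_count) AS pair_count
FROM (
    -- Split the 2-gram "word1 word2" into its two component words.
    SELECT
        split(b.ngram, ' ')[0] AS target_word,
        split(b.ngram, ' ')[1] AS context_word,
        b.match_count          AS match_count
    FROM bigrams b
) w
JOIN targets  t ON (w.target_word  = t.word)
JOIN contexts c ON (w.context_word = c.word)
GROUP BY w.target_word, w.context_word;
```

Since the target and context word lists are tiny compared to the n-gram table, I would expect the joins to run as map-side joins, so the job is essentially one full scan plus an aggregation.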
Answer 1:
Hive has built-in UDFs for handling n-grams: https://cwiki.apache.org/Hive/statisticsanddatamining.html#StatisticsAndDataMining-ngrams%2528%2529andcontextngrams%2528%2529%253ANgramfrequencyestimation
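For instance, the `context_ngrams()` aggregate described on that page estimates which words most frequently fill a wildcard position next to a given word, working directly from raw sentences. A rough sketch (the `corpus` table, its `text` column, and `'target_word'` are placeholders, not names from the question):

```sql
-- Sketch: for one target word, estimate the 300 words that most often
-- follow it.  sentences() tokenizes raw text into arrays of words, and the
-- NULL in the context array marks the position whose frequencies we want.
SELECT context_ngrams(sentences(lower(text)),
                      array('target_word', NULL),
                      300, 1000)
FROM corpus;
```

Note that this operates on raw text rather than on the pre-aggregated Google n-gram counts, so for the published count files a plain join-and-sum over the n-gram table (as in the question's setup) may be the more natural fit.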
Source: https://stackoverflow.com/questions/15249489/what-are-the-most-feasible-options-to-do-processing-on-google-books-n-gram-datas