What are the most feasible options to do processing on google books n-gram dataset using modest resources?

Submitted by 假装没事ソ on 2019-12-24 20:23:29

Question


I need to calculate word co-occurrence statistics for some 10,000 target words and a few hundred context words per target word, using the Google Books n-gram corpus.

Below is the link to the full dataset:

Google Ngram Viewer

As is evident, the dataset is approximately 2.2 TB and contains a few hundred billion rows. To compute word co-occurrence statistics I need to process the whole dataset for every possible pair of target and context words. I am currently considering using Hadoop with Hive for batch processing of the data. What are the other viable options, given that this is an academic project with the time constraint of a semester and limited computational resources?

Note that real-time querying of the data is not required.
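For concreteness, below is a minimal HiveQL sketch of the kind of batch job I have in mind. The table names (ngrams_5, targets, contexts), the HDFS path, and the assumed column layout of the downloaded shards are placeholders to be checked against the actual files, not a tested pipeline:

```sql
-- Hypothetical external table over downloaded 5-gram shards; verify the
-- column layout against the files you fetch (the 2012 shards are
-- tab-separated: ngram, year, match_count, volume_count).
CREATE EXTERNAL TABLE ngrams_5 (
  gram         STRING,   -- the five tokens, space-separated
  yr           INT,
  match_count  BIGINT,
  volume_count BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/googlebooks/5grams/';

-- Co-occurrence counts: expand each 5-gram into its ordered token pairs,
-- keep only (target, context) pairs via equi-joins against two small
-- lookup tables (targets.word, contexts.word), and sum the match counts.
SELECT p.tgt, p.ctx, SUM(p.match_count) AS cooc
FROM (
  SELECT tv.tok_t AS tgt, cv.tok_c AS ctx, n.match_count
  FROM ngrams_5 n
  LATERAL VIEW explode(split(n.gram, ' ')) tv AS tok_t
  LATERAL VIEW explode(split(n.gram, ' ')) cv AS tok_c
  WHERE tv.tok_t <> cv.tok_c
) p
JOIN targets  t ON (t.word = p.tgt)
JOIN contexts c ON (c.word = p.ctx)
GROUP BY p.tgt, p.ctx;
```

Since expanding each 5-gram into token pairs multiplies the row count, the two small lookup tables would presumably be broadcast as map-side joins and the shards processed shard by shard.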


Answer 1:


Hive has built-in UDFs, ngrams() and context_ngrams(), for handling n-grams: https://cwiki.apache.org/Hive/statisticsanddatamining.html#StatisticsAndDataMining-ngrams%2528%2529andcontextngrams%2528%2529%253ANgramfrequencyestimation
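A minimal sketch of those UDFs, assuming a hypothetical table raw_text with a STRING column line holding plain sentences; note that they estimate frequent n-grams from tokenized raw text, so how well they map onto the pre-aggregated Google shards is worth checking:

```sql
-- Estimate the 100 most frequent bigrams in the (hypothetical) raw_text table.
-- sentences() tokenizes text into array<array<string>>, the input ngrams()
-- expects; the last argument is a precision factor for the estimate, and the
-- result is an array of <ngram, estimated frequency> structs.
SELECT ngrams(sentences(lower(line)), 2, 100, 1000) FROM raw_text;

-- context_ngrams() takes a pattern in which NULL acts as a wildcard: this
-- estimates the 100 words that most frequently follow the word "machine".
SELECT context_ngrams(sentences(lower(line)), array("machine", null), 100, 1000) FROM raw_text;
```

Because the Google shards already carry a match count per n-gram, a plain join-and-SUM over match_count (as in the sketch in the question) may be the simpler route for that particular dataset.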



Source: https://stackoverflow.com/questions/15249489/what-are-the-most-feasible-options-to-do-processing-on-google-books-n-gram-datas
