How to improve performance when working with Wikipedia data and a huge number of webpages?
Question: I am supposed to extract representative terms from an organisation's website using Wikipedia's article-link data dump. To achieve this I've:

- Crawled and downloaded the organisation's webpages (~110,000).
- Created a dictionary of Wikipedia IDs and terms/titles (~40 million records).

Now I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs and frequencies. For the dictionary to fit in memory, I've split it into smaller files (a rough sketch of this lookup step is given below). Based
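A minimal sketch of the shard-by-shard lookup described above, assuming each dictionary shard is a plain tab-separated file of `wikipedia_id<TAB>term` and each crawled page is a plain-text `.txt` file (the file layout, names, and single-word matching are assumptions for illustration, not details from the original post):

```python
import re
from collections import Counter
from pathlib import Path

def load_shard(shard_path):
    """Load one dictionary shard: lines of 'wikipedia_id<TAB>term' (assumed layout)."""
    term_to_id = {}
    with open(shard_path, encoding="utf-8") as f:
        for line in f:
            wiki_id, _, term = line.rstrip("\n").partition("\t")
            if term:
                term_to_id[term.lower()] = wiki_id
    return term_to_id

def count_terms(page_text, term_to_id):
    """Tokenise the page and count occurrences of dictionary terms (single-token match only)."""
    counts = Counter()
    for token in re.findall(r"[A-Za-z0-9']+", page_text.lower()):
        wiki_id = term_to_id.get(token)
        if wiki_id is not None:
            counts[wiki_id] += 1
    return counts

def process_pages(pages_dir, shard_paths):
    """For each shard, scan every downloaded page and accumulate term-ID frequencies."""
    totals = Counter()
    pages = list(Path(pages_dir).glob("*.txt"))  # assumed: one plain-text file per crawled page
    for shard_path in shard_paths:
        term_to_id = load_shard(shard_path)   # only one shard is held in memory at a time
        for page in pages:
            text = page.read_text(encoding="utf-8", errors="ignore")
            totals.update(count_terms(text, term_to_id))
    return totals
```

Keeping the shard loop on the outside means each shard is loaded once rather than once per page; note that multi-word Wikipedia titles would need n-gram matching instead of the single-token lookup shown here.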