How do I prepare to use entire wikipedia for natural language processing?

夕颜 2021-01-25 02:12

I am a bit new here. I have a project where I have to download and use Wikipedia for NLP. The problem I am facing is that I only have 12 GB of RAM, but the English Wikipedia dump is far larger than that. How should I prepare to process it?

2 Answers
  • 2021-01-25 02:20

    If you want to process the XML dumps directly, you can download the multistream version.

    The multistream format ships with an index, so you can decompress individual sections as needed instead of the entire archive. This lets you pull single articles out of the compressed dump.

    For documentation, see https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps. Using the index, you can extract any given article from the dump without loading the whole dump into memory.

    If you want to parse all of Wikipedia, you can process one multistream chunk (~100 articles) at a time, which should fit comfortably within your 12 GB of RAM. A worked example is at https://jamesthorne.co.uk/blog/processing-wikipedia-in-a-couple-of-hours/, and a minimal sketch of pulling one chunk out of the compressed dump is shown below.
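
    The sketch below assumes the standard pair of files from dumps.wikimedia.org; the file names and the example article title are placeholders, so adjust them to the dump you actually downloaded. Each index line has the form byte_offset:page_id:title; you look up the byte offset of the stream that contains your title, read from that offset up to the next stream's offset, and decompress just that slice with bz2.

    import bz2

    # Placeholder file names -- use the multistream dump + index you downloaded.
    DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
    INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"

    def stream_range(title):
        """Return (start, end) byte offsets of the bz2 stream holding `title`."""
        offsets, target = [], None
        with bz2.open(INDEX, "rt", encoding="utf-8") as index:
            for line in index:
                # Index lines look like "byte_offset:page_id:title".
                offset, _, page_title = line.rstrip("\n").split(":", 2)
                offset = int(offset)
                if not offsets or offsets[-1] != offset:
                    offsets.append(offset)
                if page_title == title:
                    target = offset
        if target is None:
            raise KeyError(title)
        later = [o for o in offsets if o > target]
        return target, (later[0] if later else None)

    def read_stream(start, end):
        """Decompress one ~100-article stream and return it as XML text."""
        with open(DUMP, "rb") as dump:
            dump.seek(start)
            data = dump.read(end - start) if end else dump.read()
        return bz2.decompress(data).decode("utf-8")

    start, end = stream_range("Anarchism")   # example title
    print(read_stream(start, end)[:500])     # a fragment with ~100 <page> elements

    Note that a chunk taken from the middle of the dump is an XML fragment without the <mediawiki> root element, so wrap it in a dummy root before handing it to an XML parser.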

  • 2021-01-25 02:44

    The easiest Wikipedia dump to process is the ZIM dump from kiwix.org, which you can find at: https://wiki.kiwix.org/wiki/Content_in_all_languages

    Then, using Python, you can do the following:

    % wget http://download.kiwix.org/zim/wiktionary_eo_all_nopic.zim
    % pip install --user libzim

    from libzim.reader import File

    # Sum the size of every article's content by iterating over numeric ids.
    total = 0
    with File("wiktionary_eo_all_nopic.zim") as reader:
        for uid in range(0, reader.article_count):
            page = reader.get_article_by_id(uid)
            total += len(page.content)
    print(total)


    This is simplistic processing, but it should be enough to get you started. In particular, as of 2020, the raw Wikipedia dumps in wiki markup are very difficult to process, in the sense that you cannot convert the wiki markup to HTML (including infoboxes) without a full MediaWiki setup. There is also the REST API, but why struggle when the work is already done :)

    Regarding where to store the data AFTER processing, I think the industry standard is PostgreSQL or Elasticsearch (which also requires a lot of memory), but I really like hoply and, more generally, OKVS; a sketch of the PostgreSQL option follows below.
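
    Here is a minimal sketch of the PostgreSQL option using psycopg2; the connection string, table and column names below are just placeholders, adjust them to your own setup.

    import psycopg2

    # Placeholder connection string and schema -- adjust to your own setup.
    conn = psycopg2.connect("dbname=wiki user=postgres")
    with conn, conn.cursor() as cur:
        # One row per processed article, keyed by title.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS articles (
                title TEXT PRIMARY KEY,
                body  TEXT NOT NULL
            )
        """)
        cur.execute(
            "INSERT INTO articles (title, body) VALUES (%s, %s) "
            "ON CONFLICT (title) DO UPDATE SET body = EXCLUDED.body",
            ("Example article", "processed plain text ..."),
        )
    conn.close()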
