How do I prepare to use the entire Wikipedia for natural language processing?

Backend · Open · 2 answers · 554 views
夕颜 2021-01-25 02:12

I am a bit new here. I have a project where I have to download and use Wikipedia for NLP. The question I am facing is this: I have only 12 GB of RAM, but the English Wikipedia dump is much larger than that.

2 Answers
  •  生来不讨喜
    2021-01-25 02:20

    If you want to process the XML dumps directly, you can download the multistream version.

    Multistream lets you use an index to decompress individual sections as needed, without having to decompress the entire dump.

    This allows you to pull articles out of a compressed dump.

    For docs, see https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps. Using this information, you can get any given article out of the dump without needing to load the whole thing into memory.

    If you want to parse all of Wikipedia, you can process one multistream block (~100 articles) at a time, which should fit comfortably within your resources. An example of how to do this is shown at https://jamesthorne.co.uk/blog/processing-wikipedia-in-a-couple-of-hours/.
