Does WikiCorpus from the gensim library work on an Arabic Wikipedia dump?

Submitted by 丶灬走出姿态 on 2021-02-11 14:45:22

Question


I have some code that uses WikiCorpus on an Arabic Wikipedia dump, and I know that the process takes a long time to run. I also searched for the warning that I get when executing it, which says:

(UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial"))

The answers said that it's OK, nothing serious, just a warning. But after waiting about 3 days without any response, I started wondering whether it truly works on the Arabic dump file, or whether I have to do some kind of pre-processing before passing the Arabic dump file to the WikiCorpus object. The data size is about 989.6 MB. I surrounded the WikiCorpus line with two print calls, to know when it started and when it finished executing, like this:

print('start WikiCorpus')
wiki = WikiCorpus(self.in_f)
print('finish WikiCorpus')

where self.in_f is the path to the Arabic Wikipedia dump, like this: (/the path where the file located/arwiki-20200201-pages-articles.xml.bz2). The second print call was never reached during the run.


Answer 1:


It should work, especially if Arabic has clear word-delimiters (like spaces between words).
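
As a rough illustration of why space-delimited Arabic text should not need special handling, here is a minimal sketch that splits a sentence with a Unicode-aware word pattern from Python's standard re module. It is a stand-in for illustration only, not gensim's own tokenizer; the tokenize_words name and the sample sentence are made up for this example.

```python
import re

# Runs of Unicode word characters that are not digits.
# Assumption: this only approximates what a word-level tokenizer does;
# it is not taken from gensim's source.
WORD_PATTERN = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)


def tokenize_words(text):
    """Return the word-like tokens found in `text`, in order."""
    return [match.group() for match in WORD_PATTERN.finditer(text)]


# Hypothetical sample sentence: three Arabic words separated by spaces.
sample = "اللغة العربية جميلة"
tokens = tokenize_words(sample)
print(len(tokens), tokens)
```

If the tokens come out as whole words, as they should for plain space-separated text, tokenization itself is unlikely to be what stalls the run.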

However, lots of things are harder on Windows, given that gensim & most related Python data-science libraries get more development/testing/use elsewhere, & there are some Windows-specific oddities with multiprocessing. If you have the option of working on another OS, that can make things easier.

There was another recent question describing a similar problem with an en dump & WikiCorpus – there are ideas of things to check in my answer there, though it's unclear if the asker ever resolved the problem.

Also, when using code that relies on Python multiprocessing in Windows, it may be especially necessary to set your code off in a 'main' block that won't be re-run if your file is re-imported by other processes, and to call the Windows-specific freeze_support() function. See some recent discussion of a related matter on the gensim project list.
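
A minimal sketch of that layout, under some assumptions: DUMP_PATH, build_corpus and main are hypothetical names, and the file name stands in for wherever your dump actually lives. The logging setup is my own addition rather than part of the advice above. As far as I know, the WikiCorpus constructor scans the whole dump to build its dictionary when none is passed in, so INFO-level logging should show whether it is making progress or is truly stuck.

```python
import logging
import os
from multiprocessing import freeze_support

# Hypothetical location; replace with the real path to your dump file.
DUMP_PATH = "arwiki-20200201-pages-articles.xml.bz2"


def build_corpus(dump_path):
    """Create the WikiCorpus; this is the slow step from the question."""
    # Imported here so the sketch can be loaded even where gensim is absent;
    # a top-level import works just as well in real code.
    from gensim.corpora import WikiCorpus

    print("start WikiCorpus")
    wiki = WikiCorpus(dump_path)
    print("finish WikiCorpus")
    return wiki


def main(dump_path=DUMP_PATH):
    # Assumption: gensim reports progress through the standard logging module.
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )
    if not os.path.exists(dump_path):
        print(f"dump not found: {dump_path}")
        return None
    return build_corpus(dump_path)


if __name__ == "__main__":
    # Everything that starts work lives under this guard, so worker processes
    # that re-import the file on Windows do not start the job again.
    freeze_support()  # only matters for frozen Windows executables
    main()
```

With this layout, importing the file has no side effects; the corpus is only built when the script is run directly.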



Source: https://stackoverflow.com/questions/60451614/does-wikicorpus-from-gensim-library-works-on-arabic-wikipedia-dump
