Question
I have some code that runs gensim's WikiCorpus on an Arabic Wikipedia dump, and I know the process takes a long time to execute. I also searched for the warning I get when running it, which says:
(UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial"))
The answers I found said that it's OK, nothing serious, just a warning. But after waiting about 3 days without any response, I started wondering whether it really works on the Arabic dump file, or whether I have to do some kind of pre-processing before passing the Arabic dump file to the WikiCorpus object. The data size is about 989.6 MB. I surrounded the WikiCorpus line with two print statements to know when it started and when it finished executing, like this:
print('start WikiCorpus')
wiki = WikiCorpus(self.in_f)
print('finish WikiCorpus')
Here, self.in_f is the path to the Arabic Wikipedia dump, like this: (/the path where the file located/arwiki-20200201-pages-articles.xml.bz2). The second print statement was never reached during the run.
Answer 1:
It should work, especially if Arabic has clear word-delimiters (like spaces between words).
However, lots of things are harder on Windows, given that gensim and most related Python data-science libraries get more development, testing, and use elsewhere, and there are some Windows-specific oddities with multiprocessing. If you have the option of working on another OS, that can make things easier.
There was another recent question describing a similar problem with an en dump and WikiCorpus; there are ideas of things to check in my answer there, though it's unclear if the asker ever resolved the problem.
Also, when using code that relies on Python multiprocessing in Windows, it may be especially necessary to set your code off in a 'main' block that won't be re-run if your file is re-imported by other processes, and to call the Windows-specific freeze_support() function. See some recent discussion of a related matter on the gensim project list.
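The 'main' block and freeze_support() advice can be sketched roughly as follows. This is only an illustration under some assumptions, not the asker's actual code: the build_corpus helper and its corpus_factory parameter are hypothetical names introduced here, the dump path is taken from the command line, and the logging setup assumes gensim reports its progress through the standard logging module at INFO level.

```python
import logging
import sys
from multiprocessing import freeze_support


def build_corpus(dump_path, corpus_factory=None):
    """Build the corpus object for dump_path.

    corpus_factory is a hypothetical hook added for this sketch; when it
    is omitted, gensim's WikiCorpus is used.
    """
    if corpus_factory is None:
        # Imported here so that worker processes which re-import this
        # module do not repeat any heavy work at import time.
        from gensim.corpora import WikiCorpus
        corpus_factory = WikiCorpus

    print('start WikiCorpus')
    wiki = corpus_factory(dump_path)
    print('finish WikiCorpus')
    return wiki


if __name__ == '__main__':
    # Everything below runs only when the file is executed directly,
    # not when Windows child processes re-import it.
    freeze_support()
    # Assumption: INFO-level logging makes progress messages visible,
    # which helps distinguish "slow" from "stuck".
    logging.basicConfig(
        format='%(asctime)s : %(levelname)s : %(message)s',
        level=logging.INFO,
    )
    if len(sys.argv) == 2:
        build_corpus(sys.argv[1])
    else:
        print('usage: python build_wiki.py <path-to-dump.xml.bz2>')
```

The script name build_wiki.py in the usage message is a placeholder. The key point is that nothing expensive happens at module level, so a re-import by a child process is harmless.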
Source: https://stackoverflow.com/questions/60451614/does-wikicorpus-from-gensim-library-works-on-arabic-wikipedia-dump