How to import nltk corpus in HDFS when I use hadoop streaming


Question


 I have a small problem: I want to use an NLTK corpus in HDFS, but it fails. For example, I want to load the NLTK stopwords in my Python code.
 I am following this guide: http://eigenjoy.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

I did everything it says, but I don't know how to adapt it to my job. My NLTK package name is nltk-2.0.1.rc1 and my PyYAML package name is PyYAML.3.0.1, so my command is:

zip -r nltkandyaml.zip nltk-2.0.1.rc1 PyYAML.3.0.1

Then it says to run "mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod".

My mapper.py is saved in /home/mapreduce/mapper.py, so my command is:

mv nltkandyaml.zip /home/mapreduce/nltkandyaml.mod

Is that right?

Then I zip my stopwords corpus:

zip -r /nltk_data/corpora/stopwords-flat.zip *

In my code I use:

importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('PyYAML-3.09')
nltk = importer.load_module('nltk-2.1.0.1rc1')
from nltk.corpus.reader import stopwords
from nltk.corpus.reader import StopWordsCorpusReader
nltk.data.path+=["."]
stopwords = StopWordsCorpusReader(nltk.data.find('lib/stopwords-flat.zip'))

Finally, I run the command:

bin/hadoop jar /home/../streaming/hadoop-0.21.0-streaming.jar \
  -input /user/root/input/voa.txt \
  -output /user/root/output \
  -mapper /home/../mapper.py \
  -reducer /home/../reducer.py \
  -file /home/../nltkandyaml.mod \
  -file /home/../stopwords-flat.zip

Please tell me where I'm wrong.

Thank you all.


Answer 1:


I'm not entirely clear what your problem / error is, but if you want the contents of stopwords-flat.zip to be available in the current working directory at runtime, use the -archives flag rather than -files (which could be your problem, since you're using -file).

Hadoop will unpack the named archive file (zip), and the contents will be available as if they were in the local directory of your running mapper:

bin/hadoop jar /home/../streaming/hadoop-0.21.0-streaming.jar \
  -input /user/root/input/voa.txt \
  -output /user/root/output \
  -mapper /home/../mapper.py \
  -reducer /home/../reducer.py \
  -files /home/../nltkandyaml.mod \
  -archives /home/../stopwords-flat.zip
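
For reference, here is a minimal mapper sketch (my illustration, not part of the original answer). It assumes -archives has unpacked stopwords-flat.zip so that the flat word-list files (e.g. a file named "english", as NLTK ships its stopwords corpus) are readable from the task's working directory as described above:

#!/usr/bin/env python
# Minimal mapper sketch (illustration only, not from the original answer).
import sys

# "english" is a guessed file name from the unpacked stopwords archive;
# adjust the path if Hadoop places the contents under a stopwords-flat.zip/ directory.
with open('english') as f:
    stop = set(word.strip() for word in f)

for line in sys.stdin:
    for token in line.strip().lower().split():
        if token and token not in stop:
            # emit standard streaming key/value pairs: token<TAB>1
            print('%s\t%d' % (token, 1))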



Answer 2:


    zip -r nltk.zip [your-nltk-package-name]/nltk

    zip -r yaml.zip [your-yaml-package-name]/lib/yaml

Then in your script, add:

    importer = zipimport.zipimporter('nltk.zip')
    importer2=zipimport.zipimporter('yaml.zip')
    yaml = importer2.load_module('yaml')
    nltk = importer.load_module('nltk')

And in your command, add:

    -file [path-to-your-zip-file]
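
Putting it together, a minimal mapper sketch (my illustration, not from this answer) might look like the following; it assumes the two zips contain the top-level nltk and yaml packages as described above, and uses a corpus-free helper (FreqDist) so no nltk_data is needed:

    #!/usr/bin/env python
    # Sketch of a mapper that loads nltk and yaml from zips shipped with -file.
    import sys
    import zipimport

    importer = zipimport.zipimporter('nltk.zip')
    importer2 = zipimport.zipimporter('yaml.zip')
    yaml = importer2.load_module('yaml')   # load yaml first; nltk 2.x imports it
    nltk = importer.load_module('nltk')

    for line in sys.stdin:
        # FreqDist needs no corpus data, so it works without nltk_data
        fdist = nltk.FreqDist(line.strip().lower().split())
        for word, count in fdist.items():
            print('%s\t%d' % (word, count))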


Source: https://stackoverflow.com/questions/10716302/how-to-import-nltk-corpus-in-hdfs-when-i-use-hadoop-streaming
