How can I include a python package with Hadoop streaming job?

情深已故 2020-11-27 13:47

I am trying to include a Python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument "-file".

5 Answers
  • 2020-11-27 13:59

    An example of loading an external Python package (nltk):
    refer to the answer to
    Running external python lib like (NLTK) with hadoop streaming.
    I followed the approach below and ran the nltk package with Hadoop streaming successfully.

    Assumption: you already have your package (nltk in my case) installed on your system.

    first:

    zip -r nltk.zip nltk
    mv nltk.zip /place/it/anywhere/you/like/nltk.mod
    

    Why will any location work?
    Answer: because we pass the path to this zipped .mod file on the command line, we don't need to worry about where it lives.

    second:
    make the following changes in your mapper (.py) file

    # Hadoop streaming does not unzip the shipped file for you,
    # so import the package directly from the zip with zipimport
    import zipimport
    importer = zipimport.zipimporter('nltk.mod')
    nltk = importer.load_module('nltk')

    # now import whatever you like from nltk
    from nltk import tree
    from nltk import load_parser
    from nltk.corpus import stopwords
    nltk.data.path += ["."]
    
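    For completeness, a full streaming mapper built around that snippet might look like the sketch below; the word-count logic and the choice of wordpunct_tokenize are illustrative assumptions, not part of the original answer.

    import sys
    import zipimport

    # load nltk from the zipped .mod file shipped with -file (as above)
    importer = zipimport.zipimporter('nltk.mod')
    nltk = importer.load_module('nltk')
    from nltk.tokenize import wordpunct_tokenize  # regex-based, needs no downloaded corpus data

    # emit "token<TAB>1" for every token read from stdin
    for line in sys.stdin:
        for token in wordpunct_tokenize(line.strip()):
            print("%s\t%d" % (token.lower(), 1))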

    third: the command-line arguments to run the map-reduce job

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file /your/path/to/mapper/mapper.py \
    -mapper '/usr/local/bin/python3.4 mapper.py' \
    -file /your/path/to/reducer/reducer.py \
    -reducer '/usr/local/bin/python3.4 reducer.py' \
    -file /your/path/to/nltkzippedmodfile/nltk.mod \
    -input /your/path/to/HDFS/input/check.txt -output /your/path/to/HDFS/output/
    
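    The reducer.py referenced in that command is not shown in the original answer; a minimal sum-style reducer matching the word-count mapper sketched earlier could be (again only an illustrative assumption):

    import sys

    current_word, current_count = None, 0
    # Hadoop streaming delivers the mapper output sorted by key
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))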

    These steps solved my problem, and I think they should solve it for others as well.
    Cheers.

  • 2020-11-27 14:02

    If you are using more complex libs such as numpy or pandas, virtualenv is a better way. You can add -archives to ship the environment to the cluster.

    Refer to this write-up: https://henning.kropponline.de/2014/07/18/virtualenv-hadoop-streaming/

    Update:

    I tried the virtualenv approach above in our production environment and ran into some problems. On the cluster there were errors like "Could not find platform independent libraries". Then I tried conda to create the Python environment, and it worked well.

    If you read Chinese, see: https://blog.csdn.net/Jsin31/article/details/53495423

    If not, here is a brief translation:

    1. create an env by conda:

      conda create -n test python=2.7.12 numpy pandas

    2. Go to the conda env path. You can find it with:

      conda env list

      Then you can pack it:

      tar cf test.tar test

    3. Submit the job through Hadoop streaming:

      hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
      -archives test.tar \
      -input /user/testfiles \
      -output /user/result \
      -mapper "test.tar/test/bin/python mapper.py" \
      -file mapper.py \
      -reducer "test.tar/test/bin/python reducer.py" \
      -file reducer.py
    
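    With this approach the mapper needs no zipimport tricks, because the interpreter inside test.tar already has numpy and pandas installed; a minimal mapper.py might look like the following (the CSV parsing and the emitted fields are illustrative assumptions):

    import sys
    import pandas as pd  # resolved from the packed conda env, not the cluster's system Python

    # read the input split from stdin as CSV and emit one tab-separated line per row
    df = pd.read_csv(sys.stdin, header=None)
    for _, row in df.iterrows():
        print("\t".join(str(v) for v in row.values))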
  • 2020-11-27 14:03

    I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I've done this in the past with Perl but not Python.

    That said, I would think this would still work for you if you use Python's zipimport at http://docs.python.org/library/zipimport.html, which allows you to import modules directly from a zip.
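
    For example (the archive and package names here are hypothetical):

    import zipimport

    # 'mypackage.zip' is whatever archive was shipped with -file
    importer = zipimport.zipimporter('mypackage.zip')
    mypackage = importer.load_module('mypackage')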

  • 2020-11-27 14:10

    You can load the zipped libraries like this:

    import sys
    # a zip archive on sys.path is handled by Python's built-in zip import support
    sys.path.insert(0, 'nltkandyaml.mod')
    import nltk
    import yaml
    
  • 2020-11-27 14:14

    Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

    first, create a zip with the desired libraries:

    zip -r nltkandyaml.zip nltk yaml
    mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
    

    next, include it via the Hadoop streaming "-file" argument:

    hadoop -file nltkandyaml.mod
    

    finally, load the libraries via Python:

    import zipimport
    importer = zipimport.zipimporter('nltkandyaml.mod')
    yaml = importer.load_module('yaml')
    nltk = importer.load_module('nltk') 
    

    Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

    download and unzip the wordnet corpus

    cd wordnet
    zip -r ../wordnet-flat.zip *
    

    in python:

    from nltk.corpus.reader import WordNetCorpusReader  # nltk itself was loaded from the zip above
    wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))
    