How to bootstrap installation of Python modules on Amazon EMR?

Asked 2020-12-01 07:11 by 孤独总比滥情好

I want to do something really basic: simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the simplest way to make that package available on the cluster?

4 Answers
  • 2020-12-01 07:51

    Depending on whether you are using Python 2 (the default on EMR) or Python 3, the pip install command differs. As recommended in noli's answer, create a shell script, upload it to a bucket in S3, and use it as a bootstrap action.

    For Python 2 (in Jupyter: used as default for pyspark kernel):

    #!/bin/bash -xe
    sudo pip install your_package
    

    For Python 3 (in Jupyter: used as default for Python 3 and pyspark3 kernel):

    #!/bin/bash -xe
    sudo pip-3.4 install your_package
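As a rough sketch, the right pip invocation can also be picked at runtime from the interpreter version. The `pip-3.4` binary name here follows the older EMR AMIs mentioned above (an assumption; newer releases ship plain `pip3`):

```python
import sys

def pip_command(minor_version=4):
    """Return the pip invocation matching the running interpreter.

    The 'pip-3.x' binary name mirrors older EMR AMIs (an assumption;
    newer EMR releases provide a plain pip3 instead).
    """
    if sys.version_info[0] == 2:
        return "sudo pip install"
    return "sudo pip-3.%d install" % minor_version
```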
    
  • 2020-12-01 08:02

    The most straightforward way would be to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console to point to your script.

    Here's an example I'm using in production:

    s3://mybucket/bootstrap/install_python_modules.sh

    #!/bin/bash -xe
    
    # Non-standard and non-Amazon Machine Image Python modules:
    sudo pip install -U \
      awscli            \
      boto              \
      ciso8601          \
      ujson             \
      workalendar
    
    sudo yum install -y python-psycopg2
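The same bootstrap action can also be attached programmatically instead of through the console. A minimal sketch, assuming boto3 and a hypothetical bucket path, of the `BootstrapActions` structure that boto3's `run_job_flow` expects:

```python
def bootstrap_action(script_s3_path, name="Install Python modules"):
    """Build one entry for the BootstrapActions list passed to
    boto3's emr.run_job_flow (structure per the boto3 EMR API)."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_s3_path, "Args": []},
    }

# Usage with boto3 (not executed here):
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(..., BootstrapActions=[
#       bootstrap_action("s3://mybucket/bootstrap/install_python_modules.sh")])
```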
    
  • 2020-12-01 08:03

    In short, there are two ways to install packages with pip, depending on the EMR release. First you install whatever you need, and then you can run your Spark step. The easiest, on emr-4.0.0 and later, is to use 'command-runner.jar':

    from boto.emr.step import JarStep

    pip_step = JarStep(
        name="Command Runner",
        jar="command-runner.jar",
        action_on_failure="CONTINUE",
        step_args=["sudo", "pip", "install", "arrow"],
    )
    spark_step = JarStep(
        name="Spark with Command Runner",
        jar="command-runner.jar",
        action_on_failure="CONTINUE",
        step_args=["spark-submit", "/usr/lib/spark/examples/src/main/python/pi.py"],
    )
    step_list = conn.add_jobflow_steps(emr.jobflowid, [pip_step, spark_step])
    

    On the 2.x and 3.x release series, you use script-runner.jar in a similar fashion, except that you have to specify the full S3 URI for script-runner.jar.
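For what it's worth, the legacy boto code above maps onto boto3 as plain step dictionaries. A sketch (the boto3 client and cluster id are assumptions, not part of the original answer) of the same two command-runner.jar steps:

```python
def command_runner_step(name, args, action_on_failure="CONTINUE"):
    """Build a step dict for boto3's emr.add_job_flow_steps,
    running an arbitrary command through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(args)},
    }

pip_step = command_runner_step(
    "Install arrow", ["sudo", "pip", "install", "arrow"])
spark_step = command_runner_step(
    "Spark Pi example",
    ["spark-submit", "/usr/lib/spark/examples/src/main/python/pi.py"])

# Usage with boto3 (not executed here):
#   boto3.client("emr").add_job_flow_steps(
#       JobFlowId=cluster_id, Steps=[pip_step, spark_step])
```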

    EDIT: Sorry, I didn't see that you wanted to do this through the console. You can add the same steps in the console as well. The first step would be a Custom JAR with the same args as above. The second step is a Spark step. Hope this helps!

  • 2020-12-01 08:11

    This post got me started down the right path, but ultimately I ended up with a different solution.

    bootstrap.sh

    #!/bin/bash
    
    sudo python3 -m pip install \
        botocore \
        boto3 \
        ujson \
        warcio \
        beautifulsoup4  \
        lxml
    

    create_emr_cluster.sh

    #!/bin/bash
    
    pem_file="~/.ssh/<your pem file>.pem"
    bootstrap_path="s3://<path without filename>/"
    subnet_id="subnet-<unique subnet id>"
    logs_path="s3://<log directory (optional)>/elasticmapreduce/"
    
    aws s3 cp ./bootstrap.sh $bootstrap_path
    
    ID=$(aws emr create-cluster \
    --name spark-data-processing \
    --use-default-roles \
    --release-label emr-5.30.1 \
    --instance-count 2 \
    --applications Name=Spark Name=Hive Name=Ganglia Name=Zeppelin \
    --ec2-attributes KeyName=<your key pair name>,SubnetId=$subnet_id \
    --instance-type m4.large \
    --bootstrap-actions Path=${bootstrap_path}bootstrap.sh \
    --query ClusterId \
    --output text \
    --log-uri ${logs_path})
    

