Question
I am spinning up an EMR cluster in AWS. The difficulty arises when using Jupyter to import the associated Python modules. I have a shell script that executes when the cluster starts and installs the Python modules.
The notebook is set to run using the PySpark Kernel.
I believe the problem is that the Jupyter notebook is not pointed to the correct Python in EMR. The methods I have used to set the notebook to the correct version do not seem to work.
I have set the following configuration. I have tried changing "python" to "python3.6" and "python3":
Configurations=[{
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [{
        "Classification": "export",
        "Properties": {
            "PYSPARK_PYTHON": "python",
            "PYSPARK_DRIVER_PYTHON": "python",
            "SPARK_YARN_USER_ENV": "python"
        }
    }]
}]
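For reference, this list sits inside the boto3 run_job_flow call roughly as below; the release label, instance settings, and roles here are placeholders, not my actual values:

import boto3

emr = boto3.client("emr")

# Sketch only: placeholder cluster settings around the Configurations list above.
response = emr.run_job_flow(
    Name="emr-jupyter-test",            # hypothetical cluster name
    ReleaseLabel="emr-5.26.0",          # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "InstanceCount": 1,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles assumed
    ServiceRole="EMR_DefaultRole",
    Configurations=[{
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [{
            "Classification": "export",
            "Properties": {
                "PYSPARK_PYTHON": "python3.6",
                "PYSPARK_DRIVER_PYTHON": "python3.6"
            }
        }]
    }],
)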
I am certain that my shell script is installing the modules, because when I run the following on the EMR command line (via SSH) it works:
python3.6
import boto3
However when I run the following, it does not work:
python
import boto3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named boto3
When I run the following command in Jupyter I get the output below:
import sys
import os
print(sys.version)
2.7.16 (default, Jul 19 2019, 22:59:28) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
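sys.version only shows the version string; printing sys.executable gives the full path of the interpreter the kernel is bound to, which makes the mismatch more obvious:

import sys
print(sys.executable)  # prints something like /usr/bin/python2.7 instead of the python3.6 I want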
My bootstrap shell script is:
#!/bin/bash
alias python=python3.6
export PYSPARK_DRIVER_PYTHON="python"
export SPARK_YARN_USER_ENV="python"
sudo python3 -m pip install boto3
sudo python3 -m pip install pandas
sudo python3 -m pip install pymysql
sudo python3 -m pip install xlrd
sudo python3 -m pip install pymssql
When I attempt to import boto3 in Jupyter, I get this error:
Traceback (most recent call last):
ImportError: No module named boto3
Answer 1:
If you want to use Python3 with EMR notebooks, the recommended way is to use the pyspark kernel and configure Spark to use Python3 from within the notebook, as follows:
%%configure -f {"conf":{ "spark.pyspark.python": "python3" }}
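The -f flag forces sparkmagic to drop and recreate the Livy session, so the setting takes effect for the cells that follow. A quick sanity check from a regular cell:

import sys
print(sys.version)  # should now report a 3.x interpreter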
Note that:
Any on-cluster configuration related to PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON is overridden by the EMR notebook configuration. The only way to configure Python3 is from within the notebook, as shown above.
The pyspark3 kernel is deprecated for Livy 0.5+, and henceforth the pyspark kernel is recommended for both Python2 and Python3, by setting spark.pyspark.python accordingly.
If you want to install additional Python dependencies that are not already present on the cluster, you can use notebook-scoped libraries. This works for both Python2 and Python3; see the sketch below.
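For example, on EMR release 5.26.0 and later, the pyspark kernel exposes install_pypi_package on the SparkContext. A minimal sketch for the boto3 case from the question:

sc.install_pypi_package("boto3")  # installs into a virtualenv scoped to this notebook session
sc.list_packages()                # confirm boto3 now appears in the list

import boto3
print(boto3.__version__)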
Source: https://stackoverflow.com/questions/57512577/how-to-set-jupyter-notebook-to-python3-instead-of-python2-7-in-aws-emr