AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

醉梦人生 2020-11-30 13:09

I am running into this problem with the Apache Arrow Spark integration.

Using AWS EMR with Spark 2.4.3.

I have tested this on both a local single-machine Spark instance and the EMR cluster.

2 Answers
  • 2020-11-30 14:07

    There are two options in your case:

    One is to make sure the Python environment is correct on every machine:

    • Set PYSPARK_PYTHON to a Python interpreter that has third-party modules such as pyarrow installed. You can run type -a python to check how many Python interpreters there are on your worker nodes.

    • If the Python interpreter path is the same on every node, you can set PYSPARK_PYTHON in spark-env.sh and then copy that file to every other node; see the sketch after this list. Read this for more: https://spark.apache.org/docs/2.4.0/spark-standalone.html
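
    A minimal sketch of the first option, assuming a standalone cluster where the interpreter with pyarrow installed is /usr/bin/python3 on every node, Spark is installed at the same $SPARK_HOME everywhere, and worker1/worker2 are placeholder hostnames:

    # Point Spark at the interpreter that has pyarrow installed.
    echo 'export PYSPARK_PYTHON=/usr/bin/python3' >> "$SPARK_HOME/conf/spark-env.sh"

    # Push the setting to the other nodes (hostnames are placeholders).
    for host in worker1 worker2; do
        scp "$SPARK_HOME/conf/spark-env.sh" "$host:$SPARK_HOME/conf/"
    done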

    The other option is to pass your dependency to spark-submit:

    • You have to package your extra module into a zip or egg file first.

    • Then run spark-submit --py-files pyarrow.zip your_code.py; see the sketch after this list. This way, Spark will ship your module to every other node automatically. https://spark.apache.org/docs/latest/submitting-applications.html
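
    A sketch of the zip route; your_code.py is the placeholder from the line above, and the site-packages path is looked up dynamically:

    # Zip the installed pyarrow package out of site-packages.
    cd "$(python3 -c 'import pyarrow, os; print(os.path.dirname(os.path.dirname(pyarrow.__file__)))')"
    zip -r /tmp/pyarrow.zip pyarrow/

    # Ship the archive with the job; Spark adds it to PYTHONPATH on the executors.
    spark-submit --py-files /tmp/pyarrow.zip your_code.py

    One caveat: pyarrow ships compiled C extensions, which Python cannot import directly from a zip file, so for this particular module the first option (installing it on every node) is usually the more reliable route.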

    I hope this helps.

  • 2020-11-30 14:09

    On EMR, Python 3 is not the default; you have to make that explicit. One way to do it is to pass a config.json file when creating the cluster. It's available in the Edit software settings section of the AWS EMR UI. A sample JSON file looks something like this:

    [
      {
        "Classification": "spark-env",
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "PYSPARK_PYTHON": "/usr/bin/python3"
            }
          }
        ]
      },
      {
        "Classification": "yarn-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "PYSPARK_PYTHON": "/usr/bin/python3"
            }
          }
        ]
      }
    ]
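
    If you create the cluster from the command line instead of the UI, the same file can be passed via the AWS CLI's --configurations flag. A sketch, where the cluster name, release label, and instance settings are only illustrative placeholders:

    aws emr create-cluster \
        --name "spark-pyarrow" \
        --release-label emr-5.25.0 \
        --applications Name=Spark \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --configurations file://config.json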
    

    You also need the pyarrow module installed on all core nodes, not only on the master. For that, you can use a bootstrap script when creating the cluster in AWS. Again, a sample bootstrap script can be as simple as this:

    #!/bin/bash
    # Runs on every node during cluster provisioning; installs pyarrow for Python 3.
    sudo python3 -m pip install pyarrow==0.13.0
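
    The script has to live in S3 and be referenced as a bootstrap action when the cluster is created. A sketch, assuming the script was saved as install-pyarrow.sh and your-bucket is a placeholder bucket name:

    # Upload the script to S3 (bucket name is a placeholder).
    aws s3 cp install-pyarrow.sh s3://your-bucket/install-pyarrow.sh

    # Reference it alongside the configuration file from above.
    aws emr create-cluster \
        --release-label emr-5.25.0 \
        --applications Name=Spark \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --configurations file://config.json \
        --bootstrap-actions Path=s3://your-bucket/install-pyarrow.sh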
    