Question
I am running into this problem with the Apache Arrow Spark integration.
I am using AWS EMR with Spark 2.4.3.
I tested the same code on a local single-machine Spark instance and on a Cloudera cluster, and everything works fine there.
I set these in spark-env.sh:
export PYSPARK_PYTHON=python3
export PYSPARK_PYTHON_DRIVER=python3
I confirmed this in the spark shell:
spark.version
2.4.3
sc.pythonExec
python3
sc.pythonVer
python3
Running a basic pandas_udf with the Apache Arrow integration results in an error:
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()
The error on AWS EMR (it does not occur on Cloudera or on my local machine):
ModuleNotFoundError: No module named 'pyarrow'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:291)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:283)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Does anyone have an idea what is going on? Some possible ideas:
Could PYTHONPATH be causing a problem because I am not using Anaconda?
Does it have to do with the Spark version and Arrow version?
This is the strangest thing, because I am using the same versions across all 3 platforms [local desktop, Cloudera, EMR] and only EMR is not working.
I logged into all 4 EMR EC2 data nodes and tested that I can import pyarrow; it works fine there, but not when it is used through Spark.
# test
import numpy as np
import pandas as pd
import pyarrow as pa
df = pd.DataFrame(
    {'one': [20, np.nan, 2.5],
     'two': ['january', 'february', 'march'],
     'three': [True, False, True]},
    index=list('abc'))
table = pa.Table.from_pandas(df)
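One way I could narrow this down further is a small diagnostic job run from pyspark, to see which interpreter the executor-side Python workers actually launch and whether pyarrow is importable there. This is only a troubleshooting sketch; the app name is arbitrary.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-env-check").getOrCreate()
sc = spark.sparkContext

def describe_worker(_):
    # runs inside the executor's Python worker process
    import sys
    try:
        import pyarrow
        arrow_version = pyarrow.__version__
    except ImportError:
        arrow_version = "NOT FOUND"
    yield (sys.executable, arrow_version)

# one small partition per executor slot; collect the distinct answers
num_tasks = sc.defaultParallelism
results = set(sc.parallelize(range(num_tasks), num_tasks)
                .mapPartitions(describe_worker)
                .collect())
print(sorted(results))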
Answer 1:
On EMR, python3 is not resolved by default; you have to make it explicit. One way to do that is to pass a config.json file when you create the cluster. It's available in the Edit software settings section of the AWS EMR UI. A sample JSON file looks something like this:
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  },
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
You also need to have the pyarrow module installed on all core nodes, not only on the master. For that you can use a bootstrap script while creating the cluster in AWS. Again, a sample bootstrap script can be as simple as this:
#!/bin/bash
sudo python3 -m pip install pyarrow==0.13.0
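If you create the cluster programmatically rather than through the UI, both pieces can be wired together with boto3. The sketch below assumes the JSON above has been saved as config.json, the bootstrap script has been uploaded to S3 first, and the bucket, region, instance types and release label are placeholders.

import json
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# the spark-env / yarn-env classifications shown above, saved locally as config.json
with open("config.json") as f:
    configurations = json.load(f)

response = emr.run_job_flow(
    Name="spark-pyarrow-cluster",
    ReleaseLabel="emr-5.27.0",  # illustrative; pick the release that ships your Spark version
    Applications=[{"Name": "Spark"}],
    Configurations=configurations,
    BootstrapActions=[{
        "Name": "install pyarrow on every node",
        # the bootstrap script shown above, uploaded to S3 beforehand
        "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install_pyarrow.sh"},
    }],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])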
Answer 2:
There are two options in your case.
One is to make sure the Python environment is correct on every machine: set PYSPARK_PYTHON to a Python interpreter that has the third-party module such as pyarrow installed. You can use type -a python to check how many Python interpreters there are on your worker nodes. If the interpreter path is the same on every node, you can set PYSPARK_PYTHON in spark-env.sh and then copy it to every other node. Read this for more: https://spark.apache.org/docs/2.4.0/spark-standalone.html
The other option is to add an argument to spark-submit: you have to package your extra module into a zip or egg file first, then run spark-submit --py-files pyarrow.zip your_code.py. That way Spark will ship your module to every other node automatically. See https://spark.apache.org/docs/latest/submitting-applications.html
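A programmatic counterpart to --py-files is SparkContext.addPyFile; a minimal sketch with a hypothetical deps.zip is below. Note that shipping dependencies as a zip works best for pure-Python packages; pyarrow contains compiled extensions, so installing it on the nodes (as in Answer 1) is generally the more reliable route.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py-files-example").getOrCreate()

# deps.zip is a hypothetical archive (e.g. built with `zip -r deps.zip mymodule/`)
# sitting on the driver; Spark copies it to every executor and puts it on sys.path
spark.sparkContext.addPyFile("/home/hadoop/deps.zip")

# after this, `import mymodule` works inside UDFs and RDD functions on the executors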
I hope this helps.
Source: https://stackoverflow.com/questions/57315030/aws-emr-modulenotfounderror-no-module-named-pyarrow