Question
My PySpark version is 2.2.0. I ran into a strange problem, which I have tried to simplify as follows. The file structure:
|root
|-- cast_to_float.py
|-- tests
|   |-- test.py
In cast_to_float.py, my code:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

def cast_to_float(y, column_name):
    return y.withColumn(column_name, y[column_name].cast(FloatType()))

def cast_to_float_1(y, column_name):
    to_float = udf(cast2float1, FloatType())
    return y.withColumn(column_name, to_float(column_name))

def cast2float1(a):
    return 1.0
In test.py:
from pyspark.sql import SparkSession
import os
import sys

parentPath = os.path.abspath('..')
if parentPath not in sys.path:
    sys.path.insert(0, parentPath)

from cast_to_float import *

spark = SparkSession.builder.appName("tests").getOrCreate()

df = spark.createDataFrame([
    (1, 1),
    (2, 2),
    (3, 3),
], ["ID", "VALUE"])

df1 = cast_to_float(df, 'ID')
df2 = cast_to_float_1(df, 'ID')

df1.show()
df1.printSchema()
df2.printSchema()
df2.show()
Then I run the test in the tests folder, and I get an error message from the last line:
+---+-----+
| ID|VALUE|
+---+-----+
|1.0| 1|
|2.0| 2|
|3.0| 3|
+---+-----+
root
|-- ID: float (nullable = true)
|-- VALUE: long (nullable = true)
root
|-- ID: float (nullable = true)
|-- VALUE: long (nullable = true)
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-86eb5df2f917> in <module>()
19 df1.printSchema()
20 df2.printSchema()
---> 21 df2.show()
...
Py4JJavaError: An error occurred while calling o257.showString.
...
ModuleNotFoundError: No module named 'cast_to_float'
...
It seems cast_to_float is imported; otherwise, I couldn't even get df1.
If I put test.py in the same directory as cast_to_float.py and run it in that directory, then it's OK. Any ideas? Thanks!
I used @user8371915's __file__ method, and found it's OK if I run it in the root folder.
Answer 1:
As it is right now, the result will depend on the working directory from which you invoke the script.
If you're in root, this will add root's parent directory, not root itself. You should use a path relative to __file__ (see What does the __file__ variable mean/do?):
parentPath = os.path.join(
    os.path.abspath(os.path.dirname(__file__)),
    os.path.pardir
)
That said, I'd recommend using a proper package structure instead.
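For example, something along these lines (a sketch; setup.py and the mypackage name are illustrative, not from the question):

|root
|-- setup.py
|-- mypackage
|   |-- __init__.py
|   |-- cast_to_float.py
|-- tests
|   |-- test.py

After an editable install (pip install -e . in root), test.py can do from mypackage.cast_to_float import * from any working directory, with no sys.path manipulation.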
Note:
This covers only local mode and the driver path; even in local mode, worker paths are not affected by the driver path. (This is also why df1.show() works while df2.show() fails: cast is evaluated as a JVM column expression, but the Python UDF must be unpickled on the executors, which requires importing cast_to_float there.)
To handle executor paths (after the changes above, you'll still get executor exceptions), you should distribute the module to the workers; see How to use custom classes with Apache Spark (pyspark)?:
spark = SparkSession.builder.appName("tests").getOrCreate()
spark.sparkContext.addPyFile("/path/to/cast_to_float.py")
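If you launch through spark-submit, the equivalent is to ship the module at submit time (paths here are placeholders):

spark-submit --py-files /path/to/cast_to_float.py tests/test.py

Either way the file is distributed to the executors, so the UDF created in cast_to_float_1 can be unpickled there and df2.show() succeeds.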
Source: https://stackoverflow.com/questions/48504849/pyspark-an-error-occurred-while-calling-o51-showstring-no-module-named-xxx