Question
My PySpark version is 2.2.0. I ran into a strange problem, which I have tried to simplify as follows. The file structure:
|root
|-- cast_to_float.py
|-- tests
|   |-- test.py
In cast_to_float.py, my code:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

def cast_to_float(y, column_name):
    return y.withColumn(column_name, y[column_name].cast(FloatType()))

def cast_to_float_1(y, column_name):
    to_float = udf(cast2float1, FloatType())
    return y.withColumn(column_name, to_float(column_name))

def cast2float1(a):
    return 1.0
In test.py:
from pyspark.sql import SparkSession
import os
import sys

parentPath = os.path.abspath('..')
if parentPath not in sys.path:
    sys.path.insert(0, parentPath)

from cast_to_float import *

spark = SparkSession.builder.appName("tests").getOrCreate()

df = spark.createDataFrame([
    (1, 1),
    (2, 2),
    (3, 3),
], ["ID", "VALUE"])

df1 = cast_to_float(df, 'ID')
df2 = cast_to_float_1(df, 'ID')

df1.show()
df1.printSchema()
df2.printSchema()
df2.show()
Then I run the test in the tests folder, and I get an error message from the last line:
+---+-----+
| ID|VALUE|
+---+-----+
|1.0| 1|
|2.0| 2|
|3.0| 3|
+---+-----+
root
|-- ID: float (nullable = true)
|-- VALUE: long (nullable = true)
root
|-- ID: float (nullable = true)
|-- VALUE: long (nullable = true)
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-86eb5df2f917> in <module>()
19 df1.printSchema()
20 df2.printSchema()
---> 21 df2.show()
...
Py4JJavaError: An error occurred while calling o257.showString.
...
ModuleNotFoundError: No module named 'cast_to_float'
...
It seems cast_to_float is imported; otherwise, I couldn't even get df1.
If I put test.py in the same directory as cast_to_float.py and run it in that directory, then it's OK. Any ideas? Thanks!
I used @user8371915's __file__ method, and found it's OK if I run it in the root folder.
Answer 1:
As it is right now, the result will depend on the working directory from which you invoke the script.
If you're in root, this will add root's parent directory, not root itself. You should use a path relative to __file__ (see What does the __file__ variable mean/do?):
parentPath = os.path.join(
    os.path.abspath(os.path.dirname(__file__)),
    os.path.pardir
)
That said, I'd recommend using a proper package structure instead.
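For example, something along these lines (a sketch; setup.py and the mypackage name are illustrative, not from the question):

|root
|-- setup.py
|-- mypackage
|   |-- __init__.py
|   |-- cast_to_float.py
|-- tests
|   |-- test.py

After an editable install (pip install -e . in root), test.py can do from mypackage.cast_to_float import * from any working directory, with no sys.path manipulation.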
Note:
This covers only local mode and the driver path; even in local mode, worker paths are not affected by the driver path. (This is also why df1.show() works while df2.show() fails: cast is evaluated as a JVM column expression, but the Python UDF must be unpickled on the executors, which requires importing cast_to_float there.)
To handle executor paths (after the changes above, you'll still get executor exceptions), you should distribute the module to the workers; see How to use custom classes with Apache Spark (pyspark)?:
spark = SparkSession.builder.appName("tests").getOrCreate()
spark.sparkContext.addPyFile("/path/to/cast_to_float.py")
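If you launch through spark-submit, the equivalent is to ship the module at submit time (paths here are placeholders):

spark-submit --py-files /path/to/cast_to_float.py tests/test.py

Either way the file is distributed to the executors, so the UDF created in cast_to_float_1 can be unpickled there and df2.show() succeeds.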
Source: https://stackoverflow.com/questions/48504849/pyspark-an-error-occurred-while-calling-o51-showstring-no-module-named-xxx