Calling another custom Python function from Pyspark UDF

点点圈 submitted on 2020-05-15 02:51:04


Suppose you have a file, imported later as the module udfs, and in it:

def nested_f(x):
    return x + 1

def main_f(x):
    return nested_f(x) + 1

You then want to make a UDF out of the main_f function and run it on a dataframe:

import pyspark.sql.functions as fn
import pandas as pd

pdf = pd.DataFrame([[1], [2], [3]], columns=['x'])
df = spark.createDataFrame(pdf)

_udf = fn.udf(main_f, 'int')
df.withColumn('x1', _udf(df['x'])).show()

This works OK if we do this from within the same file as where the two functions are defined. However, trying to do this from a different file produces the error ModuleNotFoundError: No module named ...:

import udfs

_udf = fn.udf(udfs.main_f, 'int')
df.withColumn('x1', _udf(df['x'])).show()

I noticed that if I actually nest the nested_f inside the main_f like this:

def main_f(x):
    def nested_f(x):
        return x + 1

    return nested_f(x) + 1

everything runs OK. However, my goal here is to have the logic nicely separated in multiple functions, which I can also test individually.

I think this can be solved by submitting the file (or a whole zipped folder) to the executors using spark.sparkContext.addPyFile(''). However:

  1. I find this a bit long-winded (esp. if you need to zip folders etc.)
  2. This is not always easy/possible (e.g. the file may be using lots of other modules which then also need to be submitted, leading to a bit of a chain reaction)
  3. There are some other inconveniences with addPyFile (e.g. autoreload can stop working)

So the question is: is there a way to do all of these at the same time:

  • have the logic of the UDF nicely split to several Python functions
  • use the UDF from a different file than where the logic is defined
  • not needing to submit any dependencies using addPyFile

Bonus points for clarifying how this works/why this doesn't work!


For small dependencies (one or two local files) you can pass them to --py-files and enumerate them; for something bigger, or with more dependencies, it's better to pack everything into a zip or egg file.


def my_function(*args, **kwargs):
    # code
    pass  # placeholder so the stub is valid Python


from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from udfs import my_function

sc = SparkContext()
spark = SparkSession(sc)
my_udf = udf(my_function)

df = spark.createDataFrame([(1, "a"), (2, "b")])
df.withColumn("my_f", my_udf("..."))

To run:

pyspark --py-files /path/to/
# or
spark-submit --py-files /path/to/

If you have written your own Python module, or need third-party modules that don't require C compilation (I personally needed this with geoip2), it's better to create a zip or egg file.

# pip with -t installs all modules and their dependencies into the directory `src`
pip install geoip2 -t ./src
# Or from a local directory
pip install ./my_module -t ./src

# Best is
pip install -r requirements.txt -t ./src

# If you need to add some additional files
cp ./some_scripts/* ./src/

# And pack it
cd ./src
zip -r ../ .
cd ..

pyspark --py-files
spark-submit --py-files
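
If you would rather build the archive from Python than from the shell, the packing step could be sketched roughly like this. It is only an illustration under assumptions: zip_directory and ship_to_executors are hypothetical helper names, the directory layout and the archive name deps.zip are made up for the demo, and ship_to_executors assumes you already have a running SparkSession. The real APIs used are zipfile, os.walk and SparkContext.addPyFile.

```python
import os
import tempfile
import zipfile


def zip_directory(src_dir, zip_path):
    """Zip the contents of src_dir so its modules sit at the root of the archive.

    Keep zip_path outside src_dir, otherwise the archive would try to include itself.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to src_dir (same effect as `cd src; zip -r`)
                zf.write(full_path, os.path.relpath(full_path, src_dir))
    return zip_path


def ship_to_executors(spark, zip_path):
    """Hypothetical helper: hand the archive to an existing SparkSession."""
    spark.sparkContext.addPyFile(zip_path)


# Local demo that does not need Spark: build a tiny src tree and pack it.
work_dir = tempfile.mkdtemp()
src_dir = os.path.join(work_dir, "src")
os.makedirs(os.path.join(src_dir, "my_module"))
with open(os.path.join(src_dir, "udfs.py"), "w") as fh:
    fh.write("def main_f(x):\n    return x + 2\n")
with open(os.path.join(src_dir, "my_module", "__init__.py"), "w") as fh:
    fh.write("")

archive = zip_directory(src_dir, os.path.join(work_dir, "deps.zip"))
with zipfile.ZipFile(archive) as zf:
    names = sorted(zf.namelist())
print(names)
```

Storing the paths relative to the source directory matters: the modules have to sit at the top level of the archive to be importable by name once the zip is on the import path.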

Be careful when using pyspark --master yarn (and possibly other non-local master options): in the pyspark shell started with --py-files, you may have to put the file on sys.path yourself before importing it:

>>> import sys
>>> sys.path.insert(0, '/path/to/')  # You can use relative path: .insert(0, '')
>>> import MyModule
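
The sys.path trick relies on the fact that Python can import modules directly from a zip archive that is listed on sys.path. A minimal local sketch of that mechanism, without Spark, might look like the following. The archive name deps.zip and the module name demo_udfs are made up for this illustration.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a small archive with one module at its root (names are illustrative).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "deps.zip")
module_source = (
    "def nested_f(x):\n"
    "    return x + 1\n"
    "\n"
    "def main_f(x):\n"
    "    return nested_f(x) + 1\n"
)
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_udfs.py", module_source)

# Put the archive itself (not its parent directory) on the import path.
sys.path.insert(0, zip_path)
importlib.invalidate_caches()

import demo_udfs  # resolved from inside deps.zip

result = demo_udfs.main_f(1)
print(result)  # 3
```

This only works for pure-Python code, which matches the remark above about modules that need C compilation.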

EDIT - The answer to the question of how to get functions onto the executors without addPyFile() and --py-files:

The file with the functions has to be present on each individual executor and reachable on the Python import path there (e.g. through PYTHONPATH). Therefore, I would probably write a Python module, install it on the executors, and so make it available in their environment.
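
To illustrate why the file has to be importable on the receiving side, here is a small sketch using the standard pickle module. It shows that a function living in a module is serialized as a reference (module name plus function name), not as code, so loading it in a process that cannot import the module fails. PySpark uses cloudpickle rather than plain pickle, and as far as I understand it treats functions from importable modules in a similar by-reference way, so take this as an analogy rather than as the exact Spark code path. The module name demo_udfs_ref is made up for the demo.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Create a throwaway module on disk (hypothetical stand-in for udfs).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "demo_udfs_ref.py"), "w") as fh:
    fh.write("def main_f(x):\n    return x + 2\n")

sys.path.insert(0, module_dir)
importlib.invalidate_caches()
import demo_udfs_ref

# The payload holds the module and function names, not the function body.
payload = pickle.dumps(demo_udfs_ref.main_f)
stores_names_only = b"demo_udfs_ref" in payload and b"main_f" in payload

# Simulate a process (think: an executor) where the file is not available.
sys.path.remove(module_dir)
del sys.modules["demo_udfs_ref"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    error_name = None
except ModuleNotFoundError as exc:
    error_name = type(exc).__name__

print(stores_names_only, error_name)
```

Installing the module on the executors, or shipping it with --py-files, makes that import succeed on the receiving side.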


Maybe try organizing your methods inside a class as follows:

class temp_class:
    def nested_f(self, x):
        return x + 1

    def main_f(self, x):
        return self.nested_f(x) + 1

This may work!!

