Question
I have a problem inside a PySpark UDF, and I want to print the number of the row that generates the problem.
I tried to count the rows using the equivalent of a "static variable" in Python, so that a counter is incremented each time the UDF is called with a new row. However, it is not working:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(input):
    myF.lineNumber += 1            # increment the counter on every call
    if somethingBad:               # placeholder condition from the question
        print(myF.lineNumber)
    return res                     # placeholder result from the question

myF.lineNumber = 0
myF_udf = F.udf(myF, StringType())
How can I count the number of times a UDF is called, so that I can find the row generating the problem in PySpark?
Answer 1:
UDFs are executed on the workers, so print statements inside them won't show up in the output (which comes from the driver). The best way to handle issues with UDFs is to change the return type of the UDF to a struct or a list and pass the error information along with the returned output. In the code below I am just appending the error info to the string res that you were returning originally.
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(input):
    myF.lineNumber += 1            # note: this counter is per worker process, not a global row number
    if somethingBad:               # placeholder condition from the question
        res += ' Error in line {}'.format(myF.lineNumber)
    return res

myF.lineNumber = 0
myF_udf = F.udf(myF, StringType())
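For the struct-return approach mentioned above, a minimal sketch might look like the following. The schema, the field names result and error, and the toy computation value.upper() are illustrative assumptions rather than part of the original answer, and an active SparkSession named spark is assumed:

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Illustrative schema: one field for the result, one for an error message
schema = StructType([
    StructField('result', StringType(), True),
    StructField('error', StringType(), True),
])

def myF_struct(value):
    try:
        return (value.upper(), None)   # stand-in for the real computation
    except Exception as e:
        return (None, str(e))          # ship the error back with the row

myF_struct_udf = F.udf(myF_struct, schema)

df = spark.createDataFrame([('a',), (None,)], ['col'])
out = df.withColumn('out', myF_struct_udf('col'))
out.filter(F.col('out.error').isNotNull()).show()  # problematic rows only

Because the error travels with the row itself, you can filter on it from the driver and see exactly which input values failed, without relying on worker-side print statements or cross-worker counters.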
Source: https://stackoverflow.com/questions/54252682/pyspark-udf-print-row-being-analyzed