Question
I have a problem inside a PySpark UDF, and I want to print the number of the row that generates the problem.
I tried to count the rows using the equivalent of a "static variable" in Python, so that a counter is incremented each time the UDF is called with a new row. However, it is not working:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(input):
    myF.lineNumber += 1            # increment the counter on every call
    if somethingBad:               # placeholder condition from the question
        print(myF.lineNumber)
    return res                     # placeholder result from the question

myF.lineNumber = 0
myF_udf = F.udf(myF, StringType())
How can I count the number of times a UDF is called, so that I can find the row generating the problem in PySpark?
Answer 1:
UDFs are executed on the workers, so print statements inside them won't show up in the output (which comes from the driver). The best way to handle issues with UDFs is to change the return type of the UDF to a struct or a list and pass the error information along with the returned output. In the code below I am just appending the error info to the string res that you were returning originally.
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(input):
    myF.lineNumber += 1            # note: this counter is per worker process, not a global row number
    if somethingBad:               # placeholder condition from the question
        res += ' Error in line {}'.format(myF.lineNumber)
    return res

myF.lineNumber = 0
myF_udf = F.udf(myF, StringType())
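For the struct-return approach mentioned above, a minimal sketch might look like the following. The schema, the field names result and error, and the toy computation value.upper() are illustrative assumptions rather than part of the original answer, and an active SparkSession named spark is assumed:

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Illustrative schema: one field for the result, one for an error message
schema = StructType([
    StructField('result', StringType(), True),
    StructField('error', StringType(), True),
])

def myF_struct(value):
    try:
        return (value.upper(), None)   # stand-in for the real computation
    except Exception as e:
        return (None, str(e))          # ship the error back with the row

myF_struct_udf = F.udf(myF_struct, schema)

df = spark.createDataFrame([('a',), (None,)], ['col'])
out = df.withColumn('out', myF_struct_udf('col'))
out.filter(F.col('out.error').isNotNull()).show()  # problematic rows only

Because the error travels with the row itself, you can filter on it from the driver and see exactly which input values failed, without relying on worker-side print statements or cross-worker counters.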
Source: https://stackoverflow.com/questions/54252682/pyspark-udf-print-row-being-analyzed