Question
Is it possible to use an external library like textdistance inside a pandas_udf? I tried and got this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I have tried with Spark version 2.3.1.
Answer 1:
You can package textdistance together with your own code (use setup.py and bdist_egg to build an egg file) and pass the resulting package with the --py-files option when you run Spark.
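As an alternative to the egg build described above, --py-files also accepts .zip archives. The following is a minimal sketch of that route using only the standard library, assuming the packages are pure Python. The helper name zip_packages and all paths are hypothetical, chosen for illustration.

```python
import os
import zipfile


def zip_packages(site_dir, package_names, out_path):
    """Bundle the named package directories under site_dir into one zip archive."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in package_names:
            pkg_root = os.path.join(site_dir, name)
            for folder, _dirs, files in os.walk(pkg_root):
                for fname in files:
                    # Only Python sources are shipped in this sketch.
                    if fname.endswith(".py"):
                        full = os.path.join(folder, fname)
                        # Store paths relative to site_dir so that
                        # "import <name>" resolves from the archive root.
                        zf.write(full, os.path.relpath(full, site_dir))
    return out_path
```

Something like zip_packages("/path/to/site-packages", ["textdistance"], "deps.zip") followed by spark-submit --py-files deps.zip your_job.py would then ship the archive to the executors (both paths are placeholders).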
By the way, the error message doesn't seem to be related to textdistance at all.
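That ValueError is what pandas raises when a whole Series is used where a single True/False is expected. A scalar pandas_udf receives pandas Series rather than single strings, so a likely cause is that a string-level function was handed the Series directly. A minimal sketch of applying a scalar scorer pair by pair, under some assumptions: difflib from the standard library stands in for textdistance here, and the names ratio, ratio_series and make_pandas_udf are made up for illustration.

```python
import difflib

import pandas as pd


def ratio(a, b):
    # Stand-in scalar scorer from the standard library (an assumption);
    # textdistance.ratcliff_obershelp could be called here instead if the
    # package is available on the executors.
    return difflib.SequenceMatcher(None, a, b).ratio()


def ratio_series(s1, s2):
    # s1 and s2 are pandas Series holding a batch of rows,
    # so the scalar scorer is applied to each pair of values.
    scores = [ratio(a, b) for a, b in zip(s1, s2)]
    return pd.Series(scores, index=s1.index, dtype="float64")


def make_pandas_udf():
    # Imported lazily so the rest of the sketch runs without Spark installed.
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    return pandas_udf(ratio_series, DoubleType())
```

With a DataFrame that has word1 and word2 columns, the wrapped function could then be used as df.withColumn("ro", make_pandas_udf()("word1", "word2")).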
Answer 2:
You can use a Spark UDF, for example to implement the Ratcliff-Obershelp function:
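import textdistance
from pyspark.sql.types import FloatType

def my_ro(s1, s2):
    # Ratcliff-Obershelp score for one pair of strings
    d = textdistance.ratcliff_obershelp(s1, s2)
    # Cast so the returned value matches the declared FloatType
    return float(d)

# Register the function so it can be called from Spark SQL
spark.udf.register("my_ro", my_ro, FloatType())
spark.sql("SELECT word1, word2, my_ro(word1, word2) AS ro FROM spark_df") \
    .show(100, False)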
Source: https://stackoverflow.com/questions/57706352/use-external-library-in-pandas-udf-in-pyspark