Use external library in pandas_udf in pyspark

那年仲夏 posted on 2021-01-28 18:31:40

Question


Is it possible to use an external library like textdistance inside a pandas_udf? I tried and got this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I have tried with Spark version 2.3.1.


Answer 1:


You can package textdistance together with your own code (use setup.py and bdist_egg to build an egg file) and pass the resulting package with the --py-files option when you run Spark.

By the way, the error message doesn't seem to be related to textdistance at all.
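
For what it's worth, this ValueError is what pandas raises when a whole Series ends up in a boolean test. A scalar pandas_udf receives its arguments as pandas Series rather than single strings, so a function written for two strings can trip over it. Here is a minimal sketch of that, under some assumptions: similarity and similarity_series are hypothetical names, and difflib.SequenceMatcher from the standard library stands in for textdistance so the snippet runs with pandas alone.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(s1, s2):
    """Scalar similarity for two plain strings (stand-in for a textdistance call)."""
    if s1 == s2:  # fine for str, ambiguous when s1 and s2 are pandas Series
        return 1.0
    return SequenceMatcher(None, s1, s2).ratio()


def similarity_series(col1, col2):
    """Apply the scalar function row by row and return one Series of scores."""
    return pd.Series(
        [similarity(a, b) for a, b in zip(col1, col2)],
        index=col1.index,
    )


# Hypothetical sample data, shaped like the batches a scalar pandas_udf would receive
words1 = pd.Series(["spark", "pandas", "udf"])
words2 = pd.Series(["spark", "panda", "sql"])

# Passing whole Series to the scalar function reproduces the error from the question
try:
    similarity(words1, words2)
    raised = False
except ValueError:
    raised = True

# Going element by element avoids the ambiguous truth test
scores = similarity_series(words1, words2)
```

With PySpark available, similarity_series is the kind of function you would wrap with pandas_udf(similarity_series, FloatType()); that step is left out here because it needs a running Spark session.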




Answer 2:


You can use a regular Spark UDF instead, for example to compute the Ratcliff-Obershelp similarity:

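import textdistance
from pyspark.sql.types import FloatType

def my_ro(s1, s2):
    # Ratcliff-Obershelp similarity between two plain Python strings
    return float(textdistance.ratcliff_obershelp(s1, s2))

# Register the function so it can be called from Spark SQL
spark.udf.register("my_ro", my_ro, FloatType())

# Assumes spark_df is registered as a table or temp view with columns word1 and word2
spark.sql("SELECT word1, word2, my_ro(word1, word2) AS ro FROM spark_df") \
    .show(100, False)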


Source: https://stackoverflow.com/questions/57706352/use-external-library-in-pandas-udf-in-pyspark
